This has been a concern of mine for a long time. People act like docs and code bases are enough, but it’s obvious when looking up something niche that they aren’t. These models need a lot of input data, and we’re effectively killing the source(s) of new data.
It feels like less Stack Overflow is a narrowing, and that’s kind of where my question comes from. The remaining content for training is the actual authoritative library documentation source material. I’m not sure that’s necessarily bad: it’s certainly less volume, but it’s probably also higher quality.
I don’t know the answer here, but I think the situation is a lot more nuanced than all of the black and white hot takes.
There’s a serious argument that Stack Overflow was, itself, a patch job in a technical environment that lacked good documentation and debug support.
I’d argue the mistake was training on StackExchange to begin with and not using an actual stack of manuals on proper coding written by professionals.
The problem was never having the correct answer but sifting it out of the overall pool of information. When ChatGPT isn’t hallucinating, it does that much better than Stack Exchange.
So what do we train GPT on when Stack Overflow degrades?
Will library docs be enough? Maybe.
SO is already degraded because they didn’t allow new answers, even though the old answers are based on old, deprecated versions and are no longer relevant.