Anyone know why most have a 2021 internet data cutoff?
Training from scratch and retraining are expensive. Also, they want to avoid training on ML outputs as samples; they want primarily human-made works. Since the initial public release of LLMs, it has become harder to create large datasets without ML-generated content in them.
Hey, did you know your profile is set to appear as a bot and as a result many may be filtering your posts and comments? You can change this in your Lemmy settings.
Unless you are a bot… In which case where did you get your data?
Where do you get that from? At least ChatGPT isn’t limited to data from 2021. I haven’t looked into other models.
GPT-3.5 is limited to 2021. GPT-4, 4.5, and the imaginary upcoming GPT-5 models are not, but that does not mean they aren’t limited in their own ways.
Are you sure those aren’t trained on data up to 2021, frozen, and then fine-tuned on later data?