Who needs humans when you have AI :p
Why would they? There’s plenty of non-AI-generated material to train them off of and it’s something that future trainers will watch out for.
Sure, there may be a lot, but it’s still finite. And social media is already filling up with AI-generated content. If the trend continues, human-generated content will be dwarfed by AI-generated content. And it’s not going to be simple to tell the two apart.
Infinite training data isn’t required.
It’s actually fine to include some AI-generated data in your training set. “Model collapse” happens when you train only on AI-generated content and end up losing some of the less-common outputs. Without those less-common cases in the training data, each generation of AI has less diverse information to learn from. If you make sure the training set is diverse enough, it should be fine.
If all else fails, just make sure a lot of your data is from before 2023.
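To make the "losing the less-common outputs" part concrete, here is a toy sketch. It is only my illustration, not how real LLM training works: the "model" is just a frequency table over made-up tokens, and every name and number in it (`simulate`, `vocab_size`, `real_fraction`, and so on) is an assumption picked for the demo. Each generation is re-estimated from samples of the previous one, optionally mixed with fresh samples from the original "human" distribution.

```python
import random
from collections import Counter


def simulate(generations, vocab_size=200, samples=500, real_fraction=0.0, seed=0):
    """Return the number of distinct tokens each generation can still produce.

    Toy model (an assumption for illustration): a "model" is a table of token
    counts, and "training" means counting tokens in the training set.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Zipf-like "human" distribution: a few common tokens and a long rare tail.
    real_weights = [1.0 / (rank + 1) for rank in tokens]
    weights = real_weights[:]  # generation 0 learns from human data
    support = []
    for _ in range(generations):
        n_real = int(samples * real_fraction)
        n_synth = samples - n_real
        # Training set: output of the previous model plus optional fresh human data.
        data = rng.choices(tokens, weights=weights, k=n_synth)
        data += rng.choices(tokens, weights=real_weights, k=n_real)
        counts = Counter(data)
        # "Train" the next model: its weights are the observed counts.
        weights = [counts.get(t, 0) for t in tokens]
        support.append(sum(1 for w in weights if w > 0))
    return support


if __name__ == "__main__":
    pure = simulate(50, real_fraction=0.0)   # trained only on model output
    mixed = simulate(50, real_fraction=0.5)  # half fresh human data each round
    print("synthetic only, distinct tokens per generation:", pure[::10])
    print("50% human data, distinct tokens per generation:", mixed[::10])
```

In the synthetic-only run a token that drops to zero count can never come back, so the number of distinct tokens can only shrink. Mixing in human data keeps reintroducing the rare ones, which is the "keep the training set diverse" point in miniature.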
I think you misunderstand the problem. Sure, it starts with small amounts of output fed back into the input, but as the models keep generating large amounts of output, over time more and more of that output makes it into the input.
And again, limiting LLMs to pre-2023 training data ensures they never get smarter. Human knowledge keeps expanding, while the LLMs are at best locked into 2023 knowledge.