OpenAI just admitted it can’t identify AI-generated text. That’s bad for the internet and it could be really bad for AI models.::In January, OpenAI launched a system for identifying AI-generated text. This month, the company scrapped it.

6 points

Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to pick out text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff, but they don’t train AI on super-recent text anyway.

28 points

This is not the case. Model collapse is a studied phenomenon for LLMs: quality deteriorates when models are trained on their own output. It might not be an issue if there were thousands of models out there, but there are only 3-5 base models that all the others are derivatives of, IIRC.
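
A toy sketch of that effect, not an LLM and not taken from any paper: each "model" here is only the empirical token frequency of text sampled from the previous model. The vocabulary, its Zipf-like weights, and the function name `resample_generations` are made up for illustration. Once a rare token fails to appear in one generation's sample it can never come back, so the distinct-token count can only shrink or stay the same.

```python
import random
from collections import Counter


def resample_generations(vocab_weights, sample_size, generations, seed=0):
    """Toy loop: refit token frequencies on the previous model's own samples.

    Returns the number of distinct tokens alive at each generation,
    starting with the original vocabulary size.
    """
    rng = random.Random(seed)
    tokens = list(vocab_weights)
    weights = [vocab_weights[t] for t in tokens]
    support_sizes = [len(tokens)]
    for _ in range(generations):
        # "Generate text" from the current model.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        # "Train" the next model: its distribution is just the sample counts.
        counts = Counter(sample)
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        support_sizes.append(len(tokens))
    return support_sizes


# Hypothetical vocabulary: token i has weight 1/(i+1), so most tokens are rare.
vocab = {f"tok{i}": 1.0 / (i + 1) for i in range(200)}
sizes = resample_generations(vocab, sample_size=300, generations=20)
print(sizes)
```

The exact numbers depend on the seed, but the list never increases. That is the "rare stuff disappears first" part of the effect.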

1 point

I don’t see how that affects my point.

  • Today’s AI detectors can’t tell today’s LLM output apart from human text.
  • Future AI detectors WILL be able to tell today’s LLM output apart from human text.
  • Of course, future AI detectors won’t be able to do the same for future LLM output.

So at any point in time, only recent text could be “contaminated”. The claim that “all text after 2023 is forever contaminated” just isn’t true. Researchers would simply have to be a bit more careful about including it.

13 points

Your assertion that a future AI detector will be able to detect current LLM output is dubious. If I give you the sentence “Yesterday I went to the shop and bought some milk and eggs.”, there is no way for you or any detection system to tell whether it was AI generated with any significant degree of certainty. What can be done is statistical analysis of large data sets to see how they “smell”, but saying that around 30% of a dataset is likely LLM generated does not get you very far in creating a training set.
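
To put rough numbers on that, here is a back-of-the-envelope sketch. The 30% prevalence is the figure from above; the detector's 80% true-positive rate and 10% false-positive rate are invented for illustration and do not describe any real detector.

```python
def filter_outcome(prevalence, tpr, fpr):
    """Return (share of flagged text that is really AI,
               share of kept text that is still AI).

    prevalence: fraction of the dataset that is AI-generated
    tpr: fraction of AI text the detector flags (hypothetical)
    fpr: fraction of human text the detector wrongly flags (hypothetical)
    """
    flagged_ai = prevalence * tpr
    flagged_human = (1 - prevalence) * fpr
    kept_ai = prevalence * (1 - tpr)
    kept_human = (1 - prevalence) * (1 - fpr)
    precision = flagged_ai / (flagged_ai + flagged_human)
    residual = kept_ai / (kept_ai + kept_human)
    return precision, residual


precision, residual = filter_outcome(prevalence=0.30, tpr=0.80, fpr=0.10)
print(f"flagged text that is really AI: {precision:.1%}")
print(f"kept text that is still AI:     {residual:.1%}")
```

With these made-up rates, about 77% of the flagged text is actually AI generated and about 9% of what you keep is still AI generated, so the filter throws away human text and leaves contamination behind.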

I’m not saying that there is no solution to this problem, but blithely waving away the problem saying future AI will be able to spot old AI is not a serious take.

1 point

There is not enough entropy in text to even detect current model output. It’s game over.

1 point

No, they won’t. We have already built the models that we have already built. Any current works in progress are the future AI you are talking about. And we just can’t do it. OpenAI themselves have admitted that the detectors they tried making just didn’t work. And they won’t, because language is not just the statistical correlations between words that have already been written in the past.

1 point

People still tap into the real world, while AI does not do that yet. Once AI is able to actively learn from real-world sensors, the problem might disappear, no?

1 point

They already do. Where do you think the training corpus comes from? The real world. It’s curated by humans and then fed to the ML system.

The problem is that the real world now has a bunch of text generated by AI. And it has been well studied that feeding that back into the training will destroy your model (because the networks would then effectively be trained to predict their own output, which just doesn’t make sense).

So humans still need to filter that stuff out of the training corpus. But we can’t detect which ones are real and which ones are fake. And neither can a machine. So there’s no way to do this properly.

The data almost always comes from the real world, except now the real world also contains “harmful” (to AI) data that we can’t figure out how to find and remove.

