Reddit has a new AI training deal to sell user content
Reddit has reportedly made a deal with an unnamed AI company to allow access to its platform’s content for the purposes of AI model training.
FUCK REDDIT! FUCK U/SPEZ! The Red-exit shall endure, VIVA LA LEMMY!!
When spez took away API access, he basically shit on the social contract that offered a fair exchange of free access for the content we fed into reddit. After the API change, there were new terms: there is no contract. There are no terms. If you use reddit now, you are giving away everything you are to be indexed and mangled by statistics. You exist as free labor to statisticians and machines.
You are more than a few cents of bad memes.
I’m going to make the request in the AM that Lemmy should add robots.txt rules to disallow AI crawlers, to at least indicate we’re not interested. We need legislation that tells scrapers what they can access.
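For anyone wondering what such rules could look like, here is a rough sketch. It assumes the crawlers to block are `GPTBot` (OpenAI) and `CCBot` (Common Crawl). Those two are only examples, and the `lemmy.example` URL is made up. The Python part just uses the standard library's `urllib.robotparser` to check that the rules say what we mean. Keep in mind robots.txt is only a request: it signals we're not interested, it doesn't stop a scraper that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body. The two user agents are assumptions for
# illustration; an instance would list whichever crawlers it cares about.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # lemmy.example is a hypothetical instance, not a real one.
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, can_fetch(agent, "https://lemmy.example/post/1"))
```

The catch-all `User-agent: *` block keeps everything else (regular browsers, search engines) allowed, so only the named crawlers are asked to stay out.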
I’d be very surprised if people weren’t already scraping Reddit for this.
It’s all but guaranteed. Reminds me of this Computerphile video: https://youtu.be/WO2X3oZEJOA?t=874 TL;DW: there were “glitch tokens” in GPT (and therefore ChatGPT) which undeniably came from Reddit usernames.
Note, there’s no proof that these reddit usernames were in the training data (and there are even reasons to assume that they weren’t, watch the video for context) but there’s no doubt that OpenAI had already scraped reddit data at some point prior to training, probably mixed in with all the rest of their text data. I see no reason to assume they completely removed all reddit text before training. The video suggests reasons and evidence that they removed certain subreddits, not all of reddit.
I mean, there’s /r/SubSimulatorGPT2 that’s been running for years… Although that one was at least hilarious to read because at that stage the AI was in the sweet spot of being simultaneously coherent while making total lapses in logic.
That was the real reason for the API changes last year, apps just got caught in the crossfire.
Can’t wait for chatGPT to call me good sir and tell me I win the internet.
For anyone looking for a gibberish generator to replace their Reddit content with, here’s one. This shit is like poison for those large models.
For automatic editing, I’m not sure what people can use nowadays; just before the APIcalypse I used Power Delete Suite, but I’m not sure if it still works and I’m not creating a Reddit account just to test it out.
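If you'd rather roll your own than trust a third-party site, a gibberish generator can be tiny. This is only a sketch under my own assumptions: the word list and the `gibberish` function are made up for illustration and have nothing to do with the generator linked above. It only produces text. Pasting it over old comments is still a manual step (or a job for whatever tool still works).

```python
import random
from typing import Optional

# Hypothetical word pool; swap in any list of words you like.
WORDS = [
    "lamp", "orbit", "walnut", "gravel", "tuesday", "quietly", "purple",
    "ladder", "seven", "borrowed", "river", "staple", "under", "melted",
]


def gibberish(n_words: int, seed: Optional[int] = None) -> str:
    """Return n_words randomly chosen words joined into one fake sentence."""
    rng = random.Random(seed)  # the seed makes the output repeatable
    words = [rng.choice(WORDS) for _ in range(n_words)]
    return " ".join(words).capitalize() + "."


if __name__ == "__main__":
    print(gibberish(12))
```

Passing a `seed` gives the same sentence every time, which is handy for testing. Leave it out to get different text on each call.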
Not that I’m against telling Reddit to fuck off in no uncertain terms, but won’t providing this kind of poisoning to AI training just make it more resilient to exactly this kind of thing?
I don’t think so. It’s really hard to sort the poison out of the data, unless you actually have enough reading comprehension to know that it’s gibberish - humans do, bots don’t. And even if they discard 80% of the poison, the 20% that’s left is already screwing with the model.
They could prevent you from editing your posts/comments, but that would cause an uproar.