Reddit Will License Its Data to Train LLMs, So We Made a Firefox Extension That Lets You Replace Your Comments


originalfrozenbanana@lemm.ee

32 points

7 months ago

Ok but Reddit absolutely saves the old comments though


Jo Miran@lemmy.ml

22 points

7 months ago

You are polluting the data set. Do it a few times with different text sources and the scrubbers won’t know what part of your comment history is good. Replace, don’t delete.


ArbitraryValue@sh.itjust.works

17 points

7 months ago

*

I’m pretty sure they’ll know that the first version of each comment is almost certainly the good one. People sometimes edit a comment to add new information or fix a typo, but they almost never replace nonsense with a good comment, rather than the other way around.

Edit: fixed typos, also replaced excerpt from Moby Dick with this post.

Edit 2: the comments you post here are totally available for machine learning, so I don’t see much of a point in deleting my Reddit comments as long as I’m participating in Lemmy.


Jo Miran@lemmy.ml

3 points

7 months ago

*

Maybe. I edit almost every comment I make. The key is that by doing this you are introducing the possibility. It is actually easier, and safer, to just filter out edited comments than it is to try to sort out what’s good and what isn’t. The bottom line is that the best course of action is to avoid Reddit at all costs. If you do go there and feel compelled to comment, then coming back the next day to replace your comments a few times is better than “deleting”.


originalfrozenbanana@lemm.ee

6 points

7 months ago

Not in a meaningful way. It’s easy to detect and revert a change like this. Instead of bulk changing all your comments, you should slowly change them over time.

Even then, users don’t usually edit most of their comments. Sure, Reddit might be naive and just take the current comments, but it’s pretty trivial to reverse this kind of thing.

It’s probably good to do it to make this process harder and more error-prone for Reddit, but I would not be under the impression that this has an impact beyond being annoying.


4am@lemm.ee

2 points

7 months ago

Or it’ll help train the AI to recognize when that happens and more easily parse history for the relevant stuff.


Car@lemmy.dbzer0.com

3 points

7 months ago

It already happened last year during the Reddit exodus. The AI models either validate the data or they don’t. This has a chance of working, which is better than doing nothing at all.


GBU_28@lemm.ee

1 point

7 months ago

Over a long period, sure. If they see a spike where, say, 25% of a user’s comments are changed in a day, then they’ll just use the data from the day before.
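
To make that heuristic concrete, here is a minimal Python sketch of the spike check. It is purely illustrative: nothing in this thread shows how Reddit actually stores or audits edits, and the function names, the `(user, comment_id, edit_date)` input shape, and the 25% threshold are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def spike_days(edits, totals, threshold=0.25):
    """Return {user: earliest day on which the share of that user's
    comments edited in a single day reached `threshold`}.

    edits  -- iterable of (user, comment_id, edit_date) tuples (hypothetical log)
    totals -- dict mapping user -> total number of comments they have
    """
    edited_per_day = defaultdict(set)
    for user, comment_id, day in edits:
        # Count each comment once per day, even if it was edited repeatedly.
        edited_per_day[(user, day)].add(comment_id)

    spikes = {}
    for (user, day), comment_ids in edited_per_day.items():
        total = totals.get(user, 0)
        if total and len(comment_ids) / total >= threshold:
            if user not in spikes or day < spikes[user]:
                spikes[user] = day
    return spikes


def snapshot_cutoffs(edits, totals, threshold=0.25):
    """For each user with a spike, return the day before it ("day -1")."""
    return {
        user: day - timedelta(days=1)
        for user, day in spike_days(edits, totals, threshold).items()
    }


# Hypothetical edit log: alice rewrites half her comments in one day.
edits = [
    ("alice", 1, date(2024, 2, 20)),
    ("alice", 2, date(2024, 2, 20)),
    ("alice", 3, date(2024, 2, 21)),
    ("bob", 1, date(2024, 2, 20)),
]
totals = {"alice": 4, "bob": 10}
cutoffs = snapshot_cutoffs(edits, totals)
```

In this toy log only alice crosses the threshold, so only her history would be rolled back. Editing slowly over a long period keeps each day's share under the threshold, which is the point made above.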


Grimy@lemmy.world

19 points

7 months ago

*

Reddit has a copy of every comment and edit. They probably have copies of things users type but don’t actually end up posting.

It is brutally trivial to notice mass edits like this.

The only thing this does is make it harder for people to scrape the site without paying, which makes what Reddit is selling actually valuable.

Every edited or deleted comment is more money in their pocket.


henfredemars@infosec.pub

14 points

7 months ago

*

Let this be a lesson on generating content for a business and not getting paid for it.

With that said, I’m sure the frog posts are exactly the kind of quality content needed to train an AI.


YerbaYerba@lemm.ee

7 points

7 months ago

Big business is likely scraping our Lemmy comments anyway.


henfredemars@infosec.pub

8 points

7 months ago

Good point. At least it’s available freely to everyone instead of being used to make a profit on the content itself.


Pennomi@lemmy.world

3 points

7 months ago

See, I don’t have any issue with data being free. I have an issue with corporations hoarding it.


Ultragigagigantic@lemmy.world

1 point

7 months ago

Oh shit it’s still Wednesday!


GregorTacTac@lemm.ee

11 points

7 months ago

My comments on that site are so dumb that AI will not produce any good text after using them as training data.


blindbunny@lemmy.ml

13 points

7 months ago

They’ll gladly use that data; it makes the AI more human.


rustydrd@sh.itjust.works

2 points

7 months ago

Came here to say this.

This comment was powered by ChatGPT 4.5


bronzle@lemm.ee

2 points

7 months ago

This.

powered by redditGPT


hypnicjerk@lemmy.world

11 points

7 months ago

Are there copyrighted texts that have such distinctive patterns that they would be particularly easy to spot in an LLM’s output? Say, would replacing every comment with a page from Moby Dick or Wuthering Heights be more or less infringing than using Harry Potter? Hypothetically.


dual_sport_dork 🐧🗡️@lemmy.world

16 points

7 months ago

Well, I’m pretty sure Moby Dick is in the public domain by now. If I were you I’d go for something from Disney, which is mathematically certain to get somebody sued, although I can’t predict who.


Reddit Will License Its Data to Train LLMs, So We Made a Firefox Extension That Lets You Replace Your Comments(theluddite.org)

Bye Reddit

!byereddit@lemmy.world
