8 points

As annoying as this is, it’s to prevent LLMs from training themselves using Reddit content, and that’s probably the greater of the two evils.

37 points

That’s all well and good, but how many LLMs do you think actually respect robots.txt?

14 points

from my limited experience, about half? i had to finally set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. fuck them for it, but at least it stopped once i added robots.txt.

Facebook, Amazon, and a few others are ignoring that robots.txt, on the other hand. they have the decency to do it slowly enough that i’d never notice unless i checked the logs, at least.
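
For anyone wanting to check what a well-behaved crawler would conclude from a file like that, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user-agent token, the `mirror.example` URL, and the `is_allowed` helper are all made up for illustration; the real token has to come from the crawler's own documentation or from the server logs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler entirely, allow everyone else
# but ask them to wait between requests. "ExampleBot" is a placeholder token.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Crawl-delay: 10
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://mirror.example/wiki/Main_Page"
print("ExampleBot allowed:", is_allowed("ExampleBot", page))
print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", page))
```

This only models what a compliant crawler would decide. robots.txt is advisory, so it does nothing about crawlers that ignore it; those need rate limiting or blocking on the server side.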

32 points

I thought major LLMs ignored robots.txt

12 points

It’s to prevent LLMs from training themselves using Reddit content, unless they pay the party that took no part in creating said content.

FTFY
