5 points

Does it matter if I’m okay with it? If I use those websites, it’s going to happen. And honestly, it’s probably happening with Lemmy too since any company can scrape it for free. That’s just the sad state of the internet at this point. If you write something anywhere public, it’s probably being used to train AI.

1 point

They don’t even need to scrape it. Set up an ActivityPub endpoint and suck it all in nice and json’d.
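For the curious, here's roughly what that looks like. A minimal sketch in Python, not anything any company has confirmed doing: instead of standing up a full federated inbox, it just requests a public outbox as ActivityPub JSON, and the instance/user URL is a made-up placeholder.

```python
import json
import urllib.request

# Hypothetical actor outbox on a hypothetical instance; swap in a real one.
OUTBOX_URL = "https://lemmy.example/u/some_user/outbox"

req = urllib.request.Request(
    OUTBOX_URL,
    headers={"Accept": "application/activity+json"},  # ask for ActivityPub JSON, not HTML
)
with urllib.request.urlopen(req) as resp:
    outbox = json.load(resp)

# An outbox is an ActivityStreams OrderedCollection; items are typically Create
# activities whose "object" holds the actual post. Real servers often paginate
# via "first"/"next" pages rather than inlining "orderedItems".
for activity in outbox.get("orderedItems", []):
    obj = activity.get("object", {})
    if isinstance(obj, dict):
        print(obj.get("content", ""))
```

No HTML scraping, no rate-limit dodging: federation hands it over as structured data by design.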

6 points

Consider the source of the Reddit comments, Stack Overflow questions and FB photos: a substantial number of them are poorly written, irrelevant, reposted, or already churned out by AI. So the ouroboros is the master of, and source of, its own fate. GIGO.

So fill yer boots, AI, it’s your own shit you’re eating.

2 points

Hence Google’s AI giving supposedly real advice based on Reddit joke posts.

6 points

I mean whatever I posted there I posted for public use. Whatever it can learn from my few ancient Stack Overflow posts, it’s welcome to it.

2 points

They’re using your knowledge and talents to profit immensely, selling it right back to you and everyone else. I have a problem with that.

It would be fine if it was free.

2 points

In my opinion Stack Overflow has vague answers 80% of the time. If AI helps filter out all the bullshit, I’m OK with that. And I don’t see how Reddit or Facebook bot comments are a good source of information.

23 points

How much glue should you put in your pizza?

Seriously. Eating Reddit means some truly strange shit is going to emerge. We all know the pattern of joke replies on Reddit, but the AI does not. Hence the glue thing.

-4 points

I don’t think that’s an impossible problem. Existing models can reliably distinguish between, for example, different languages. Most of their training data is presumably in English, and while that may make them better at generating English text, it doesn’t make them randomly switch from other languages into English. A sufficiently advanced model would likewise distinguish between descriptions of reality and shitposts, because the content of shitposts isn’t useful for predicting descriptions of reality. Some fine-tuning would then teach it to produce just the descriptions of reality.

Or look at it this way: the folks developing these LLMs aren’t ignorant of the fact that Reddit content is often false and meant to be funny. They’re not going to make the sort of silly mistake that even a non-expert can easily predict, and they’re not going to train their LLMs on that content if it makes the models worse, although we’re still going to see some glue on pizza while the technology continues to develop.

6 points

It can cross-check a language against tons of other words and examples of that language already in its dataset. There is no such data for whether or not something conforms to reality. That simply doesn’t exist and really never will. They are not similar problems. One is immensely more challenging to solve than the other.

-5 points

Glue is still a more valid pizza topping than PINEAPPLE!!!

2 points

Bring back anchovies.

3 points

Found Zoidberg!

6 points

Pineapple and bacon are the best pizza toppings to ever exist.

3 points

Pineapple and jalapeno together on pizza is pretty dope as well.
