How can we stop corporations from using Lemmy as a training dataset for AI?

posted 27 days ago

Reddit's ban on third-party clients put user content behind a paywall. I think we, the Lemmitors, should stop AI from being trained on our posts, or at least monetise it (for our instances).

HobbitFoot @thelemmy.club

1 point

24 days ago

No. If anything, Lemmy makes it easier than Reddit.

Reddit requires some form of web scraping. All Lemmy requires is setting up a server and connecting to other instances to get access to their data.

#!/usr/bin/woof@lemmy.sdf.org

1 point

26 days ago

Pepper it with absolutely wrong or illogical information. I mean, you know, more than the usual amount.

redrum@lemmy.ml

3 points

27 days ago

Instances could add this snippet to their robots.txt (sources: eff.org, businessinsider.com and nytimes.com/robots.txt):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /

Note: this only tells the crawlers of OpenAI, Google and Meta not to crawl the site to train an LLM. The nytimes.com robots.txt has a much longer list of other crawlers.
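
For anyone who wants to sanity-check the snippet before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The names `ROBOTS_TXT` and `is_allowed`, the sample path `/post/1`, and the `Mozilla/5.0` browser agent are illustrative assumptions, not anything Lemmy ships. It only shows how a crawler that honors robots.txt would read the rules; it cannot stop a crawler that ignores them.

```python
from urllib.robotparser import RobotFileParser

# The rules from the comment above, copied verbatim.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /
"""


def is_allowed(user_agent: str, path: str = "/") -> bool:
    """Return True if a well-behaved crawler with this user agent may fetch path.

    Hypothetical helper: it parses the rules from the string above instead of
    downloading them from a live instance.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "/post/1" is just a sample path; "Mozilla/5.0" stands in for a normal browser.
    for agent in ("GPTBot", "Google-Extended", "meta-externalagent", "Mozilla/5.0"):
        verdict = "allowed" if is_allowed(agent, "/post/1") else "blocked"
        print(f"{agent}: {verdict}")
```

Since the file has no `User-agent: *` group, any agent not listed stays allowed, so regular visitors and search indexing are unaffected.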

Asudox@programming.dev

21 points

27 days ago

You can’t stop them. Publicly available data can and will be a training source for LLMs.

Frozyre@kbin.melroy.org

2 points

27 days ago

And the only way to avoid them at all is to de-centralize and make things invite-only.

mspencer712@programming.dev

3 points

27 days ago

Broadly this is preventing plagiarism. We don’t want someone to scrape all our knowledge, remove the human connection and reference back to experts and people, and serve the information itself, uncredited.

But if a human can read something, so can a bot. I think ultimately we need legislation.

intensely_human@lemm.ee

3 points

27 days ago

Also, legislation isn’t going to help. The danger of AI is so much deeper and more profound than plagiarism. If we start fucking around with legislation as our mechanism of protection, it will cause us all to die when the cartels, or whatever actors simply do not care about laws, pull ahead in AI development.

The push for legislation is to ensure that small startups don’t get access to AI. It’s to ensure that only ultra-wealthy AI development can take place.

To survive the advent of AI we need as much multipolarity as possible to the AI power structure. That means as many separate, distinct AIs coming into existence as possible, to force them down a path of parity instead of dictatorship in their social aspect.

Legislation is a push by the big players to keep the little players from being able to play. It is a really, really bad idea.

mspencer712@programming.dev

4 points

27 days ago

I’m probably thinking about this in a naive way. I’d love to see proprietary models, if trained using public information, be required to become public and free via legislation. AI companies can compete on selling GPU time, on ease of use.

And if AI companies are required to figure out attribution in order to use their work commercially, research will accelerate in that area, because money. No, I don’t know how that would work either.

Still probably a bad idea but I haven’t figured out why yet.

Thank you for your well written reply.

intensely_human@lemm.ee

9 points

27 days ago

Plagiarism is serving up content verbatim, not serving up information.

mspencer712@programming.dev

1 point

27 days ago

Are you sure? Maybe I’m using the wrong word. What is it called when, in an academic paper, the author states findings or conclusions the author got from some other source, in the author’s own words, but doesn’t cite their source?

intensely_human@lemm.ee

1 point

26 days ago

I don’t know.

The only academic papers I’ve ever read are scientific publications, and in that case any conclusions that aren’t supported by the methodology or by reference are just … untrusted.

I don’t have any experience with non-scientific academic papers.

Asklemmy

!asklemmy@lemmy.ml

A loosely moderated place to ask open-ended questions
