Two authors sued OpenAI, accusing the company of violating copyright law. They say OpenAI used their work to train ChatGPT without their consent.
To be honest, I hope they win. While my passion is technology, I am not a fan of artificial intelligence at all! Decision-making is best left to human beings. I can see where AI has its place, like in gaming or some other things, but to mainstream it and use it to decide whose résumé gets viewed and/or who gets hired? Hell no.
I’m not against artificial intelligence; it could be a very valuable tool. But that’s nowhere near a valid reason to break the law, as OpenAI has done, which is why I too hope the authors win.
Copyright. Apparently this is not the first time they’ve been sued over it (violating copyright is a crime).
You don’t need AI to unfairly filter out résumés; companies have been doing that for years already. Also, the argument that a human would always make the best decision really doesn’t hold up. A human is biased and limited. They can only do so much, and if you make someone go through 100 résumés, you’re basically throwing out all the applicants who happen to be in the middle of that pile, since they don’t stand out compared to the first and last applicants in the eyes of the human mind.
If I read a book to inform myself, put my notes in a database, and then write articles, it is called “research”. If I write a computer program to read a book and put the notes in my database, it is called “copyright infringement”. Is the problem that there just isn’t a meatware component? Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?
I honestly do not care whether it is or is not copyright infringement; I just hope to see “AI” burn :3
AI isn’t a boogeyman, it’s a set of tools. There’s no chance it’s going away even if OpenAI suddenly disappeared.
Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?
This is exactly the problem. Months ago I read that AI could have free access to all public source code on GitHub without respecting its licenses.
So many developers have decided to abandon GitHub for other alternatives, not realizing that in the end AI training can just as freely access their public repos on those other platforms as well.
What should be done is to regulate this training. That, however, is not convenient for the companies, because the more data the AI ingests, the more its knowledge expands and the more it “helps” the people who ask it for information.
It’s incredibly convenient for companies.
Big companies like OpenAI can easily afford to download big data sets from companies like Reddit and DeviantArt, which already have permission to freely use whatever work you upload to their websites.
Individual creators do not have that ability, and this kind of regulation would only force AI into the domain of these big companies even more than it already is.
Regulation would be a hideously bad idea that would lock these powerful tools behind the shitty web APIs that nobody has control over but the company in question.
Imagine a future world of magical new-age technology, and Facebook owns all of it.
Do not allow that to happen.
Plus, any regulation to limit this now means that anyone not already in the game will never break through. It’s going to be the domain of the current players for years, if not decades. So I’m not sure what’s better: the current wild west where everyone can make something, or it being exclusive to the already-big players, who then close the door behind them.
My concern here is that OpenAI didn’t have to share GPT with the world. These lawsuits are going to discourage companies from doing that in the future, which means well-funded companies will just keep it under wraps. Once one of them eventually figures out AGI, they’ll just use it internally until they dominate everything. Suddenly, Mark Zuckerberg is supreme leader and we all have to pledge allegiance to Facebook.
Is it practically feasible to regulate the training? Is it even necessary? Perhaps it would be better to regulate the output instead.
It will be hard to know whether any particular GET request is ultimately used to train an AI or to train a human. But it’s currently easy to see whether a particular output is plagiarized (e.g. https://plagiarismdetector.net/), and that’s also much easier to enforce. We don’t need to care if or how any particular model plagiarized work; we can just check whether plagiarized work was produced.
That check could be implemented directly in the software, so that it never even outputs plagiarized material. The legal framework around it is also clear and fairly well established: instead of creating regulations around training, we can use the existing regulations around the human who tries to disseminate copyrighted work.
That’s also consistent with how we enforce copyright for humans. There’s no law against looking at other people’s work and memorizing entire sections. It’s also generally legal to reproduce other people’s work (e.g. for backups). It only potentially becomes illegal if someone distributes it, and it’s only plagiarism if they claim it as their own.
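For what it’s worth, here’s a minimal sketch of what such an output-side check could look like: flag any generated text that reproduces a long-enough run of words from a protected corpus. The corpus, the 8-word window, and all the function names here are made up for illustration; this isn’t anything OpenAI or the plagiarism checkers actually publish.

```python
import re

def tokenize(text):
    # Lowercase and keep only word-ish characters so punctuation
    # differences don't hide a verbatim match.
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words, n):
    # Yield every run of n consecutive words as a tuple.
    return (tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def build_index(corpus_texts, n=8):
    # Index every n-gram that appears in the protected corpus.
    index = set()
    for text in corpus_texts:
        index.update(ngrams(tokenize(text), n))
    return index

def looks_plagiarized(output, index, n=8):
    # True if the output reproduces any indexed n-gram verbatim.
    return any(gram in index for gram in ngrams(tokenize(output), n))

corpus = ["Call me Ishmael. Some years ago, never mind how long precisely, ..."]
index = build_index(corpus)
print(looks_plagiarized(
    "He opens with: call me Ishmael, some years ago, never mind how long precisely.",
    index,
))  # True: the output shares an 8-word run with the corpus
```

A production checker would need fuzzier matching than exact 8-grams (paraphrase, reordering), but the enforcement model is the same: inspect what comes out, not what went in.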
This makes perfect sense. Why aren’t they going about it this way then?
My best guess is that maybe they just see OpenAI being very successful and want a piece of that pie? Because if someone produces something via ChatGPT (let’s say for a book) and uses it, what are the chances they made any significant amount of money that you could sue for?
AI could have free access to all public source code on GitHub without respecting its licenses.
IANAL, but aren’t their licenses being respected up until the code is put into a codebase? At least insofar as Google is allowed to display code snippets in the preview when you look up a file in a GitHub repo, or you are allowed to copy a snippet into a StackOverflow discussion or ticket comment.
I do agree regulation is a very good idea, in more ways than just citation, given the potential economic impacts that we seem clearly unprepared for.
Yeah. There are valid copyright claims, because there are times when ChatGPT will reproduce stuff like code line for line over 10, 20, or 30 lines, which is really obviously a violation of copyright.
However, just pulling in a story from context and then summarizing it? That’s not a copyright violation; that’s a book report.
Say I see a book that sells well. It’s in a language I don’t understand, but I use a thesaurus to replace lots of words with synonyms. I switch some sentences around, and maybe even mix pages from similar books into it. I then go and sell this book (still not knowing what the book actually says).
I would call that copyright infringement. The original book didn’t inspire me, it didn’t teach me anything, and I didn’t add any of my own knowledge into it. I didn’t produce any original work, I simply mixed a bunch of things I don’t understand.
That’s what these language models do.
The fear is that the books are in one way or another encoded into the machine learning model, and that the model can somehow retrieve excerpts of these books.
Part of the training process of the model is to learn how to plagiarize the text word for word. The training input is basically “guess the next word of this excerpt”. This is quite different from how humans do research.
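To make that concrete, here’s a minimal sketch of what the “guess the next word” objective looks like as data. This is the generic language-modeling setup, not OpenAI’s actual training code, and the excerpt and context size are made up:

```python
# Every position in the text becomes a training example whose label is
# the literal next word, so the model is scored on exact, word-for-word
# recall of the source text.

def next_word_examples(text, context_size=4):
    # Turn raw text into (context, next_word) training pairs.
    words = text.split()
    return [
        (words[i - context_size:i], words[i])
        for i in range(context_size, len(words))
    ]

excerpt = "It was the best of times, it was the worst of times"
for context, target in next_word_examples(excerpt):
    print(" ".join(context), "->", target)
# first pair: "It was the best" -> "of"
```

A real model predicts a probability distribution over a vocabulary rather than a single word, but the labels come straight from the copyrighted text either way, which is the point being made above.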
To what extent the books are encoded in the model is difficult to know. OpenAI isn’t exactly open about their models. Can you make ChatGPT print out entire excerpts of a book?
It’s quite a legal gray zone. I think it’s good that this is being tried in court, but I’m afraid the court might have too little technical competence to make a ruling.
What about the fact that they are making billions from that “reading” and “storage” of other people’s copyrighted information? They need to at least pay royalties. This is like Google’s behavior: using people’s data from “free” products to make billions. I would say they also need to pay people for the free data they crawled and monetized.
Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug, but:
The problem is that they then sell the notes in that database for giant piles of cash. Props to you if you’re profiting off your research the way OpenAI can profit off its model.
But yes, the lack of meat is an issue. If I read the article right, though, that’s not what’s being contested here. (IANAL, and this is the only article I’ve read on this particular suit, so I may be wrong.)
Was also going to reply to them!
"Well if you do that you source and reference. AIs do not do that, by design can’t.
So it’s more like you summarized a bunch of books. Pass it of as your own research. Then publish and sell that.
I’m pretty sure the authors of the books you used would be pissed."
Again, I can’t reply to kbin users.
“I don’t have a problem with the summarizing part ^^ What’s missing with an AI is that it cannot credit or reference. And that it makes up credits and references if asked to do so.” @bioemerl@kbin.social
It is 100% legal and common to sell summaries of books to people. That’s what a reviewer does. That’s what Wikipedia does in the plot section of literally every Wikipedia page about every book.
This is also ignoring the fact that ChatGPT is a hell of a lot more than a bunch of summaries.
The problem is that they then sell the notes in that database for giant piles of cash.
On top of that, they have no way of generating any notes without your input.
I believe the way these models work is fundamentally plagiaristic. It’s an “average of its inputs” situation, not a “greater than the sum of its parts” one.
GitHub Copilot doesn’t know how to code, it knows how to copy-and-paste from people who do. It’s useless without a million devs to crib off.
I think it’s a perfectly reasonable reaction to be rather upset when some Silicon Valley chuckleheads help themselves to your life’s work in order to build a bot to replace you.
@owf@kbin.social can’t reply directly to you either, same language bug between lemmy and kbin.
That’s a great way to put it.
Frankly idc if it’s “technically legal,” it’s fucking slimy and desperately short-term. The aforementioned chuckleheads will doom our collective creativity for their own immediate gain if they’re not stopped.
AI fear is going to be the Trojan horse for even harsher and stupider “intellectual property” laws.
Yeah, they want copyright not only to control who copies their work and distributes it to other people, but also who’s able to actually read and learn from their work.
It’s asinine, and we should be rolling back copyright, not making it more strict. This life-of-the-author-plus-70-years thing is bullshit.
Since any reductions to copyright, if they happen at all, will take a while, I hope someone comes up with an opt-in limited-term copyright. At most, I’d be satisfied with a 45-50 year copyright on everything I make, and could see going shorter under plenty of circumstances.
Copyright on code/research is one of the biggest scams in the world. It hinders development and only exists so the creator can make money, and it locks knowledge behind a paywall.
Researchers pay for publication, the publisher doesn’t pay for peer review, and then the publisher charges the reader to read research that they basically just slapped on a website.
It’s the publisher middlemen that need to be ousted from academia; the researchers don’t get a dime.
Remember, Creative Commons licenses often require attribution if you use the work in a derivative product, and sometimes require ShareAlike. Without these things, there would be basically no protection from a large firm copying a work and calling it their own.
Rolling back copyright protection in these areas would enable large companies, with their traditional copyright arsenals, to wholesale take over open source projects, to the detriment of everyone. Closed source software isn’t going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.
Closed source software isn’t going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.
The point of open source is contributing to the greater good of all humanity. If open source contributes to an AI that can program, and that programming AI leads to increased productivity and capability in the general economy, then open source has served its purpose, and people will likely continue to contribute to it.
Creative Commons applies when you redistribute code. (In the ideal case) AI does not redistribute code; it learns from it.
And the average person’s increased ability to program will allow programmers to be more productive, and as a result more things will be open source and more things will be programmed in general. We will all benefit, and that is what open source is for.
There’s also the GPL, which states that derivations of GPL code can only be used in GPL software, and that GPL software must itself be open source.
ChatGPT is likely trained on GPL code. Does that mean all code ChatGPT generates is GPL?
I wouldn’t be surprised if there were an update to the GPL making it clear that any machine learning model trained on GPL code must also be GPL.
Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug as well. This is an interesting argument. I would imagine that the AI does not have the ability to follow plagiarism rules. Does it even credit sources? I’ve seen plenty of complaints from students getting in trouble because anti-cheating software flags their original work as plagiarism. More importantly, I really believe we need to take a firm stance on what is ethical to feed into ChatGPT. Right now it’s the wild west.