It’s all made from our data, anyway, so it should be ours to use as we want.

134 points

It won’t really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

38 points

They pulled a very pubic and out in the open data heist

Oh no, not the pubes! Get those curlies outta here!

15 points

Best correction ever. Fixed. ♥️

29 points

Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.

-3 points

Not really. The same way you can’t sell live and public performance music for profit and not get sued. Case law right there, and the fact it’s performance vs publicly published doesn’t matter. How the owner and originator classifies or licenses it is the defining classification. It’s going to be years before anyone sees this get a ruling in court though.

13 points

That’s not what’s going on here, though. The LLM doesn’t contain the actual copyrighted data; it’s the result of analyzing the copyrighted data.

An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing, it just contains information about those works.

6 points

It’s already illegal in some form, via piracy of the works and regurgitating protected data.

The issue is a megacorp with many rich investors vs. everyone else. If this were some university student, their life would probably be ruined, like what happened to Aaron Swartz.

The US justice system is different for different people.

6 points

If we can’t train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.

Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.

The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.

I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.

0 points

Unlicensed from the POV of the trainer, meaning they used content without contacting the owner or getting a license. If it’s posted under Creative Commons, that’s fine. If it’s posted with terms saying it’s not open and not for corporate use, then they need to contact the owner and license it.

3 points

They won’t need to; they will get it from Getty. All these websites have a ToS that makes it very clear they can do whatever they want with what you upload. The courts will simply never side with the small-time photographer who makes $50 a month with his stock photos hosted on someone else’s website. The laws will be in favor of data brokers and the handful of big AI companies.

Anyone self-hosting will simply not get a call. Journalists will keep the same salary while the newspaper’s owner gets a fat bonus. Even Reddit already sold its data for 60 million and none of that went anywhere but spez’s coke fund.

3 points

But if it were made open source and then didn’t function properly without the data, wouldn’t that prove it is using a huge amount of unlicensed data?

Probably not “burden of proof in a court of law” prove though.

8 points

Making it open source doesn’t change how it works. It doesn’t need the data after it’s been trained. Most of these AIs are just figuring out patterns to look for in the new data they come across.

3 points

So you’re saying the data wouldn’t exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?

2 points

In civil matters, the burden of proof is actually usually just preponderance of the evidence, not beyond a reasonable doubt. In other words, to win a lawsuit you only need more compelling evidence than the other side.

4 points

But you still have to have EVIDENCE. Not derivative evidence. The output of a model could be argued to be hearsay because it’s not direct evidence of originating content, it’s derivative.

You’d have to have somebody backtrack generations of model data to even find snippets of something that qualifies as copyrighted material, or a human actually saying “Yes, we definitely trained on unlicensed data”.

2 points

Just a little note about the word “model”: in the article it’s used in a way that actually includes the weights, and I think this is the usual way of using it! If you change the weights, you get a different model, though the two models will have the same structure.

Anyway, you make good points!

90 points

So banks will be public domain when they’re bailed out with taxpayer funds, too, right?

63 points

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank’s assets, and now effectively owns the bank.

11 points

At the same time, if a bank goes under, that means they owe more than they own, so “ownership” of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.

The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.

The AIG bailout required the company to issue new stock, basically diluting the existing owners down to 20% of the company while the government owned the other 80%, and the government made a big profit when it exited that transaction and sold the stock off to the public.

So it’s not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don’t want is the government owning a company long term, because there’s some conflict of interest between its role as regulator and its interest as a shareholder.

4 points

With banks this is also true if they do not have enough liquid assets to meet the legal requirements. So the bank might not be able to count all bank accounts as assets, but the FDIC can. They can then also restructure the bank and force creditors to take a haircut.

This is why investment banks should be separate from banks that have consumer accounts that are insured by the government.
Then you can just let the investment bank fail. This was the whole premise of Glass-Steagall, which was repealed under Clinton…

10 points

Public domain wouldn’t be the right term for banks being publicly owned, at least under the normal usage of “public domain” in copyright. You can copy text and data, but you can’t copy a company with unique customers and physical property.

4 points

Oh good point. I’m not actually sure what the phrase would be… Publicly owned?

3 points

Just FYI on the bank bailouts in the US: the banks paid back the bailout plus interest to the government, meaning the govt actually made a profit off the bailout. There are a lot of things wrong with both banks and the govt, but generally this is not one of them. https://www.propublica.org/article/the-bailout-was-11-years-ago-were-still-tracking-every-penny

1 point

Super interesting, learned something new today. Thanks!

2 points

I mean, that sometimes did happen.

Germany propped up the Commerzbank after 2007 by essentially buying a large part of it, and managed to sell several tranches with a healthy profit.

Same is true for Lufthansa during COVID.

1 point

Banks are redundant, so is the stock market. These institutions do not need to be, and should not be, private. They are level playing fields in the economy, not participants trying to tilt the board to take over the game.

-1 points

No, “the banks” wouldn’t be what the AI would be trained on, it would be the private info of individuals the banks do business with.

62 points

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I’m not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

9 points

Yes, mining companies should all be nationalised for digging up the country’s ground and putting carbon in the country’s air.

0 points

We essentially do have the death penalty for corporations, it’s called being declared a criminal organisation.

-21 points

You must be fun at parties.

9 points

this comment doesn’t make any sense

2 points

You must be new here.

62 points

I don’t think it should be a “punishment.” It should be done on principal.

4 points

Not sure making their LLMs public domain would really hurt their principal, their secret sauce is in the code around the model.

And yes, I do recognize that you meant “principle”.

3 points

That’s not true though. The models themselves are hella intensive to train. We already have open source programs to run LLMs at home, but they are limited to smaller open-weights models. Having a full ChatGPT model that can be run by any service provider or home server enthusiast would be a boon. It would certainly make my research more effective.

0 points

We already have multiple trained models, here are a bunch. The model isn’t nearly as interesting as what you do with it.

56 points

It’s not punishment. LLMs do not belong to them; they belong to all of humanity. Tear down the enclosing fences.

This is our common heritage, not OpenAI’s private property.

2 points

It doesn’t matter anyway, we still need the big companies to bankroll AI. So it effectively does belong to them whatever we do.

Hopefully at some point people can get the processor requirements to something sane and AI development opens up to us all.


Technology

!technology@lemmy.world
