I don’t buy the pencil comparison. If I have a painting in my basement that has a distinctive style, but has never been digitized and trained upon, I’d wager you wouldn’t be able to recreate either that image or its style. What gives? Because AI is not a pencil but more like a data mixer you throw complete works into and it spews out collages. Maybe collages of very finely shredded pieces, to the point you couldn’t even tell, but pieces of original works nonetheless. If you put any non-free works in it, they definitely contaminate the output, and so the act of putting them in in the first place should be a copyright violation in itself. The same as if I were to show you the image in question and you decided to recreate it: I can sue you and I will win.
That is a fundamental misunderstanding of how AI works. It does not shred the art and recreate things with the pieces. It doesn’t even store the art in the algorithm. One of the biggest methods right now is basically taking an image of purely random pixels. You show it a piece of art with a whole lot of tags attached. It then semi-randomly changes pixel colors until it matches the training image. That set of instructions is associated with the tags, and the two are combined into a series of tiny weights that the randomizer uses. Then the next image modifies the weights. Then the next, then the next. It’s all just teeny tiny modifications to random number generation. Even if you trained an AI on only a single image, it would be almost impossible for it to produce it again perfectly, because each generation starts with a truly random (as truly as a computer can get, and unweighted) image of pixels. Even if you force fed it the same starting image of noise that it trained on, it is still only weighting random numbers and still probably won’t create the original art, though it may be more or less indistinguishable at a glance.
AI is just another tool. Like many digital art tools before it, it has been maligned from the start. But the truth is what it produces is the issue, not how. Stealing others’ art by manually reproducing it or using AI is just as bad. Using art you’re familiar with to inspire your own creation, or using an AI trained on known art to make your own creation, should be fine.
As a side note, because it wasn’t too clear from your writing: the weights are only tweaked a tiny tiny bit by each training image. Unless the trainer sees the same image a shitload of times (Mona Lisa, that one stock photo used to show off phone cases, etc) then the image can’t be recreated by the AI at all. Elements of the image that are shared with lots of other images (shading style, poses, Mario’s general character design, etc) could, but you’re never getting that one original image or even any particular identifiable element from it out of the AI. The AI learns concepts and how they interact because the amount of influence it takes from each individual image and its caption is so incredibly tiny but it trains on hundreds of millions of images and captions. The goal of AI image generation is to be able to create a vast variety of images directed by prompts, so generating lots of images which directly resemble anything in the training set is undesirable; in the field it’s called over-fitting.
Anyways, the end result is that AI isn’t photo-bashing, it’s more like concept-bashing. And lots of methods exist now to better control the outputs, from ControlNet, to fine-tuning on a smaller set of images, to DALL-E 3, which can follow complex natural language prompts better than older methods.
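To make the "tiny tweak per image, unless it sees the same one a shitload of times" point concrete, here is a minimal toy sketch. It is not an image model: it assumes a model with a single weight `w` trained by plain gradient descent, and the learning rate, the data, and the "odd example" are all made up for illustration.

```python
import random

def sgd_step(w, x, y, lr):
    """One gradient step on squared error for the one-weight model y = w * x."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad

LR = 0.001  # assumed learning rate, picked only for this demo
rng = random.Random(0)

# Toy "training set": 10,000 noisy examples that all agree on w = 3.
xs = [rng.uniform(-1, 1) for _ in range(10_000)]
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]

w = 0.0
for x, y in data:
    w = sgd_step(w, x, y, LR)

# One odd example that "wants" w = 10 instead.
odd_x, odd_y = 1.0, 10.0

# Seen once: the weight barely moves.
w_seen_once = sgd_step(w, odd_x, odd_y, LR)

# Seen thousands of times: the weight gets dragged onto that one example.
w_seen_often = w
for _ in range(5_000):
    w_seen_often = sgd_step(w_seen_often, odd_x, odd_y, LR)

print(f"after the normal data:        w = {w:.3f}")
print(f"odd example seen once:        w = {w_seen_once:.3f}")
print(f"odd example seen 5,000 times: w = {w_seen_often:.3f}")
```

In this toy, a single pass over the odd example nudges `w` by roughly a hundredth, while repeating it thousands of times makes the model reproduce that one example, which is the over-fitting case. Real image models have vastly more weights and data, so take it as an analogy only.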
Regardless, lots of people find training generative AI using a mass of otherwise copyrighted data (images, fan fiction, news articles, ebooks, what have you) without prior consent just really icky.
You show it a piece of art with a whole lot of tags attached. It then semi-randomly changes pixel colors until it matches the training image. That set of instructions is associated with the tags, and the two are combined into a series of tiny weights that the randomizer uses. Anyways, the end result is that AI isn’t photo-bashing, it’s more like concept-bashing
That’s what I meant by “very finely shredded pieces”. I oversimplified it, yes. What I mean is that it’s not literally taking a pixel off an image and putting it into the output, but that using the original image in any way is just copying with extra steps.
Say we forgo AI entirely and talk real-world copyright. If I were to record a movie theater screen with a camcorder, I would commit copyright infringement, even though the image is transformed by my camera lens. Same as if I were to distribute the copyrighted work in a ZIP file, invert its colors, or trace every frame and paint it with watercolors.
What if I were to distribute the work’s name alongside its SHA-1 hash? You might argue that such a transformation destroys the original work, that the result can no longer be used to retrieve the original, and that it therefore should be legal. But if that were the case, torrent site owners could sleep peacefully knowing that they are safe from prosecution. The real world has shown that it’s not the case.
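For anyone unfamiliar with what a SHA-1 hash of a work looks like, here is a minimal sketch using Python’s standard `hashlib`. The `work` bytes are a made-up stand-in for a real file; the point is only that the digest is always 20 bytes (40 hex characters) no matter how large the input is, so the work itself cannot be read back out of it.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()

# Hypothetical stand-in for a large copyrighted file (about 1 MB of bytes).
work = b"frame data " * 100_000

digest = sha1_hex(work)
print(len(work), "bytes in,", len(digest), "hex characters out")
print("Some Movie (hypothetical title):", digest)

# Changing a single byte of the input gives a completely different digest.
tampered = sha1_hex(work + b"!")
print("digests match after tampering:", digest == tampered)
```

So the hash identifies the work without containing it, which is exactly the situation the torrent comparison is about.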
Now, what if we take some hashing function and brute force the seed until we get one which outputs the SHA-1s of certain works given their names? That’d be a terrible version of AI, acting exactly like an over-trained model would: spouting random numbers except for works it was “trained” upon. Is distributing such a seed/weight a copyright violation? I’d argue that’d be an overly complicated way to conceal piracy, but yes, it would be. Because those seeds/weights are still based on the original works, even if not strictly a direct result of their transformation.
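A rough sketch of that thought experiment, heavily scaled down. Brute forcing a seed that reproduces full 160-bit SHA-1 digests is not feasible, so this toy only matches the first byte of each digest. The work titles, their contents, and the `model_output` function (SHA-256 over seed plus name) are all assumptions invented for the illustration.

```python
import hashlib

PREFIX_BYTES = 1  # toy setting: matching full digests would be infeasible

def target_prefix(work: bytes) -> bytes:
    """First byte(s) of the SHA-1 of the actual work."""
    return hashlib.sha1(work).digest()[:PREFIX_BYTES]

def model_output(seed: int, name: str) -> bytes:
    """The 'model': a hash of (seed, name), truncated to the same length."""
    material = seed.to_bytes(8, "big") + name.encode("utf-8")
    return hashlib.sha256(material).digest()[:PREFIX_BYTES]

def find_seed(targets, max_tries=5_000_000):
    """Brute force a seed whose outputs match every target; None if not found."""
    for seed in range(max_tries):
        if all(model_output(seed, name) == prefix
               for name, prefix in targets.items()):
            return seed
    return None

# Hypothetical works, keyed by name.
works = {
    "Movie A": b"contents of movie A",
    "Movie B": b"contents of movie B",
}
targets = {name: target_prefix(data) for name, data in works.items()}

seed = find_seed(targets)
print("seed found:", seed)
for name in works:
    print(name, "->", model_output(seed, name).hex())
print("Unseen Movie ->", model_output(seed, "Unseen Movie").hex())
```

The found seed answers correctly for the two works it was fitted to and gives an essentially arbitrary byte for anything else, which is the "over-trained model" behavior described above. Whether distributing such a seed is infringement is the legal question being argued; the code only shows the mechanism.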
Anyways, the end result is that AI isn’t photo-bashing, it’s more like concept-bashing
Copying concepts is also copyright infringement, though.
Regardless, lots of people find training generative AI using a mass of otherwise copyrighted data (images, fan fiction, news articles, ebooks, what have you) without prior consent just really icky.
It shouldn’t be just “icky”, it should be illegal and prosecuted ASAP. The longer it goes on like this, the more the entire internet is going to be filled with those kind-of-copyrighted things, and it will eventually turn into a lawsuit shitstorm.