I mean, you’ve never seen a purple elephant with a tennis racket. None of that exists in the data set since elephants are neither purple nor tennis players. Exposure to all the individual elements allows for generation of concepts outside the existing data, even though they don’t exit in reality or in the data set.
Ok.
Try to get an image generator to create an image of a tennis racket, with all racket-like objects or relevant sport data removed from the training data.
Explain the concept to it with words alone, accurately enough to get something that looks exactly like the real thing. Maybe you can give it pictures, but one won’t really be enough, you’ll basically have to give it that chunk of training data you removed.
That’s the problem you’ll run into the second you want to realize a new game genre.
There are more forms of guidance than just raw words. Just off the top of my head, there’s inpainting, outpainting, controlnets, prompt editing, and embeddings. The researchers who pulled this off definitely didn’t do it with text prompts.
Obviously.
But at what point does that guidance just become the dataset you removed from the training data?
To get it to run Doom, they used Doom.
To realize a new genre, you’ll “just” have to make that game the old fashion way, first.