AI singer-songwriter ‘Anna Indiana’ debuted her first single ‘Betrayed by this Town’ on X, formerly Twitter—and listeners were not too impressed.
I’ve been doing a lot of testing and evaluating of LLMs and GPT-style models for generating code and text/prose. Some of it is just general use to see how they behave, some has been explicit evaluation of creative writing, and a bunch of it is code generation to test out how we need to modify our CS curriculum in light of these new tools.
It’s an impressive piece of technology, but it’s not very creative. It’s meh. The results are meh. Which is to be expected since it’s a statistical model that’s using a large body of prior work to produce a reasonable approximation of what it’s seen before. It trends towards the mean, not the best.
This’d explain why inexperienced users of AI would inevitably get mediocre results. It still takes creativity to get stolen mediocrity.
and a bunch of it is code generation to test out how we need to modify our CS curriculum in light of these new tools.
I’m curious if you’ve gotten anything decent out of them. I’ve tried to use it for tech/code questions, and it’s been nothing but disappointment after disappointment. I’ve tried to use it to get help with new concepts, but it hallucinates like crazy and always gives me bad results. Some of the time it’s so bad that it gives me answers I’ve already told it were wrong.
Yeah, I’ve just set up a hotkey that says something like “back up your answer with multiple reputable sources” and I just always paste it at the end of everything I ask. If it can’t find webpages to show me to back up its claims then I can’t trust it. Of course this isn’t the case with coding, for that I can actually run the code to verify it.
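For anyone who would rather script that than use a hotkey, here is a minimal sketch of the same idea. The function name, the constant, and the exact suffix wording are assumptions made up for illustration, not anything from a real tool, and it only builds the prompt text; sending it to a model is left out.

```python
# Hypothetical helper: the names and the suffix wording are illustrative assumptions.
VERIFY_SUFFIX = "Back up your answer with multiple reputable sources."


def with_verification(prompt: str, suffix: str = VERIFY_SUFFIX) -> str:
    """Return the prompt with the source-request suffix appended exactly once."""
    prompt = prompt.rstrip()
    # Avoid doubling the suffix if it has already been pasted in.
    if prompt.endswith(suffix):
        return prompt
    return f"{prompt}\n\n{suffix}"


if __name__ == "__main__":
    print(with_verification("How does DNS caching work?"))
```

The check for an existing suffix just keeps the request from piling up if the same text gets run through the helper twice.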
What version are you using?
GPT-4 is quite impressive, and the dedicated code LLMs like Codex and Copilot are as well. The latter must have had a significant update in the past few months, as it’s become wildly better almost overnight. If trying it out, you should really do so in an existing codebase it can use as a context to match style and conventions from. Using a blank context is when you get the least impressive outputs from tools like those.
I’ve used GPT-3/3.5, Bing, Bard and Copilot, and I’m not super stoked. Copilot gave me PS DSC items that don’t actually exist, which was my most recent attempt at using an LLM.
At some point I might see if it can hook into my VS Code instance so it’s a bit smarter.
It trends towards the mean, not the best.
That’s where some of the significant advances over the past 12 months of research have been, specifically around using the fine-tuning phase to bias towards excellence. The biggest advance there has been that capabilities in larger models seem to be transmissible to smaller models by feeding in output from the larger, more complex models.
Also, the process supervision work from May to enhance chain-of-thought (CoT) reasoning is pretty nuts.
So while you are correct that the pretrained models come out with a regression towards the mean, there are very promising recent advances in taking that foundation and moving it towards excellence.
I’m excited for how these tools will be used by human creators to accomplish things they could never do alone, and in that aspect it is a revolutionary technology. I hate that their marketing calls it “AI” though. The only intelligence involved is the human user who creates prompts and curates results.