micheal65536
No, I don’t.
Also here from Reddit; you might remember my username.
Not sure if I’ll stick around though; this platform is really unstable. Comments don’t sync/show up half the time, so you miss the actual discussion that takes place (on the rare occasion that there is actual discussion), and the mobile app has literally crashed three times already between me seeing that there is a new post, opening said post, and trying to write a reply to it. The web version is also very glitchy: half the time I can’t log in, or viewing a community just gives a blank page that never loads.
As a technology/computing person I do understand why it is that way, but that doesn’t make it any less frustrating from a user perspective. It’s no fun being part of a community that’s already tiny to begin with while knowing that you’re probably also missing a substantial portion of the posts/comments that do exist, with no way to even know that they’re there.
I legit don’t even know if you will ever read this or if it will fall into the synchronisation black hole between my instance and yours.
As an autistic person I like to minimise change. I’m also good at learning how different appliances and equipment work and how to repair them, so I always try to repair things first before replacing them. I get the double satisfaction of not wasting an otherwise perfectly good appliance and also getting to keep the appliance that I am familiar with and like, instead of having to try to find a different one.
I thought the original LLaMa was not particularly good for conversational-format interaction as it is not instruction fine-tuned? I thought its mode of operation is always “continue the provided text” (so for example instead of saying “please write an article about…” you would have to write the title and opening paragraph of an article and then it would continue the article).
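To illustrate what I mean, here’s a rough sketch using llama-cpp-python (the model path and prompts are made-up examples, not anything official): with a base model you phrase the request as the start of the document you want continued, rather than as an instruction.

```python
from llama_cpp import Llama

# Made-up local path to a base (non-instruction-tuned) LLaMa model.
llm = Llama(model_path="./models/llama-7b.gguf")

# Instruction-style prompt: a base model has no special handling for this, it
# just continues the text, so it may ramble or list similar requests instead
# of actually complying.
out = llm("Please write an article about repairing home appliances.", max_tokens=200)
print(out["choices"][0]["text"])

# Continuation-style prompt: give it the title and opening of the article you
# want and let it carry on from there.
prompt = (
    "Repair, Don't Replace: Getting More Life Out of Your Home Appliances\n\n"
    "Most household appliances fail in a handful of predictable, fixable ways. "
)
out = llm(prompt, max_tokens=200)
print(out["choices"][0]["text"])
```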
I am getting very poor results with this model. Its coding ability is noticeably worse than LLaMa 2’s. It will readily produce output that claims to be following a logical progression of steps, but often the actual answer is not consistent with that logic, or the steps themselves are not in fact correct or logical.
Curious to know if other people who have tried it are getting the same results.
I see your point, and we are currently at the “trying to look good on benchmarks” stage with LLMs, but my concern/frustration at the moment is that this is actually hindering real progress, because researchers/developers are looking at the benchmarks and saying “it’s X percentage, this is a big improvement” while ignoring real-world performance.
Questions like “how important is the parameter count” (I think it is more important than people are currently acknowledging) are being left unanswered, because meanwhile people are saying “here’s a 13B-parameter model that scores X percentage compared to GPT-3”, as if to imply that smaller = better, even though the smaller size may be trading the model’s actual reasoning ability for learned patterns that score well on benchmarks. And new training methods are being developed (see: Evol-Instruct, Orca) on the basis of benchmark comparisons and not with consideration of their real-world performance.
I get that benchmarks are an important and useful tool, and I get that performing well on them is a motivating factor in an emerging and competitive industry. But I can’t accept such an immediately noticeable decline in real-world performance (the model literally craps itself) compared to previous models, while the release simultaneously brags about how outstanding the benchmark performance is.
Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both, with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate); sometimes I also try it on a deterministic setting.
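For reference, this is roughly how I run it, sketched with llama-cpp-python (the path and the Alpaca-style prompt are placeholder examples; the right prompt format depends on the model): default sampling with a few regenerations, then an effectively deterministic pass at temperature 0.

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/example-13b.gguf")  # placeholder path

# Placeholder Alpaca-style prompt; substitute whatever format the model expects.
prompt = (
    "### Instruction:\n"
    "Write a Python function that reverses a string.\n\n"
    "### Response:\n"
)

# Default llama-cpp-python sampling settings, with a few regenerations so the
# model gets more than one attempt at the question.
for attempt in range(3):
    out = llm(prompt, max_tokens=256)
    print(f"--- attempt {attempt + 1} ---")
    print(out["choices"][0]["text"])

# Effectively deterministic run: temperature 0 picks the most likely token at
# every step, so repeated runs give (near enough) the same answer.
out = llm(prompt, max_tokens=256, temperature=0.0)
print("--- deterministic ---")
print(out["choices"][0]["text"])
```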
I wasn’t aware that the benchmarks are multi-shot. I haven’t looked that much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.
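From what I understand, the benchmark harness typically prepends a few solved examples before the actual question, along these lines (the examples here are made up purely for illustration), which is quite different from asking the question cold in a chat interface:

```python
# Illustrative only: building a k-shot prompt the way many benchmark harnesses
# do, with a few worked examples placed ahead of the real question.
examples = [
    ("What is 7 + 5?", "12"),
    ("What is 9 - 4?", "5"),
    ("What is 6 * 3?", "18"),
]
question = "What is 8 + 6?"

prompt = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in examples)
prompt += f"Question: {question}\nAnswer:"
print(prompt)
```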