
micheal65536

micheal65536@lemmy.micheal65536.duckdns.org
2 posts • 55 comments

Also here from Reddit; you might remember my username.

Not sure if I’ll stick around though; this platform is really unstable. Comments don’t sync/show up half the time, so you miss the actual discussion that takes place (on the rare occasion that there is actual discussion), and the mobile app has literally crashed three times already between me seeing that there is a new post, opening said post, and trying to write a reply to it. The web version is also very glitchy: half the time I can’t log in, or viewing a community just gives a blank page that never loads.

As a technology/computing person I do understand why it is that way, but that doesn’t make it any less frustrating from a user perspective. It’s no fun being part of a community that’s already tiny to begin with while knowing that you’re probably also missing a substantial portion of the posts/comments that do exist with no way to even know that they’re there.

I legit don’t even know if you will ever read this or if it will fall into the synchronisation black hole between my instance and yours.


I think you read that the wrong way around ;-)


As an autistic person I like to minimise change. I’m also good at learning how different appliances/equipment work and how to repair them, so I always try to repair things first before replacing them. I get the double satisfaction of not wasting an otherwise perfectly good appliance and of getting to keep the appliance that I am familiar with and like, instead of having to try to find a different one.


I thought the original LLaMa was not particularly good for conversational-format interaction, as it is not instruction fine-tuned? I thought its mode of operation was always “continue the provided text” (so, for example, instead of saying “please write an article about…” you would have to write the title and opening paragraph of an article yourself and then it would continue the article).
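Something like this is what I mean, as a minimal sketch with the llama-cpp-python bindings (the model path, prompt text, and sampling values are made-up examples, not anything from the original LLaMa release): you prime the base model with the beginning of the text you want and let it carry on, rather than giving it an instruction.

```python
# Minimal sketch of the "continue the provided text" mode of a base
# (non-instruction-tuned) model, using llama-cpp-python.
# The model path and prompt are hypothetical examples.
from llama_cpp import Llama

llm = Llama(model_path="./llama-7b.q4_0.bin")  # hypothetical local quantised base model

# Instead of "please write an article about X", write the title and the start of
# the article yourself and let the model continue from where you left off.
prompt = (
    "Running Language Models at Home\n\n"
    "Over the last year, running large language models on consumer hardware has "
    "gone from a curiosity to"
)
output = llm(prompt, max_tokens=200, temperature=0.7)
print(prompt + output["choices"][0]["text"])
```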


Which one is the “newer” one? Looking at the quantised releases by TheBloke, I only see one version of 30B WizardLM (in multiple formats/quantisation sizes, plus the unofficial uncensored version).


I am getting very poor results with this model. Its coding ability is noticeably worse than LLaMa 2’s. It will readily produce output that claims to be following a logical progression of steps, but often the actual answer is not consistent with the logic, or the steps themselves are in fact not correct or logical.

Curious to know if other people who have tried it are getting the same results.


I haven’t tried that one yet because it seemed like an older model with a less refined dataset, but I will put it in the queue as the next model to download and try out.


I see your point, and we are currently at the “trying to look good on benchmarks” stage with LLMs, but my concern/frustration at the moment is that this is actually hindering real progress: researchers/developers are looking at the benchmarks and saying “it’s X percentage, this is a big improvement” while ignoring real-world performance.

Questions like “how important is the parameter count?” (I think it is more important than people are currently acknowledging) are being left unanswered, because in the meantime people are saying “here’s a 13B-parameter model that scores X percentage compared to GPT-3”, as if to imply that smaller = better, even though the smaller size may be trading actual reasoning ability for patterns that happen to score well on benchmarks. And new training methods (see: Evol-Instruct, Orca) are being developed through benchmark comparisons and not with consideration of their real-world performance.

I get that benchmarks are an important and useful tool, and I get that performing well on them is a motivating factor in an emerging and competitive industry. But I can’t accept such an immediately noticeable decline in real-world performance (the model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.


Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both, with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate); sometimes I try it on a deterministic setting.
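For reference, this is roughly what I mean by a deterministic setting, as a sketch with llama-cpp-python (the model path and the instruction-style prompt are just placeholders; use whatever format the particular model expects):

```python
# Rough sketch of a deterministic generation run with llama-cpp-python.
# The model path and prompt format are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(model_path="./wizardlm-30b.q4_K_M.gguf", seed=0)  # fixed seed

result = llm(
    "### Instruction:\nWrite a Python function that reverses a string.\n\n### Response:\n",
    max_tokens=256,
    temperature=0.0,  # greedy decoding: always take the most likely token
    top_p=1.0,        # disable nucleus sampling
)
print(result["choices"][0]["text"])
```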

I wasn’t aware that the benchmarks are multi-shot; I haven’t looked that much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.
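So if I understand correctly, “multi-shot” means the benchmark prompt already contains a few worked examples before the actual question, something roughly like this made-up illustration, which is quite different from asking a single question cold:

```python
# Made-up illustration of a few-shot ("multi-shot") benchmark-style prompt:
# a handful of solved examples precede the question actually being scored.
few_shot_prompt = """\
Q: What is 12 + 7?
A: 19

Q: What is the capital of France?
A: Paris

Q: Which planet is known as the Red Planet?
A:"""
# The model only has to complete the final answer; the earlier Q/A pairs
# demonstrate the expected format and behaviour.
```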
