micheal65536
No, I don’t.
Also here from Reddit; you might remember my username.
Not sure if I’ll stick around though; this platform is really unstable. Comments don’t sync/show up half the time, so you miss the actual discussion that takes place (on the rare occasion that there is actual discussion), and the mobile app has literally crashed three times already between me seeing that there is a new post, opening said post, and trying to write a reply to it. The web version is also very glitchy: half the time I can’t log in, or viewing a community just gives a blank page that never loads.
As a technology/computing person I do understand why it is that way, but that doesn’t make it any less frustrating from a user perspective. It’s no fun being part of a community that’s already tiny to begin with while knowing that you’re probably also missing a substantial portion of the posts/comments that do exist, with no way to even know that they’re there.
I legit don’t even know if you will ever read this or if it will fall into the synchronisation black hole between my instance and yours.
As an autistic person I like to minimise change. I’m also good at learning how different appliances and equipment work and how to repair them, so I always try to repair things first before replacing them. I get the double satisfaction of not wasting an otherwise perfectly good appliance and also getting to keep the appliance that I am familiar with and like, instead of having to try to find a different one.
I thought the original LLaMa was not particularly good for conversational-format interaction as it is not instruction fine-tuned? I thought its mode of operation is always “continue the provided text” (so for example instead of saying “please write an article about…” you would have to write the title and opening paragraph of an article and then it would continue the article).
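To illustrate what I mean, here’s a rough sketch using llama-cpp-python (the model path and prompts are made-up examples, not anything official): with a base model you phrase the request as the start of the document you want continued, rather than as an instruction.

```python
from llama_cpp import Llama

# Made-up local path to a base (non-instruction-tuned) LLaMa model.
llm = Llama(model_path="./models/llama-7b.gguf")

# Instruction-style prompt: a base model has no special handling for this, it
# just continues the text, so it may ramble or list similar requests instead
# of actually complying.
out = llm("Please write an article about repairing home appliances.", max_tokens=200)
print(out["choices"][0]["text"])

# Continuation-style prompt: give it the title and opening of the article you
# want and let it carry on from there.
prompt = (
    "Repair, Don't Replace: Getting More Life Out of Your Home Appliances\n\n"
    "Most household appliances fail in a handful of predictable, fixable ways. "
)
out = llm(prompt, max_tokens=200)
print(out["choices"][0]["text"])
```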
I am getting very poor results with this model. Its coding ability is noticeably worse than LLaMa 2’s. It will readily produce output that claims to be following a logical progression of steps, but often the actual answer is not consistent with that logic, or the steps themselves are not in fact correct or logical.
Curious to know if other people who have tried it are getting the same results.
I see your point, and we are currently at the “trying to look good on benchmarks” stage with LLMs, but my concern/frustration at the moment is that this is actually hindering real progress, because researchers/developers are looking at the benchmarks and saying “it’s X percentage, this is a big improvement” while ignoring real-world performance.
Questions like “how important is the parameter count” (I think it is more important than people are currently acknowledging) are being left unanswered, because meanwhile people are saying “here’s a 13B-parameter model that scores X percentage compared to GPT-3”, as if to imply that smaller = better, even though the smaller size may be trading the model’s actual reasoning ability for learned patterns that score well on benchmarks. And new training methods are being developed (see: Evol-Instruct, Orca) on the basis of benchmark comparisons and not with consideration of their real-world performance.
I get that benchmarks are an important and useful tool, and I get that performing well on them is a motivating factor in an emerging and competitive industry. But I can’t accept such an immediately noticeable decline in real-world performance (the model literally craps itself) compared to previous models, while the release simultaneously brags about how outstanding the benchmark performance is.
Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both, with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate); sometimes I also try it on a deterministic setting.
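For reference, this is roughly how I run it, sketched with llama-cpp-python (the path and the Alpaca-style prompt are placeholder examples; the right prompt format depends on the model): default sampling with a few regenerations, then an effectively deterministic pass at temperature 0.

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/example-13b.gguf")  # placeholder path

# Placeholder Alpaca-style prompt; substitute whatever format the model expects.
prompt = (
    "### Instruction:\n"
    "Write a Python function that reverses a string.\n\n"
    "### Response:\n"
)

# Default llama-cpp-python sampling settings, with a few regenerations so the
# model gets more than one attempt at the question.
for attempt in range(3):
    out = llm(prompt, max_tokens=256)
    print(f"--- attempt {attempt + 1} ---")
    print(out["choices"][0]["text"])

# Effectively deterministic run: temperature 0 picks the most likely token at
# every step, so repeated runs give (near enough) the same answer.
out = llm(prompt, max_tokens=256, temperature=0.0)
print("--- deterministic ---")
print(out["choices"][0]["text"])
```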
I wasn’t aware that the benchmarks are multi-shot. I haven’t looked that much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.
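From what I understand, the benchmark harness typically prepends a few solved examples before the actual question, along these lines (the examples here are made up purely for illustration), which is quite different from asking the question cold in a chat interface:

```python
# Illustrative only: building a k-shot prompt the way many benchmark harnesses
# do, with a few worked examples placed ahead of the real question.
examples = [
    ("What is 7 + 5?", "12"),
    ("What is 9 - 4?", "5"),
    ("What is 6 * 3?", "18"),
]
question = "What is 8 + 6?"

prompt = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in examples)
prompt += f"Question: {question}\nAnswer:"
print(prompt)
```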