Obviously there’s not a lot of love for OpenAI and other corporate API generative AI here, but how does the community feel about self-hosted models? Especially stuff like the Linux Foundation’s Open Model Initiative?
I feel like a lot of people just don’t know there are Apache- and CC-BY-NC-licensed “AI” models they can run on sane desktops, right now, that are incredible. I’m thinking of the most recent Command-R, specifically. I can run it on one GPU, and it blows expensive API models away, and it’s mine to use.
And there are efforts to kill the power cost of inference and training with stuff like matrix-multiplication-free models, open-source and legally licensed datasets, cheap training… and OpenAI and such want to shut all of this down because it breaks their monopoly, where they can just outspend everyone on scaling, stealing data and destroying the planet. And it’s actually a threat to them.
Again, I feel like corporate social media vs. the fediverse is a good analogy: one is kinda destroying the planet, and the other, while still niche, problematic, and a work in progress, kills a lot of the downsides.
I think it’s amazing. I’m running Ollama with a bunch of open-source LLMs. You’re right, it’s so good. The problem is keeping up to date on what the newest developments are.
The pace of progress is so fast and it’s really difficult to know what the cool kids are experimenting with this moment.
Oh, and if your hardware is AMD or Nvidia, you should really give exllama a shot.
If it’s Apple, you should investigate kobold.cpp and the more “nitty-gritty” llama.cpp backends.
I have largely negative feelings towards ollama for a lot of reasons, but one of them is that it hides a lot of the knobs you need to get the absolute best out of LLMs and to understand how they work.
I’d recommend TabbyAPI with your favorite frontend: anything that speaks the OpenAI API will work.
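For example (a minimal sketch; the port, API key, and model name here are assumptions, since the point is just that TabbyAPI exposes an OpenAI-compatible endpoint any OpenAI client can talk to):

```python
# Minimal sketch: querying a local OpenAI-compatible server (e.g. TabbyAPI).
# The base_url/port, api_key, and model name are placeholder assumptions;
# use whatever your own server is actually configured with.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="local-model",  # placeholder; many local servers ignore or remap this
    messages=[{"role": "user", "content": "Why is local inference useful?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

The nice part is that any frontend or script written against the OpenAI API works unchanged once you swap in the local base URL.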
Or exui (which is what I tend to use), though it’s a bit more manual. text-gen-web-ui has better samplers, but it’s IMO more clunky and crufty, and really slow at long context.
Also, uh, you’ll have to be careful about picking a model: you have to fit it to your GPU instead of letting ollama do it for you. I view this as a positive, as it forces you to search for a more optimal fit.
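As a rough rule of thumb (back-of-the-envelope only; real usage also depends on the quant format, context length, and backend overhead), you can sanity-check whether a model will fit like this:

```python
# Back-of-the-envelope VRAM estimate for a quantized model.
# All numbers are illustrative assumptions, not measurements.
params_billion = 12      # e.g. a 12B model
bits_per_weight = 4.5    # a typical ~4-bit quant averages slightly above 4 bits

weights_gb = params_billion * bits_per_weight / 8   # ~6.8 GB just for weights
overhead_gb = 2.0        # rough allowance for KV cache + runtime overhead
print(f"~{weights_gb + overhead_gb:.1f} GB needed; compare against your GPU's VRAM")
```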
Honestly a big problem is that the community for filtering the news has “collapsed.”
The only reasonable congregation was basically /r/localllama, and due to a number of factors (including, apparently, a Reddit bug that was driving away traffic, according to a mod), it’s shrunk a ton.
Twitter, LinkedIn, YouTube and such are awful and full of straight-up lies. Huggingface is just impossible to navigate and filter. There are a few niche aggregators, but they come and go.
Hence I was hoping Lemmy would grow its existing ML communities, but most of Lemmy seems broadly anti-AI, even anti open-source AI, which is why I made this post to get a feel for whether that’s true.
I read localllama through redlib but I don’t contribute. I am not technical enough to contribute and I don’t understand the math.
I have been looking on YouTube for some videos that try to explain it, but I haven’t found anything in the sweet spot between “video for non-technical people” and “video for people with a PhD in quantum physics.”
It’s a giant mess. Even the technical videos tend to be theoretical, and are either obsolete or do nothing to help you actually run the models.
I would know nothing if I hadn’t been following the community since the Pygmalion/ESRGAN days.
I’m really into local hosting and open LLMs, but I’ve largely stepped back due to ‘fatigue’. I’ve downloaded, tweaked, and reshuffled models and programs, then a couple of months pass and it’s leapt forward again. Which is good, but I figured I’d wait until it slowed down a bit.
I will say, the fact that I can run decent 7B and even 10B models and get decent responses and times with a 3070 is impressive. AnythingLLM has been a really handy program for me. It’s still in development, but it’s been neat working with RAG. I also moved from textgen to LMStudio and am really liking it. I like textgen, but I felt it got a bit sidetracked. A lot of good suggestions in here, so cheers OP.
You can probably run Nemo 12B pretty quickly, though Llama 3.1/Gemma 9B finetunes may be better tbh. DeepSeek Coder V2 Lite with offloading would still be fast, even though it’s a 16B, since it’s such a heavy MoE.
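The reason a heavy MoE stays quick even when partially offloaded is that only a fraction of its weights fire per token. A rough comparison (the parameter counts below are approximate, from memory):

```python
# Rough illustration of why MoE offloading stays fast: per-token compute
# scales with *active* parameters, not total size. Figures are approximate.
models = {
    "Nemo 12B (dense)":              {"total_b": 12, "active_b": 12},
    "DeepSeek Coder V2 Lite (MoE)":  {"total_b": 16, "active_b": 2.4},
}
for name, m in models.items():
    ratio = m["active_b"] / m["total_b"]
    print(f"{name}: {m['active_b']}B active of {m['total_b']}B total "
          f"(~{ratio:.0%} of weights used per token)")
```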
Hardware is such a limiting factor now. Once quad-channel APUs and such start coming out, I feel like it will open up the space, so people don’t have to hunt down used 3090s and build desktops around them.
Last I tried was a fimbul merge for 10.4b with rope scaling for creative writing, which was great, but yeah, 3.1 is where I’ve landed lately. I’ll have to check out Nemo! Like you mentioned, I was sitting on money to grab a 3090, but I think I’ll wait for rtx50xx to drive down prices, or just for dedicated hardware. I’ll be sure to keep an eye on the AI subs though; clearly there’s a community for it here that’s interested in discussion.
rtx50xx
Don’t. Nvidia is going to price-gouge the snot out of it. Honestly, if you want to buy new, just get a 7900 XTX. Screw Nvidia’s pricing on new cards, lol.
fimbul merge for 10.4b
Speaking as someone who’s done a lot of merging, the “upscaling” merges are not great. RoPE-scaling the context isn’t either. You are better off finding models that were trained at the parameter count and context length you want in the first place, and there is a lot more choice these days.
Oh, and I forgot to mention: instead of a 5090, buy AMD Strix Halo if it’s any good.
I cannot emphasize enough how awesome 128GB on a fast APU would be. That opens up (admittedly slow, but usable) inference of “huge” models like Mistral Large, and very fast inference of large MoE models like 8x22B.
I do think it’s good that we’re able to self-host these models. Better than not being able to.
But the biggest draw of open-source to me is that I and others in the community can fix things.
It’s possible that I just don’t understand enough about how these models are created, but right now, it doesn’t feel like we’re able to fix things.
If the next LLaMa model loses all knowledge of the Uyghur genocide, because Facebook wants to distribute it in China, then I don’t know how we’d patch that back in. Even collecting the training data is tricky.
It feels a lot more like Creative Commons than open-source, i.e. you can use what they’ve created, and you can remix it, but adding to it is not easily possible.
I don’t know how we’d patch that back in. Even collecting the training data is tricky.
You can just take encyclopedia articles and news articles, then train it back in. It’s easy! It’s not expensive either: maybe $100, even if it’s a really big model and you’re uncensoring a ton of topics.
People uncensor models all the time; it’s an avenue of research in the LLM community. And in fact, there are many quite good Chinese models (like Qwen2) that have been “uncensored” by the community.
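As a sketch of what that looks like in practice (the base model name, dataset file, and hyperparameters below are placeholders picked for illustration, not a specific recipe), a small LoRA fine-tune over encyclopedia/news text is roughly:

```python
# Hedged sketch: "training knowledge back in" with a LoRA adapter.
# Base model, dataset path, and hyperparameters are placeholder assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B"   # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

# articles.jsonl: one {"text": "..."} record per encyclopedia/news article
data = load_dataset("json", data_files="articles.jsonl")["train"]
data = data.map(lambda x: tok(x["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments("relearned-topics-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1, learning_rate=2e-4,
                           logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()

model.save_pretrained("relearned-topics-lora")   # small adapter, cheap to share
```

The adapter is tiny compared to the base model, which is part of why this kind of patching is so cheap to run and to distribute.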
I’m most excited where it’s most open: clear training process, legal datasets, fully open codebases, published reports, etc. I think we’re going to see local models boom in sophistication once that’s more common.
Do you know of any good local models that fit that kind of description?
I don’t know of any super high-quality ones that run well, but the Open Assistant project (now archived) collected responses from voluntary participants (myself included) to build what is now considered a very high-quality dataset of chat conversation pairs: truly open source, and all voluntarily submitted instead of scraped.
The models are reasonable for fine-tuning, but aren’t very good compared to newer models from large companies.
Cutting edge ones? Unfortunately, rarely. Right now there’s a sliding scale between “open and transparent” and “smart and performant” because they’re just so darn expensive to train.
I think some of the closest ones to your requirements are Nvidia’s research models, excluding Mistral Nemo, which isn’t as well documented (as it’s really a Mistral model). And you can see that a lot of the open “alternative” efforts like RWKV, openllama and such are severely underfunded and undertrained.
The datasets are there, the highly optimized implementations are getting there, the pieces are there, and a lot of models have detailed papers and fully open codebases, but the funding to actually do it is just too much to deal with most of the time.
Another factor is that “closed” datasets like whatever Mistral, Facebook, Cohere and such use do seem to have an edge.
This one is brand new: https://github.com/allenai/OLMoE
I’m in favor of an “ML-GPL,” where models must be made available for free to those whose data was used to train them.
Publishing a dataset is just inviting legal trouble. Look at all the nonsense LAION had to go through for LAION-5B. I’m not surprised people are not publishing datasets more.