Testing the Limits: My GTX 1070 Rig vs Mistral Small 22B

posted 2 months ago

localllama@sh.itjust.works

Mistral Small 22B just dropped today and I am blown away by how good it is. I was already impressed with Mistral NeMo 12B's abilities, so I didn't know how much better a 22B could be. It answers really tough, obscure trivia that NeMo couldn't, and its reasoning abilities are even more refined.

With Mistral Small I have finally reached the plateau of what my hardware can handle for my personal use case. I need my AI to generate at least around my base reading speed. The lowest I can tolerate is about 1.5 T/s; anything lower is unacceptable. I really doubted that a 22B model could even run on my measly Nvidia GTX 1070 with 8GB of VRAM and 16GB of DDR4 RAM. NeMo ran at about 5.5 T/s on this system, so how would Small do?
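To put those speed thresholds in reading terms, here is a rough Python sketch that converts tokens per second into approximate words per minute. The 0.75 words-per-token ratio is an assumption (a common rule of thumb for English text, not something measured on these models), and the function names are hypothetical.

```python
# Rough conversion from generation speed (tokens/second) to words per minute.
# ASSUMPTION: ~0.75 English words per token, a common rule of thumb; the real
# ratio depends on the tokenizer and the text being generated.
WORDS_PER_TOKEN = 0.75


def tokens_per_sec_to_wpm(tps: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert tokens/second to approximate words/minute."""
    return tps * words_per_token * 60


def meets_floor(tps: float, floor_tps: float = 1.5) -> bool:
    """True if the generation speed is at or above the tolerable floor."""
    return tps >= floor_tps


if __name__ == "__main__":
    # Speeds quoted in the post.
    speeds = [
        ("NeMo 12B", 5.5),
        ("Small 22B, short context", 2.5),
        ("Small 22B, long context", 1.7),
        ("floor", 1.5),
    ]
    for label, tps in speeds:
        wpm = tokens_per_sec_to_wpm(tps)
        print(f"{label}: {tps} T/s ~ {wpm:.0f} words/min, ok={meets_floor(tps)}")
```

Swapping in a different words-per-token ratio for your own tokenizer only changes the constant, not the comparison against the 1.5 T/s floor.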

Mistral Small Q4_K_M runs at 2.5 T/s with 28 layers offloaded to VRAM. As the context grows, that drops to 1.7 T/s. It is absolutely usable for real-time conversation. Sure, I would like the token speed to be faster, and I have considered going with the smallest recommended Q4 quant to gain back a little speed. However, I am very happy just to have it running and actually usable in real time. It's crazy to me that such a seemingly advanced model fits on my modest hardware.
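For a sense of why only about half the layers fit on an 8GB card, here is a back-of-the-envelope Python sketch of the quantized model's size and the GPU/RAM split. The parameter count (~22.2B), the average of ~4.85 bits per weight for Q4_K_M, and the 56-layer count are assumptions rather than figures from the post, and the estimate ignores embeddings, the KV cache and runtime overhead.

```python
# Back-of-the-envelope estimate of how a partially offloaded model splits
# between GPU VRAM and system RAM.
# ASSUMPTIONS (not from the post): ~22.2B parameters, ~4.85 average bits per
# weight for Q4_K_M, and 56 equally sized transformer layers. Embeddings,
# the KV cache and runtime overhead are ignored.

PARAMS = 22.2e9
BITS_PER_WEIGHT = 4.85
N_LAYERS = 56


def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def offload_split(total_gb: float, n_layers: int, gpu_layers: int) -> tuple[float, float]:
    """Return (GB on GPU, GB in system RAM) when gpu_layers of n_layers are offloaded."""
    if not 0 <= gpu_layers <= n_layers:
        raise ValueError("gpu_layers must be between 0 and n_layers")
    per_layer_gb = total_gb / n_layers
    gpu_gb = per_layer_gb * gpu_layers
    return gpu_gb, total_gb - gpu_gb


if __name__ == "__main__":
    total = model_size_gb(PARAMS, BITS_PER_WEIGHT)
    gpu, cpu = offload_split(total, N_LAYERS, gpu_layers=28)
    print(f"Q4_K_M weights: ~{total:.1f} GB")
    print(f"28 layers on GPU: ~{gpu:.1f} GB VRAM, ~{cpu:.1f} GB system RAM")
```

Dropping to a smaller quant lowers bits_per_weight, which shrinks every layer and lets a few more of them fit in the same VRAM budget.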

I'm a little sad now, though, since this is as far as I think I can go on the AI self-hosting frontier without investing in a beefier card. Do I need a bigger, smarter model than Mistral Small 22B? No. Hell, NeMo was serving me just fine. But now I want to know just how smart the biggest models get. I caught AI Acquisition Syndrome!

atlas@sh.itjust.works

5 points

2 months ago

Hope you feel better soon

Smokeydope@lemmy.world (OP)

2 points

2 months ago

Thanks. I shouldn't have said I felt sad about it; that's a little hyperbolic. I'm just a little bothered. I'm much happier about finding a model that pushes my AI rig to its maximum potential while still being usable in real time.

The Hobbyist@lemmy.zip

3 points

2 months ago

How surprising! I did not see anything about Mistral Small on HN, so I tried it out, and it seems pretty good for its size. Thanks for sharing!

brucethemoose@lemmy.world

3 points

2 months ago

Oh, and you HAVE to try the new Qwen 2.5 14B.

The whole lineup is freaking sick: the 32B is outscoring Llama 3.1 70B in a lot of benchmarks, and in personal use it feels super smart.

Possibly linux@lemmy.zip

3 points

2 months ago

I run Mixtral on my CPU

BaroqueInMind@lemmy.one

2 points

2 months ago

Read up on the Hermes 3 technical paper and you'll realize it's the best one. Running the 8B model with the correct initial system prompt makes it as smart as GPT-4o.

Smokeydope@lemmy.world (OP)

2 points

2 months ago

The linked paper was a good read. Thank you.

BaroqueInMind@lemmy.one

2 points

2 months ago

Ironically, if you ask ChatGPT to write you an initial system prompt for Hermes that sounds similar to its own, it will essentially share a trade secret with you, giving up portions of its system prompt that can make your self-hosted 8B LLM perform like a commercial one.
