Best quantization level for 8GB VRAM with DeepSeek?

Question

Hey everyone! I’ve been diving into the DeepSeek models lately, specifically looking to run the 6.7B or maybe even a squeezed version of the 33B on my local machine. My main bottleneck is that I'm working with an RTX 3070, which only gives me 8GB of VRAM to play with.

I’ve been experimenting with LM Studio and Ollama, but I’m struggling to find the 'sweet spot' where I don't sacrifice too much intelligence for speed. I tried a Q8_0 quant, but it pretty much maxed out my memory and started lagging the rest of my system. On the flip side, I'm worried that dropping down to something like Q3 or Q2 will turn the model's logic into total mush. I’m mostly using it for Python coding assistance and some general creative writing, so maintaining coherence is pretty important to me.

For those of you with similar 8GB setups, what quantization level (GGUF or EXL2) have you found provides the best balance of tokens-per-second and reasoning quality? Is it better to run a smaller 7B model at Q6_K, or try to cram a larger model into a Q4_K_M? I'd love to hear your experiences!

BarbicianArts · Accepted Answer

For your situation, ngl, 8GB on an RTX 3070 is a struggle. I tried running DeepSeek-Coder-33B-Instruct at Q4_K_M, but it was basically a slideshow, since a 33B file at that quant is way bigger than 8GB and most of it ends up running off system RAM... super disappointing. Honestly, you're better off with DeepSeek-Coder-6.7B-Instruct at Q6_K or even Q8_0. It stays snappy and the logic is actually solid for Python. Cramming a 33B into Q2_K just makes it hallucinate like crazy, so it's kinda useless for coding imo. Stick to the 7B models!
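If you want to sanity-check that before downloading anything, here's a rough back-of-the-envelope sketch. The bits-per-weight numbers are approximations I'm assuming for typical GGUF quants (real files vary by model and llama.cpp version), and the 1.5 GiB overhead for context and runtime buffers is just a guess, so treat the output as a ballpark only.

```python
# Rough VRAM fit check for GGUF quants.
# ASSUMPTION: approximate bits per weight; actual file sizes differ per model.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 3.0,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def weights_gib(params_billion, quant):
    """Estimate the size of the quantized weights in GiB."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 2**30


def fits(params_billion, quant, vram_gib=8.0, overhead_gib=1.5):
    """True if weights plus an assumed overhead fit in VRAM.

    overhead_gib is a hypothetical allowance for the context cache,
    runtime buffers, and whatever the desktop is already using.
    """
    return weights_gib(params_billion, quant) + overhead_gib <= vram_gib


if __name__ == "__main__":
    for params in (6.7, 33.0):
        for quant in APPROX_BITS_PER_WEIGHT:
            size = weights_gib(params, quant)
            verdict = "fits" if fits(params, quant) else "does not fit"
            print(f"{params}B {quant}: ~{size:.1f} GiB of weights, {verdict} in 8 GiB")
```

Under these assumptions the 33B doesn't fit at any listed quant, while the 6.7B fits comfortably at Q6_K and gets tight at Q8_0.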

unqszpsmxo · Answer

I've spent years messing with VRAM limits and honestly, 8GB is that awkward middle ground where you gotta be smart about it. Basically, quantization isn't just about shrinking the file; it stores each weight with fewer bits, so every weight gets rounded to a coarser grid. When you drop to Q2 or Q3, the rounding error gets big enough that the model starts losing coherence.
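To get a feel for how the error grows as the bit count drops, here's a toy sketch. It is not how GGUF K-quants actually work (those use more elaborate block layouts); it's plain symmetric per-block rounding on random numbers, with the block size of 32 and the Gaussian "weights" being my own assumptions for illustration.

```python
import random


def quantize_roundtrip(weights, bits, block_size=32):
    """Quantize and dequantize with a simple symmetric per-block scheme.

    Each block is scaled so its largest |w| maps to the largest integer
    level, rounded to the nearest level, then scaled back.
    """
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 1 for 2-bit
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / levels if max_abs > 0 else 1.0
        restored.extend(round(w / scale) * scale for w in block)
    return restored


def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return (total / len(original)) ** 0.5


random.seed(0)
# ASSUMPTION: stand-in "weights" drawn from a unit Gaussian.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
errors = {
    bits: rms_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 6, 4, 3, 2)
}

if __name__ == "__main__":
    for bits, err in errors.items():
        print(f"{bits}-bit: RMS rounding error {err:.4f}")
```

The error climbs at every step down, and the jump from 4 bits to 2 or 3 is much steeper than from 8 to 6, which lines up with why the low quants feel so much worse.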

On an 8GB RTX 3070, here's what I suggest for the best balance:

- Stick to the DeepSeek-Coder-6.7B-Instruct at Q5_K_M or Q6_K. It's way more reliable for Python than a heavily lobotomized 33B.
- If you really wanna try the bigger models, look for the DeepSeek-V2-Lite-Chat GGUF versions.
- Make sure to use 'Flash Attention' in LM Studio to save a bit of memory.

I mean, Q4_K_M on a 7B model is usually the 'gold standard' for speed, but for coding, that extra bit of precision in Q6 actually helps with syntax. Q8_0 on a 6.7B model eats nearly all of the 8GB on its own, so there's barely anything left for context and it just chokes your OS. gl!
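Part of why the headroom matters is the context cache, which sits in VRAM next to the weights. Here's a minimal sketch of the usual estimate, assuming a Llama-style ~7B shape (32 layers, 32 KV heads, head size 128) and 16-bit cache values. Those numbers are my assumption, so check the model's config for the real ones.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate key/value cache size in GiB.

    Keys and values (the factor of 2) are kept for every layer and
    every token in the context window.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    # ASSUMPTION: hypothetical Llama-style ~7B shape, 16-bit cache.
    for context_len in (2048, 4096, 8192, 16384):
        size = kv_cache_gib(32, 32, 128, context_len)
        print(f"context {context_len}: ~{size:.1f} GiB of KV cache")
```

With those assumed values the cache scales linearly with context length, so long contexts can cost as much VRAM as stepping up a quant level or two.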

igdifkoqfl · Answer

sooo I went through this last year... honestly, it's such a struggle trying to balance that 8GB VRAM limit. iirc I spent way too much time trying to squeeze everything out of my rig. I eventually just stuck with the GGUF format for everything cuz it's basically the only way to stay sane without crashing. I mean, going with any of the quantized versions from the big names on HuggingFace is usually the play. tbh just stick with GGUF files and you'll be fine. gl!