Hey everyone! I’m finally diving into local hosting for the DeepSeek models, but I’m a bit torn on which quantization method to go with. I’m running a setup with dual 3090s (48GB VRAM total), so I have some room to play with, but these models are still massive. I’ve been looking at GGUF for ease of use, but I’ve heard EXL2 or AWQ might offer better inference speeds. I really don’t want to sacrifice too much of that impressive reasoning logic just to save space. Has anyone compared 4-bit vs 8-bit perplexity on these specifically? What’s your recommended 'sweet spot' for balancing speed and output quality for DeepSeek right now?
I've tried a bunch of formats, and in my experience EXL2 is the move for your dual RTX 3090 (2x 24GB) setup.
- EXL2 vs GGUF: EXL2 was way faster at inference for me, and I couldn't spot any real reasoning loss at 5.0 bpw.
- GGUF/AWQ: Easier to set up, but highkey slower on dual GPUs in my testing.
Best choice: 5.0 bpw EXL2 is the sweet spot for keeping DeepSeek's reasoning intact!! gl!
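If you want to sanity-check what fits in 48GB before downloading anything, the weight size is roughly params × bpw / 8 bytes. Quick sketch below. The 70B parameter count is just a placeholder I picked (not a specific DeepSeek checkpoint), and it ignores KV cache and activation overhead, so treat the "left over" number as optimistic:

```python
def weight_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    return n_params * bpw / 8 / 2**30


N_PARAMS = 70e9  # hypothetical parameter count, swap in your model's
VRAM_GIB = 48    # dual 3090s, as in this thread

for bpw in (4.0, 4.65, 5.0, 8.0):
    size = weight_gib(N_PARAMS, bpw)
    # Whatever is left has to hold the KV cache, activations, and runtime overhead.
    print(f"{bpw:>4} bpw -> {size:5.1f} GiB weights, {VRAM_GIB - size:6.1f} GiB left")
```

A negative "left" value means the weights alone won't fit, so you'd need a lower bpw or a smaller model.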
Yo, I went through this last month! Basically, quantization is like squishing a massive file: it saves space but can lose detail. For dual-GPU setups, you gotta watch out for GPU-to-GPU (P2P) transfer bottlenecks or your speed will be toast.
Just sharing my experience:
- I started with GGUF Q4_K_M in llama.cpp, but it was highkey sluggish on two cards.
- Then I tried DeepSeek-V3 in AWQ with vLLM. Speed was okay, but the setup was a total nightmare.
- I finally settled on EXL2 at 4.65 bpw. It's FANTASTIC for balancing reasoning quality and speed on dual RTX 3090 (24GB) cards.
Seriously, EXL2 is the winner for that 48GB VRAM limit. Good luck!!
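If you want a feel for the "squishing loses detail" part, here's a toy round-to-nearest quantizer on random fake weights. To be clear, this is NOT how EXL2, GGUF, or AWQ actually quantize (they use per-group scales and smarter calibration); it's only a sketch showing that fewer bits means a bigger round-trip error:

```python
import random


def quantize_roundtrip(values, bits):
    """Symmetric round-to-nearest to signed `bits`-bit ints, then back to floats."""
    qmax = 2 ** (bits - 1) - 1
    # One scale for the whole list; fall back to 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # fake "weights"

for bits in (8, 4):
    err = rms_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit RMS error: {err:.4f}")
```

The 4-bit error comes out clearly larger than the 8-bit one, which is the whole tradeoff in miniature. Whether that error actually hurts reasoning is something you can only tell by testing the real model.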
bump
Tbh if you look at where local LLMs are heading, people are shifting away from quants that just save space and towards compute-aware ones. Since you have 48GB, you're in a really sweet spot where you can move beyond the basic hobbyist formats. From a market research perspective, the industry is split between ease of use and raw enterprise performance.
Just catching up on this thread and honestly I totally agree with the point about shifting towards those professional standards. As someone who just started trying to do this myself over the last few days, I realized that having the hardware is only half the battle. I spent forever just trying to get the environment right because I really wanted to go the DIY route instead of just paying for a monthly professional service. Here is what I have learned while messing around with my current setup:
Exactly what I was thinking
Yep been there done that. Can confirm everything said above is spot on.
Saved for later, ty!
I have spent quite a long time testing various quantization setups for the larger DeepSeek weights, and unfortunately, the results have been pretty disappointing compared to the full-precision versions. When you have a decent amount of VRAM, you expect a certain level of fluidity that many of these methods just don't provide.
I have been obsessing over these exact metrics for way too long now and it is seriously draining. Like someone mentioned, the logic errors in long-form reasoning are the real killer. It is one thing to see a decent perplexity score on paper, but it is another thing entirely when the model starts hallucinating halfway through a Python script because of quantization error. My main frustrations and things you really need to be careful with: