
What is the best quantization format for DeepSeek performance?

6 Posts
7 Users
0 Reactions
251 Views
0
Topic starter

Hey everyone! I’ve been diving deep into the world of local LLMs lately, and I’m finally looking to set up one of the newer DeepSeek models on my rig. However, I’m getting a bit overwhelmed by the sheer number of quantization formats available. Between GGUF, EXL2, AWQ, and even the newer HQQ methods, it’s hard to tell which one actually offers the best performance-to-quality ratio for this specific architecture.

I’m currently running a setup with dual RTX 3090s, so I have a decent amount of VRAM to work with (48GB total), but I really want to maximize my tokens per second without completely sacrificing the model's reasoning capabilities. I've noticed that some GGUF quants feel a bit sluggish on my hardware during long context windows, while I've heard EXL2 is lightning-fast but can be trickier to find for certain DeepSeek versions.

Since DeepSeek-V3 and R1 have such massive parameter counts, I'm worried about the 'perplexity penalty' if I go too low. Does one format stand out as the 'gold standard' for keeping that high-level intelligence while maintaining smooth inference? I'm particularly curious if anyone has compared 4-bit EXL2 against 5-bit or 6-bit GGUF quants on similar hardware.

If you're prioritizing a balance of speed and logic, which quantization method and bit-rate would you recommend I start with?


6 Answers
12

For your situation, I'd suggest a 4.5-bit or 5-bit EXL2 quant if you can find the files; it's the best way to keep things fast on dual RTX 3090 24GB cards without giving up much of the model's reasoning. GGUF tends to be slower at long context, so for speed, EXL2 (or AWQ as a fallback) is the move.
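As a sanity check on what actually fits, here's a rough back-of-the-envelope sketch (plain arithmetic, not any loader's API) of the weight footprint at a given bits-per-weight. The 2 GB overhead figure is an assumed placeholder for loader/activation overhead, not a measured value:

```python
def model_vram_gb(params_b, bpw, overhead_gb=2.0):
    """Rough VRAM estimate: weights at `bpw` bits per weight, plus assumed loader overhead."""
    weights_gb = params_b * 1e9 * bpw / 8 / 1024**3
    return weights_gb + overhead_gb

# A 70B-class model (e.g. an R1 distill) at common EXL2 bitrates:
for bpw in (4.0, 4.5, 5.0):
    print(f"{bpw} bpw -> ~{model_vram_gb(70, bpw):.1f} GB")
# 4.0 bpw -> ~34.6 GB
# 4.5 bpw -> ~38.7 GB
# 5.0 bpw -> ~42.7 GB
```

So on paper even 5.0bpw of a 70B-class model squeezes into 48GB, but note this leaves little room for the KV cache at long context.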


10

yo! honestly i feel you on the confusion, it's basically a maze out here. i started with DeepSeek-V3 recently on my own setup and had some really weird results at first. i tried a 6-bit GGUF and it was painfully slow during long chats. not ideal when you're trying to actually get work done!!

For your dual NVIDIA GeForce RTX 3090 24GB rig, here is what i've found:

- EXL2 vs GGUF: i tried both and EXL2 is night and day for speed. GGUF is fine for compatibility, but the tokens per second on EXL2 (around 4.0bpw to 4.5bpw) feels so much smoother.
- The 'Logic' hit: i was worried about the perplexity penalty too, but 4.5bpw EXL2 still feels plenty smart. 4.0bpw is okay, but if you can squeeze 4.5 or 5.0 into that 48GB, do it.
- Setup: EXL2 can be a bit finicky to set up in Oobabooga Text Generation WebUI compared to just dropping a GGUF file in, but it's worth it.

so yeah, i'd say skip the 6-bit GGUF unless you don't mind waiting forever for replies. definitely try to find a 4.5bpw EXL2 quant first; it's basically the gold standard for the speed/logic balance imo. gl with the setup!! 👍
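If you'd rather compare formats with numbers than by feel, a minimal throughput harness looks like this. `generate_fn` here is a hypothetical stand-in for whatever your loader actually exposes (ExLlamaV2, llama.cpp bindings, etc.); the signature is an assumption for illustration:

```python
import time

def tokens_per_second(generate_fn, prompt, n_tokens):
    """Time a single generation call and return throughput in tokens/s."""
    start = time.perf_counter()
    generate_fn(prompt, n_tokens)  # assumed signature: (prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub generator standing in for a real loader, just to show the call shape:
def fake_generate(prompt, n_tokens):
    time.sleep(0.01)  # pretend to generate

print(f"{tokens_per_second(fake_generate, 'hello', 128):.0f} tok/s")
```

Run the same prompt and token count across your EXL2 and GGUF quants, and do one warm-up pass first so cache setup doesn't skew the first measurement.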


3

Curious about one thing: are your RTX 3090s linked via NVLink, or just running over the PCIe bus? Honestly, that makes a HUGE difference for context speeds on massive models like DeepSeek-R1.

In my experience, EXL2 is the gold standard for speed on dual 3090s, basically blowing GGUF out of the water once you hit those long context windows.

TL;DR: EXL2 at 4.0bpw is usually the sweet spot for speed vs logic, but I need to know your interconnect setup first!


3

Just saw this thread and tbh 48GB is a beastly setup, but DeepSeek R1 and V3 are a whole different level of massive. Quick question tho: are you trying to squeeze the full 671B parameter model onto those cards using extreme quantization, or are you looking at the distilled versions like DeepSeek-R1-Distill-Llama-70B? Reason I ask is that if you're aiming for the full model, you're gonna have to go so low on the BPW (like sub 2.0) that the intelligence basically evaporates. For a DIY rig like yours, the sweet spot is usually running a high-quality quant of a distilled model.

If you do go the EXL2 route, definitely look into 4-bit KV cache. It basically lets you double your context length without needing more VRAM.

I'd also suggest checking out Aphrodite-engine instead of the usual loaders. It uses PagedAttention, which is way better at handling those long context windows without the lag you're seeing in GGUF. It's a bit more setup work but totally worth it for the speed gains.
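To see why cache precision buys so much context, here's a rough cache-size estimate from model geometry. The layer/head numbers below are an assumed Llama-70B-style configuration (80 layers, 8 KV heads via GQA, head_dim 128), not DeepSeek's exact architecture:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bits=16):
    """KV cache holds one K and one V value per layer, KV head, and head dim, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bits / 8
    return total_bytes / 1024**3

# Assumed 70B-class geometry at 32k context:
for bits in (16, 8, 4):
    print(f"{bits}-bit cache -> ~{kv_cache_gb(80, 8, 128, 32768, bits):.1f} GB")
# 16-bit cache -> ~10.0 GB
# 8-bit cache -> ~5.0 GB
# 4-bit cache -> ~2.5 GB
```

Going from FP16 to a 4-bit cache cuts the footprint to a quarter, which is where that "more context for free" headroom comes from.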


2

Honestly, after digging through a bunch of benchmarks and testing different backends myself, there's a massive pitfall most people ignore when they're chasing high BPW numbers. It's not just about the model weights: with massive architectures like DeepSeek-V3/R1, the KV cache is a total VRAM killer! If you fill your 48GB with a high-bit EXL2 or GGUF, you'll hit a wall the second your context gets long, because there's basically zero room left for the cache to expand.

Here are a few things to watch out for from a stability perspective:

- **VRAM Fragmentation:** On dual 3090 setups, some loaders handle the split between cards poorly, leading to OOM crashes even when it looks like you have a couple GBs free.
- **Imbalanced Quants:** Be super careful with random GGUF uploads on Hugging Face. I've seen some "6-bit" quants that actually perform worse than 4-bit ones because the importance matrix (imatrix) wasn't calculated correctly for DeepSeek's specific MoE structure.
- **Loader Overhead:** Different engines (like vLLM vs. llama.cpp) have totally different memory footprints. A quant that works in one might crash another.

Basically, if you're prioritizing reliability, don't just grab the biggest file that fits. Leave yourself at least 6-10GB of headroom for the context window, or you'll get those sluggish responses no matter how fast the format is supposed to be. Safety first, or your logic is gonna suffer anyway when the cache starts swapping!
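That headroom advice turns into a quick budget check if you invert the weight-size arithmetic. All figures here are assumptions for illustration; adjust the headroom for your loader:

```python
def max_bpw(params_b, total_vram_gb=48.0, headroom_gb=8.0):
    """Largest bits-per-weight whose weights fit after reserving KV-cache headroom."""
    budget_bytes = (total_vram_gb - headroom_gb) * 1024**3
    return budget_bytes * 8 / (params_b * 1e9)

# With 8 GB reserved for cache/overhead on a 48 GB dual-3090 rig:
print(f"70B model: up to ~{max_bpw(70):.1f} bpw")  # ~4.9 bpw
```

So with sensible headroom reserved, a 70B-class model tops out just under 5bpw on 48GB, which lines up with the 4.5bpw EXL2 recommendations above.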


1

Same here!

