
What is the best quantization format for DeepSeek performance?

6 Posts
7 Users
0 Reactions
251 Views
0
Topic starter

Hey everyone! I’ve been diving deep into the world of local LLMs lately, and I’m finally looking to set up one of the newer DeepSeek models on my rig. However, I’m getting a bit overwhelmed by the sheer number of quantization formats available. Between GGUF, EXL2, AWQ, and even the newer HQQ methods, it’s hard to tell which one actually offers the best performance-to-quality ratio for this specific architecture.

I’m currently running a setup with dual RTX 3090s, so I have a decent amount of VRAM to work with (48GB total), but I really want to maximize my tokens per second without completely sacrificing the model's reasoning capabilities. I've noticed that some GGUF quants feel a bit sluggish on my hardware during long context windows, while I've heard EXL2 is lightning-fast but can be trickier to find for certain DeepSeek versions.

Since DeepSeek-V3 and R1 have such massive parameter counts, I'm worried about the 'perplexity penalty' if I go too low. Does one format stand out as the 'gold standard' for keeping that high-level intelligence while maintaining smooth inference? I'm particularly curious if anyone has compared 4-bit EXL2 against 5-bit or 6-bit GGUF quants on similar hardware.

If you're prioritizing a balance of speed and logic, which quantization method and bit-rate would you recommend I start with?


6 Answers
12

For your situation, I'd suggest a 4.5-bit or 5-bit EXL2 quant if you can find the files; it's the best way to keep things fast on dual RTX 3090 24GB cards without giving up much of the model's reasoning. GGUF tends to be slower at long context, so for speed, EXL2 (or AWQ as a fallback) is the move.
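As a sanity check on what actually fits, here's a rough back-of-the-envelope sketch (plain arithmetic, not any loader's API) of the weight footprint at a given bits-per-weight. The 2 GB overhead figure is an assumed placeholder for loader/activation overhead, not a measured value:

```python
def model_vram_gb(params_b, bpw, overhead_gb=2.0):
    """Rough VRAM estimate: weights at `bpw` bits per weight, plus assumed loader overhead."""
    weights_gb = params_b * 1e9 * bpw / 8 / 1024**3
    return weights_gb + overhead_gb

# A 70B-class model (e.g. an R1 distill) at common EXL2 bitrates:
for bpw in (4.0, 4.5, 5.0):
    print(f"{bpw} bpw -> ~{model_vram_gb(70, bpw):.1f} GB")
# 4.0 bpw -> ~34.6 GB
# 4.5 bpw -> ~38.7 GB
# 5.0 bpw -> ~42.7 GB
```

So on paper even 5.0bpw of a 70B-class model squeezes into 48GB, but note this leaves little room for the KV cache at long context.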


10

yo! honestly i feel you on the confusion, it's basically a maze out here. i started with DeepSeek-V3 recently on my own setup and had some really weird results at first. i tried a 6-bit GGUF and it was painfully slow during long chats. not ideal when you're trying to actually get work done!!

For your dual NVIDIA GeForce RTX 3090 24GB rig, here is what i've found:

- EXL2 vs GGUF: i tried both and EXL2 is night and day for speed. GGUF is fine for compatibility, but the tokens per second on EXL2 (around 4.0bpw to 4.5bpw) feels so much smoother.
- The 'Logic' hit: i was worried about the perplexity penalty too, but 4.5bpw EXL2 still feels plenty smart. 4.0bpw is okay, but if you can squeeze 4.5 or 5.0 into that 48GB, do it.
- Setup: EXL2 can be a bit finicky to set up in Oobabooga Text Generation WebUI compared to just dropping a GGUF file in, but it's worth it.

so yeah, i'd say skip the 6-bit GGUF unless you don't mind waiting forever for replies. definitely try to find a 4.5bpw EXL2 quant first; it's basically the gold standard for the speed/logic balance imo. gl with the setup!! 👍
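If you'd rather compare formats with numbers than by feel, a minimal throughput harness looks like this. `generate_fn` here is a hypothetical stand-in for whatever your loader actually exposes (ExLlamaV2, llama.cpp bindings, etc.); the signature is an assumption for illustration:

```python
import time

def tokens_per_second(generate_fn, prompt, n_tokens):
    """Time a single generation call and return throughput in tokens/s."""
    start = time.perf_counter()
    generate_fn(prompt, n_tokens)  # assumed signature: (prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub generator standing in for a real loader, just to show the call shape:
def fake_generate(prompt, n_tokens):
    time.sleep(0.01)  # pretend to generate

print(f"{tokens_per_second(fake_generate, 'hello', 128):.0f} tok/s")
```

Run the same prompt and token count across your EXL2 and GGUF quants, and do one warm-up pass first so cache setup doesn't skew the first measurement.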


3

Curious about one thing: are your RTX 3090s linked via NVLink, or just running over the PCIe bus? Honestly, that makes a HUGE difference for context speeds on massive models like DeepSeek-R1.

In my experience, EXL2 is the gold standard for speed on dual 3090s, basically blowing GGUF out of the water once you hit those long context windows.

TL;DR: EXL2 at 4.0bpw is usually the sweet spot for speed vs logic, but I need to know your interconnect setup first!


3

Just saw this thread and tbh 48GB is a beastly setup, but DeepSeek R1 and V3 are a whole different level of massive. Quick question tho: are you trying to squeeze the full 671B parameter model onto those cards using extreme quantization, or are you looking at the distilled versions like DeepSeek-R1-Distill-Llama-70B? Reason I ask is that if you're aiming for the full model, you're gonna have to go so low on the BPW (like sub 2.0) that the intelligence basically evaporates. For a DIY rig like yours, the sweet spot is usually running a high-quality quant of a distilled model.

If you do go the EXL2 route, definitely look into 4-bit KV cache. It basically lets you double your context length without needing more VRAM.

I'd also suggest checking out Aphrodite-engine instead of the usual loaders. It uses PagedAttention, which is way better at handling those long context windows without the lag you're seeing in GGUF. It's a bit more setup work but totally worth it for the speed gains.
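To see why cache precision buys so much context, here's a rough cache-size estimate from model geometry. The layer/head numbers below are an assumed Llama-70B-style configuration (80 layers, 8 KV heads via GQA, head_dim 128), not DeepSeek's exact architecture:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bits=16):
    """KV cache holds one K and one V value per layer, KV head, and head dim, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bits / 8
    return total_bytes / 1024**3

# Assumed 70B-class geometry at 32k context:
for bits in (16, 8, 4):
    print(f"{bits}-bit cache -> ~{kv_cache_gb(80, 8, 128, 32768, bits):.1f} GB")
# 16-bit cache -> ~10.0 GB
# 8-bit cache -> ~5.0 GB
# 4-bit cache -> ~2.5 GB
```

Going from FP16 to a 4-bit cache cuts the footprint to a quarter, which is where that "more context for free" headroom comes from.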


2

Honestly, after digging through a bunch of benchmarks and testing different backends myself, there's a massive pitfall most people ignore when they're chasing high BPW numbers. It's not just about the model weights: with massive architectures like DeepSeek-V3/R1, the KV cache is a total VRAM killer! If you fill your 48GB with a high-bit EXL2 or GGUF, you'll hit a wall the second your context gets long, because there's basically zero room left for the cache to expand.

Here are a few things to watch out for from a stability perspective:

- **VRAM Fragmentation:** On dual 3090 setups, some loaders handle the split between cards poorly, leading to OOM crashes even when it looks like you have a couple GBs free.
- **Imbalanced Quants:** Be super careful with random GGUF uploads on Hugging Face. I've seen some "6-bit" quants that actually perform worse than 4-bit ones because the importance matrix (imatrix) wasn't calculated correctly for DeepSeek's specific MoE structure.
- **Loader Overhead:** Different engines (like vLLM vs. llama.cpp) have totally different memory footprints. A quant that works in one might crash another.

Basically, if you're prioritizing reliability, don't just grab the biggest file that fits. Leave yourself at least 6-10GB of headroom for the context window, or you'll get those sluggish responses no matter how fast the format is supposed to be. Safety first, or your logic is gonna suffer anyway when the cache starts swapping!
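That headroom advice turns into a quick budget check if you invert the weight-size arithmetic. All figures here are assumptions for illustration; adjust the headroom for your loader:

```python
def max_bpw(params_b, total_vram_gb=48.0, headroom_gb=8.0):
    """Largest bits-per-weight whose weights fit after reserving KV-cache headroom."""
    budget_bytes = (total_vram_gb - headroom_gb) * 1024**3
    return budget_bytes * 8 / (params_b * 1e9)

# With 8 GB reserved for cache/overhead on a 48 GB dual-3090 rig:
print(f"70B model: up to ~{max_bpw(70):.1f} bpw")  # ~4.9 bpw
```

So with sensible headroom reserved, a 70B-class model tops out just under 5bpw on 48GB, which lines up with the 4.5bpw EXL2 recommendations above.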


1

Same here!

