
What is the best quantization method for DeepSeek models?

0
Topic starter

Hey everyone! I’m finally diving into local hosting for the DeepSeek models, but I’m a bit torn on which quantization method to go with. I’m running a setup with dual 3090s (48GB VRAM total), so I have some room to play with, but these models are still massive. I’ve been looking at GGUF for ease of use, but I’ve heard EXL2 or AWQ might offer better inference speeds. I really don’t want to sacrifice too much of that impressive reasoning logic just to save space. Has anyone compared 4-bit vs 8-bit perplexity on these specifically? What’s your recommended 'sweet spot' for balancing speed and output quality for DeepSeek right now?
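For anyone who wants to run the 4-bit vs 8-bit comparison themselves: perplexity is the exponential of the average negative log-likelihood per token, so it can be computed from per-token log-probabilities on a fixed evaluation text. Below is a minimal sketch, assuming you have already collected natural-log token probabilities for the same text from each quant. The function names and the sample numbers are illustrative placeholders, not measurements.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_ppl_increase(ppl_quant, ppl_ref):
    """Fractional perplexity increase of a quant over a reference run."""
    return (ppl_quant - ppl_ref) / ppl_ref


if __name__ == "__main__":
    # Placeholder log-probs; in practice these come from your backend,
    # scored on the SAME tokens for both quants.
    ref_logprobs = [-1.2, -0.4, -2.1, -0.9]
    q4_logprobs = [-1.3, -0.5, -2.3, -1.0]
    ppl_ref = perplexity(ref_logprobs)
    ppl_q4 = perplexity(q4_logprobs)
    print(f"reference ppl: {ppl_ref:.3f}")
    print(f"4-bit ppl:     {ppl_q4:.3f}")
    print(f"relative increase: {relative_ppl_increase(ppl_q4, ppl_ref):.1%}")
```

Perplexity only compares cleanly when both runs use the same tokenizer, text, and context window, and a small perplexity gap does not guarantee equal quality on long reasoning chains.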


10 Answers
11

In my experience, having tried many formats, EXL2 is the move for your dual NVIDIA GeForce RTX 3090 24GB setup.
- EXL2 vs GGUF: EXL2 has way faster inference and basically zero logic loss at 5.0 bpw.
- GGUF/AWQ: Easier to set up but highkey slower on dual GPUs.
Best choice: 5.0 bpw EXL2 is the sweet spot for DeepSeek logic! Good luck!


11

Yo, I went through this last month! Basically, quantization is like squishing a massive file—it saves space but can lose detail. For dual-GPU setups, you gotta watch out for P2P bottlenecks or your speed will be toast.

Just sharing my experience:
- I started with GGUF Q4_K_M in llama.cpp, but it was highkey sluggish on two cards.
- Then I tried DeepSeek-V3 in AWQ with vLLM—speed was okay, but the setup was a total nightmare.
- I finally settled on EXL2 at 4.65 bpw. It's FANTASTIC for balancing reasoning logic and speed on dual NVIDIA GeForce RTX 3090 24GB cards.

Seriously, EXL2 is the winner for that 48GB VRAM limit. Good luck!!
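To sanity-check whether a given bpw fits in 48GB, the weight footprint is roughly parameters × bits per weight ÷ 8. Here is a rough sketch of that arithmetic. The 70B parameter count and the 6 GiB headroom for KV cache and activations are illustrative assumptions, not a claim about which DeepSeek checkpoint anyone in this thread ran, and real files carry some extra overhead.

```python
def weight_gib(n_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(n_params_billion, bits_per_weight, vram_gib=48.0, headroom_gib=6.0):
    """True if the weights fit while leaving headroom (an assumed figure)
    for KV cache, activations, and framework overhead."""
    return weight_gib(n_params_billion, bits_per_weight) <= vram_gib - headroom_gib


if __name__ == "__main__":
    # Hypothetical 70B-parameter model at the bit rates mentioned in the thread.
    for bpw in (4.0, 4.65, 5.0, 8.0):
        size = weight_gib(70, bpw)
        verdict = "fits" if fits(70, bpw) else "does not fit"
        print(f"{bpw:>4} bpw: {size:5.1f} GiB of weights, {verdict} in 48 GiB")
```

The headroom you actually need grows with context length, so treat the result as a first filter rather than a guarantee.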


4

bump


3

Tbh if you look at the current market trends for local LLMs, everyone is shifting away from just saving space and focusing on compute-aware quants. Since you have 48GB, you are in a really sweet spot where you can move beyond the basic hobbyist formats. From a market research perspective, the industry is split between ease-of-use and raw enterprise performance.

  • vLLM FP8 is basically the professional standard right now. Most companies deploying DeepSeek models at scale use FP8 because it keeps the reasoning logic almost identical to the base model while being way faster on GPUs with native FP8 support (the Ampere-based 3090 has no FP8 tensor cores, so check how your backend handles FP8 on it first). It is definitely the most stable choice for high-end logic.
  • Unsloth is the big disruptor for the local community. Their dynamic quantization methods are really gaining ground because they focus on keeping the model smart rather than just small. It is a fantastic middle ground for dual 3090 setups if you want to balance VRAM and smarts.
  • BitNet b1.58 is the research-grade future everyone is talking about, though it is still early days for wide DeepSeek support. It represents a massive shift in how the market views efficiency vs performance.

Honestly, don't sleep on FP8! If you can fit your chosen DeepSeek version in that format, the logic retention is night and day compared to standard 4-bit methods. You spent the money on dual 3090s, so you might as well use the format that doesn't lobotomize the reasoning. Which one did you end up going with for your first test run?


3

Just catching up on this thread and honestly I totally agree with the point about shifting towards those professional standards. As someone who just started trying to do this myself over the last few days, I realized that having the hardware is only half the battle. I spent forever just trying to get the environment right because I really wanted to go the DIY route instead of just paying for a monthly professional service. Here is what I have learned while messing around with my current setup:

  • Doing it yourself is way harder than it looks in the tutorials because the software side is constantly changing.
  • I noticed that if I did not get the memory settings exactly right the whole thing would just crash or go super slow.
  • The output quality really depends on the little tweaks you make in the config files that nobody tells you about.

Does anyone else feel like they spend way more time troubleshooting the setup than actually using the model for work? It is a bit of a learning curve for sure but it is pretty cool once it finally works.


3

Exactly what I was thinking


3

Yep been there done that. Can confirm everything said above is spot on.


1

Saved for later, ty!


1

I have spent quite a long time testing various quantization setups for the larger DeepSeek weights, and unfortunately, the results have been pretty disappointing compared to the full precision versions. When you have a decent amount of VRAM, you expect a certain level of fluidity that many of these methods just don't provide.

  • I tried the standard 4-bit approach initially, but the logic errors in long-form reasoning were just too frequent for my professional work.
  • My experience with the common formats on my dual NVIDIA GeForce RTX 3090 24GB setup was lackluster due to the overhead; it felt like I was wasting the potential of the hardware.
  • Even the higher bit-rate methods I tested didn't quite hit that perfect balance of speed and intelligence I was looking for.

Basically, I moved away from the standard hobbyist formats because they just couldn't maintain the model's integrity during complex tasks. It is a bit of a letdown when you realize that even with high-end hardware, you are still making massive compromises on the reasoning quality...


1

I have been obsessing over these exact metrics for way too long now and it is seriously draining. Like someone mentioned, the logic errors in long-form reasoning are the real killer. It is one thing to see a decent perplexity score on paper, but it is another thing entirely when the model starts hallucinating halfway through a Python script because of some bit-depth compression issues. My main frustrations and things you really need to be careful with:

  • The KV cache pressure on dual 3090s is way more volatile than people admit, especially when you push the context length.
  • Low-bit quants often have these hidden cliff edges where performance doesn't just degrade slowly; it completely falls apart.
  • Many of the standard benchmarks we use for quantization just don't capture the subtle logic shifts in DeepSeek models specifically.

Honestly, it is just discouraging how much time we have to spend debugging the formats instead of actually using the models. The lack of consistency across different backends is driving me crazy lately.
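On the KV cache point, the growth with context length is easy to estimate up front. Below is a rough sketch for a standard multi-head or grouped-query attention model, where the cache holds one key and one value vector per layer, per KV head, per token. The layer and head counts are hypothetical placeholders. DeepSeek-V2/V3-style models use Multi-head Latent Attention, which compresses the cache, so this formula would overestimate for those.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len,
                 bytes_per_elem=2, batch=1):
    """Approximate KV cache size in GiB for standard MHA/GQA attention.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an FP16 cache; use 1 for an 8-bit cache.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 2**30


if __name__ == "__main__":
    # Hypothetical model shape: 80 layers, 8 KV heads, head_dim 128.
    for ctx in (8192, 32768, 65536):
        print(f"context {ctx:>6}: {kv_cache_gib(80, 8, 128, ctx):.1f} GiB")
```

Because the cache scales linearly with context length, doubling the context doubles this number, which is why a quant that loads fine can still run out of memory once the prompt gets long.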

