What is the best GPU for running DeepSeek-V3 locally?


I’ve been hearing a lot about DeepSeek-V3 lately, and the benchmarks for its coding and reasoning capabilities look absolutely incredible. I really want to move away from using the API and try running it locally for privacy and latency reasons, but I’m a bit intimidated by the hardware requirements. Since it’s a Mixture-of-Experts (MoE) model with a massive 671B parameters, I know I can't just run this on a standard gaming setup.

I’m currently planning a build specifically for this model and I'm torn on the GPU path. I’ve been eyeing the RTX 4090 because of the 24GB VRAM, but I’m worried that even with a heavy 4-bit quantization, I might need a multi-GPU setup (like two or three 3090s/4090s) just to fit the weights. I’ve also seen some discussion about using FP8 precision to save on memory, but I’m not sure which consumer cards handle that most efficiently for this specific architecture.

My goal is to get a usable tokens-per-second rate without having to sell a kidney for an enterprise H100. For those of you who have experimented with DeepSeek-V3, what is the best GPU (or combination of GPUs) to get it running smoothly, and what kind of VRAM overhead am I realistically looking at for a quantized version?


7 Answers
10

So basically, running 671B is a total headache lol.
- 4090 vs 3090: the 4090 is faster, but the NVIDIA GeForce RTX 3090 24GB is way cheaper for the same amount of VRAM.
- Best choice: grab 3-4 used 3090s for ~$750 each.
Honestly tho, ur still gonna need massive system RAM offloading for a 671B model unless ur buying like 15 GPUs. Just make sure ur PSU can handle the power draw... it's a lot!! gl!
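To put a rough number on what offloading to system RAM costs, here is a back-of-the-envelope sketch. It assumes decoding is limited purely by how fast the weights used for each token can be read from memory, and that about 37B parameters are activated per token (that figure is not from this thread, and the function name and the 80 GB/s system-RAM value are placeholders). Treat the results as optimistic ceilings, not benchmarks.

```python
# Rough decode-speed ceiling: memory bandwidth / bytes of weights read per token.
# ASSUMPTION: ~37B parameters are activated per token (MoE); adjust if yours differs.
ACTIVE_PARAMS = 37e9


def decode_ceiling_tok_s(bandwidth_gb_s, bits_per_weight, active_params=ACTIVE_PARAMS):
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Bandwidth figures: 936 GB/s is the RTX 3090 spec-sheet number;
# 80 GB/s for system RAM is a hypothetical desktop value.
tiers = {
    "weights in GPU VRAM (936 GB/s)": 936,
    "weights in system RAM (80 GB/s, assumed)": 80,
}

for name, bandwidth in tiers.items():
    ceiling = decode_ceiling_tok_s(bandwidth, bits_per_weight=4)
    print(f"{name}: at most ~{ceiling:.1f} tok/s at 4-bit")
```

The gap between the two tiers is the point: whatever share of the weights lives in system RAM is read at system-RAM speed, so the slowest tier tends to set the pace.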


10

Ok so, jumping in here... I would suggest being reallyyy careful because DeepSeek-V3 is a literal beast. Honestly, even with a heavy 4-bit quant, ur looking at like 350GB+ of VRAM. To run it on consumer gear, you'll probably have to drop to 1.5 or 2-bit quants just to fit it into 140-160GB.
- NVIDIA GeForce RTX 3090 24GB: This is the budget king for VRAM. You can chain a few together, but be careful with the power draw. Pros: Cheapest way to get 24GB. Cons: Runs hot and lacks the newer FP8 cores.
- NVIDIA GeForce RTX 4090 24GB: This is way better for V3 specifically because it handles FP8 precision natively. It's much faster, but honestly, the price tag for a multi-GPU setup is scary... you'd need like seven of them just for the 140-160GB low-bit quants, and more like fifteen for 4-bit.
I guess you might want to consider the Apple Mac Studio M2 Ultra 192GB RAM too? It's slower in tokens-per-second, but the unified memory makes it way easier to fit the low-bit weights without a complex multi-GPU rig. Just make sure to check your cooling!! gl!
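The sizes quoted above can be sanity-checked with simple arithmetic. The sketch below counts weights only (parameters times bits per weight) and assumes about 90% of each 24GB card is usable for weights. That fraction and the function names are illustrative assumptions. Real quant files also carry scales and metadata, and you need room for the KV cache, so actual requirements land somewhat higher.

```python
import math

TOTAL_PARAMS = 671e9  # total parameter count quoted in the thread
CARD_VRAM_GB = 24     # RTX 3090 / RTX 4090


def weight_footprint_gb(bits_per_weight, params=TOTAL_PARAMS):
    """Weights-only size in GB (1 GB = 1e9 bytes); no KV cache or activations."""
    return params * bits_per_weight / 8 / 1e9


def cards_needed(footprint_gb, card_gb=CARD_VRAM_GB, usable_fraction=0.9):
    """Cards required if only `usable_fraction` of each card holds weights (assumed)."""
    return math.ceil(footprint_gb / (card_gb * usable_fraction))


for bits in (8, 4, 2, 1.58):
    size = weight_footprint_gb(bits)
    print(f"{bits:>5} bits/weight: {size:6.1f} GB of weights, "
          f"about {cards_needed(size)} x {CARD_VRAM_GB}GB cards")
```

Under these assumptions, the low-bit quants need a single-digit number of 24GB cards, while 4-bit needs roughly twice as many, which is why fully GPU-resident 4-bit is out of reach for most consumer builds.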


3

Solid advice 👍


3

Saving this thread


2

Seconding the recommendation above. Honestly, I've had issues scaling consumer cards for models this big. I tried building a rig with NVIDIA GeForce RTX 3090 24GB GPUs and it's basically the only cost-effective way, but unfortunately, it's a mess to cool. Unless you're okay with snail speeds, you'll need a stack of them or of NVIDIA GeForce RTX 4090 24GBs. Just don't expect 4-bit to fit easily without a massive budget.


2

From a market analysis perspective, NVIDIA has basically gated VRAM to keep a clear line between consumer 'gaming' cards and professional 'AI' hardware. For a massive MoE like DeepSeek-V3, you're hitting the PCIe bus bottleneck on consumer boards, regardless of how many 4090s you chain together. If you want to actually utilize FP8 precision properly, you need cards with FP8-capable Tensor Cores (Ada or Hopper generation), which rules out the 3090.
* NVIDIA RTX 6000 Ada Generation: This is the pro-tier choice. You get 48GB of ECC VRAM and much better thermal design for dense mounting.
* NVIDIA L40S: If you aren't stuck on a desktop form factor, these are absolute units for inference and handle MoE weight switching way more efficiently than consumer silicon.
* AMD Radeon Pro W7900: It's the only high-VRAM alternative (48GB), but honestly, the software stack for DeepSeek's specific kernels is almost always NVIDIA-first.
Tbh, you have to weigh the 'CUDA tax' against the headache of managing a dozen consumer cards. The interconnect speed is what usually kills the tokens-per-second on these giant MoE models, you know?


2

Following this thread

