What is the best GPU for running DeepSeek-V3 locally?

Question

Hey everyone! I’ve been following the buzz around DeepSeek-V3 lately, and I’m really impressed with its performance benchmarks. I’m dying to get it running locally on my own machine to keep my workflows private and avoid those monthly API costs, but I’m hitting a bit of a wall when it comes to the hardware requirements.

DeepSeek-V3 is a massive 671B parameter Mixture-of-Experts (MoE) model. Even though only about 37B parameters are active at once, the sheer size of the model weights is intimidating. I currently have an older RTX 3080 with only 10GB of VRAM, which I know is way underpowered for something this heavy. I’ve been looking into upgrading to an RTX 4090, but I’m worried that even 24GB won't be enough without extreme 4-bit quantization or maybe a dual-GPU setup using something like NVLink. I’ve also seen some folks suggest that a Mac Studio with high unified memory might actually be more cost-effective for these huge models.

I’m honestly a little overwhelmed by the options and don't want to waste money on a card that can't handle the load. For those of you who have actually managed to get V3 running locally, what’s your setup like? What is the best GPU (or combination of GPUs) you'd recommend for a relatively smooth experience without needing an enterprise-grade A100 budget?

BlackpoolTowerView · Accepted Answer

Quick reply while I have a sec, just sharing my experience: I went through this exact same spiral a few months back. Honestly, DeepSeek-V3 is a different beast entirely. I started with a single NVIDIA GeForce RTX 3090 24GB and quickly realized that for a 671B model, even with 4-bit quantization, you're looking at needing roughly 350GB to 400GB of VRAM to fit the whole thing (the weights alone are about 671B × 0.5 bytes ≈ 335GB, before KV cache and runtime overhead). It's actually insane.

I tried a dual setup with two NVIDIA GeForce RTX 4090 24GB cards, but I was still stuck offloading most of the weights to system RAM. The speed was a total crawl... I was getting maybe 0.5 tokens per second. Basically unusable for real work. Even though only 37B parameters are "active," you still have to store all 671B in memory, or the latency of swapping experts will kill your performance.

I eventually looked into the Apple Mac Studio M2 Ultra with 192GB Unified Memory. It's probably the most "practical" way to get high memory, but even then, 192GB is only enough for a heavily compressed 1.5-bit or 2-bit quant of V3. To run a decent 4-bit version, you'd literally need a rig with like 16 NVIDIA GeForce RTX 3090 24GB cards or a specialized server.

So yeah, I would suggest being really careful. Definitely don't expect a single 4090 to handle this beast on its own. It's a massive jump from running 70B models!! gl!
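If you want to sanity-check those memory numbers yourself, here's a rough back-of-the-envelope sketch. It only counts the weights, and the 10% overhead factor for KV cache and runtime buffers is my own assumption, not a measured value. The function names are made up for illustration, and real quant formats mix bit widths, so treat the output as a ballpark.

```python
import math

TOTAL_PARAMS = 671e9  # total parameter count of DeepSeek-V3, as quoted in the question


def weight_gb(bits_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def cards_needed(bits_per_param: float, vram_gb: float = 24.0,
                 overhead: float = 1.1) -> int:
    """Number of cards needed to hold the weights entirely in VRAM.

    `overhead` is an assumed fudge factor (10%) for KV cache and buffers.
    """
    return math.ceil(weight_gb(bits_per_param) * overhead / vram_gb)


# Print a small table for a few common quantization levels.
for bits in (16, 8, 4, 2, 1.5):
    print(f"{bits:>4} bits: ~{weight_gb(bits):7.1f} GB of weights, "
          f"{cards_needed(bits):3d} x 24GB cards")
```

With these assumptions, 4-bit comes out to about 335GB of weights and 16 cards at 24GB each, while a 2-bit quant lands under 192GB, which lines up with the Mac Studio point above.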

wpgdxuvszk · Answer

tbh if you're trying to be cost-effective, you're basically fighting against the memory bus limitations on consumer cards. V3 needs so much VRAM capacity that even with a 4-bit quantization setup, you're looking at massive memory bandwidth bottlenecks. I was actually looking at the PCIe lane distribution on some older server boards last night to see if I could optimize the data transfer rates. It kinda reminds me of this time I tried to DIY a custom water cooling loop for my entire home office. I spent like three hundred bucks on copper piping from the hardware store and ended up flooding my basement because I didn't account for the pressure drop over the long distance. My cat was so confused why there was a literal river under the desk lol. I spent the whole night shop-vaccing the carpet instead of finishing my build. Anyway lol sorry, kinda went off topic there.

ttdflrjefj · Answer

Did this last week, worked perfectly