What is the best GPU for running DeepSeek-V3 locally?

Hey everyone! I’ve been closely following the benchmarks for DeepSeek-V3 lately, and frankly, I’m blown away by its coding and reasoning capabilities. It seems to be punching way above its weight class compared to other open-weight models, and I really want to get it running locally for some private development work. However, I’m hitting a bit of a wall when it comes to the hardware requirements and could use some expert advice.

Right now, my setup is just a single RTX 4090 with 24GB of VRAM. It’s been great for Llama 3 and smaller DeepSeek variants, but V3 is a total beast with 671B parameters. Even though it's a Mixture-of-Experts (MoE) model where only about 37B parameters are active at a time, the entire model still needs to reside in memory. From what I’ve read, even with heavy quantization (like 4-bit or even 3-bit GGUF), the memory footprint is absolutely massive: we're likely talking 350GB to 400GB+ at 4-bit just to fit the weights, and still well over 250GB at 3-bit.
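
For a rough sanity check on those numbers, here is a minimal Python sketch of the weight-size arithmetic. It assumes the footprint is simply parameter count times bits per weight; the function name is made up for illustration, and it ignores KV cache, activations, and runtime overhead, so treat the results as a lower bound.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB.

    Assumption: every parameter is stored at the same bit width.
    """
    # (params_billion * 1e9 params) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # 671B total parameters, as quoted in the post above.
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: ~{weight_footprint_gb(671, bits):.0f} GB")
```

Real quantized files tend to average somewhat more bits per weight than the nominal label, and the KV cache comes on top, which is one reason a nominal 4-bit estimate of about 335GB ends up in the 350GB+ range.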

I’m trying to figure out the most realistic way to handle this without spending $30,000 on an enterprise-grade H100. I’ve seen some hobbyists talking about building multi-GPU rigs using six or eight used RTX 3090s because of the 24GB VRAM and relatively lower cost on the second-hand market. But I’m worried about the logistics: the power draw would be insane, and I’m not sure if my home's circuit could even handle it! Alternatively, is a Mac Studio with an M2/M3 Ultra and 192GB of unified memory a better path, or would the inference speed be too slow for a model of this size?

I’m really looking for that "sweet spot" between performance and price. Is it better to go with a multi-GPU Linux build, or is there a specific workstation-grade card like the RTX 6000 Ada that makes more sense for this specific architecture?

For those of you who have actually managed to get DeepSeek-V3 running with decent tokens-per-second, what specific GPU configuration are you using, and what’s the minimum VRAM you’d recommend to make it usable?


7 Answers
12

- NVIDIA GeForce RTX 3090 24GB stack: Best VRAM/price ratio.
- NVIDIA A100 Tensor Core GPU 80GB: Elite performance, insane cost.

Honestly, 200GB+ VRAM is basically mandatory for V3. gl!


10

In my experience, trying to run a 671B model like DeepSeek-V3 is basically like trying to fit a whale in a bathtub if you only have one 4090! Even with the MoE architecture, the weights are just too massive for a single consumer card. Since the whole model has to stay in memory to avoid a massive performance hit, you’re looking at a huge VRAM wall. Honestly, it's a bit of a nightmare but sooo rewarding once it works. Here’s what I recommend for that sweet spot:

1. **The Multi-GPU Route:** The most cost-effective way is definitely snagging used NVIDIA GeForce RTX 3090 24GB cards. You’d need about 16 of them for a 4-bit quant, which is wild for a home setup. Most hobbyists run a 1.5-bit or 2-bit quantization on a rig with 8x NVIDIA GeForce RTX 3090 24GB. It’s a power hog, but it’s much faster than the other options here.
2. **The Mac Path:** If you don't want your room to turn into a literal sauna, an Apple Mac Studio M2 Ultra with 192GB Unified Memory is probably the "sane" choice. It’s way slower than a multi-GPU Linux beast, but it’s basically plug-and-play. Keep in mind that 192GB only leaves room for the same 1.5-bit or 2-bit quants.
3. **The Pro Option:** If you have the budget, look into the NVIDIA RTX 6000 Ada Generation 48GB. You’d still need a cluster, but the VRAM density is amazing.

Tbh, unless you have a dedicated 20A circuit, 8 GPUs will trip your breakers! I'd personally go with the Mac if you value your sanity. gl! 👍
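
The card counts quoted above (about 16 cards at 4-bit, 8 cards at 2-bit) can be reproduced with a quick back-of-the-envelope sketch. The `cards_needed` helper and the 10% `overhead` factor are assumptions for illustration, not measured values; actual headroom for KV cache and buffers depends on context length and the inference runtime.

```python
import math


def cards_needed(params_billion: float, bits_per_weight: float,
                 vram_gb_per_card: float, overhead: float = 1.1) -> int:
    """Rough number of GPUs needed to hold the weights entirely in VRAM.

    overhead is an assumed multiplier (10% here) for KV cache and buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return math.ceil(weights_gb * overhead / vram_gb_per_card)


if __name__ == "__main__":
    print("24GB cards, 4-bit:", cards_needed(671, 4, 24))
    print("24GB cards, 2-bit:", cards_needed(671, 2, 24))
    print("48GB cards, 4-bit:", cards_needed(671, 4, 48))
```

Under these assumptions, doubling the VRAM per card (24GB to 48GB) halves the card count at the same quantization, which is the "VRAM density" argument for the workstation cards.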


3

> I’m trying to figure out the most realistic way to handle this without spending $30,000 on an enterprise-grade H100.

Stumbled on this while looking for VRAM tips. Honestly, I am super happy with my setup using a few NVIDIA RTX A6000 48GB cards. They are basically the sweet spot for reliability because they are designed for workstations, so they don't overheat as easily as 3090s packed together. You definitely need a motherboard that can handle the PCIe lanes, like the ASUS Pro WS WRX90E-SAGE SE WIFI. It is not cheap, but everything just works without crashing. I am running a quantized version of V3, and while it's not lightning fast, it's totally usable for dev work. The power draw is high but it hasn't tripped my breakers yet... just make sure you have a beefy PSU like the EVGA SuperNOVA 1600 P2, or even two if you go past 4 cards.


3

^ This. Also, be super careful with those multi-GPU DIY builds. I tried cramming a bunch of high-wattage cards into a rig last year and almost started a fire because I didn't realize my old apartment's wiring couldn't handle the sustained draw... totally terrifying. It really makes you appreciate the stability of workstation gear even if it costs more upfront.

The thread mentions 3090 stacks and the Mac path, which are the usual suspects, but if you want something more reliable for a 24/7 dev box, I would suggest looking into the NVIDIA A40 48GB on the used market. They are basically the data center version of the A6000 but sometimes cheaper since they don't have display outputs. To fit a 671B model like DeepSeek-V3, you are still looking at a massive cluster of them tho.

Quick tips:

  • Always check if your PSU has enough individual PCIe cables; don't use splitters for high-end cards.
  • If you go the multi-GPU route, make sure you have a case with serious airflow, like the Fractal Design Meshify 2 XL.

Basically, the VRAM wall is real and V3 is a monster. Just don't blow your circuit breaker trying to save a few bucks on a DIY build. It happens way faster than you think when those fans start ramping up.
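
To put rough numbers on the breaker worry, here is a small sketch comparing a rig's estimated draw with what a household circuit can sustain. Everything in it is an assumption to adjust for your own setup: a 120V North American circuit, the common 80% rule of thumb for continuous loads, 350W per GPU (the RTX 3090's rated board power), and a guessed 300W for the rest of the system. The function names are hypothetical.

```python
def continuous_limit_watts(breaker_amps: float, volts: float = 120.0,
                           derate: float = 0.8) -> float:
    """Sustained load a circuit can carry, assuming an 80% continuous derating."""
    return breaker_amps * volts * derate


def rig_draw_watts(n_gpus: int, gpu_watts: float = 350.0,
                   rest_of_system_watts: float = 300.0) -> float:
    """Estimated full-load draw: GPUs plus an assumed figure for CPU, fans, drives."""
    return n_gpus * gpu_watts + rest_of_system_watts


def fits_on_circuit(n_gpus: int, breaker_amps: float) -> bool:
    """True if the estimated draw stays within the circuit's continuous limit."""
    return rig_draw_watts(n_gpus) <= continuous_limit_watts(breaker_amps)


if __name__ == "__main__":
    for n in (4, 8):
        print(f"{n} GPUs: ~{rig_draw_watts(n):.0f} W, "
              f"fits on 20A circuit: {fits_on_circuit(n, 20)}")
```

With these assumptions, an 8-card rig at full load lands well above what a single 20A circuit can sustain, while a 4-card rig fits. Power-limiting the cards or splitting the rig across circuits changes the picture, so measure at the wall before trusting any estimate.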


3

Totally agree with Fallly on the power draw issue, actually had my own breaker trip three times in one night back when I first started building these heater-rigs... not fun when you're in the middle of a long run. After owning multi-card setups for years, the heat is what really gets you in the summer. Honestly tho, instead of us all guessing here, you should really just check out some of the deep dives already done:

  • The r/LocalLLaMA megathread for model requirements
  • Hardware YouTubers who already benchmarked this exact model

Ngl there is a massive thread on Reddit that covers the exact VRAM splits for MoE models like this one. Just google "deepseek v3 local hardware reddit" and it is like the first result that pops up. They have all the charts and everything.


2

Yep, this is the way


2

Big if true

