Hey guys! I have been following the buzz around DeepSeek-V3 lately and the benchmarks are honestly mind-blowing. I really want to move away from API dependencies and get this running locally on my own rig, but I am hitting a wall when it comes to the hardware specs.
Since DeepSeek-V3 is a massive Mixture-of-Experts model with about 671B total parameters, the VRAM requirement is the biggest hurdle. Even with aggressive 4-bit quantization, it seems like a single consumer card just won't cut it. I have been looking at dual or quad RTX 3090 or 4090 setups to hit the 48GB or 96GB mark, but I am worried about the interconnect speeds and whether inference will be painfully slow.
Has anyone actually managed to get this running with decent tokens per second? I am trying to figure out if I should stick with NVIDIA consumer cards or if I need to start looking at something like a Mac Studio with a lot of unified memory. I am mostly using it for complex coding tasks and really need it to be relatively responsive.
If you were building a workstation specifically for DeepSeek-V3 today, what GPU configuration would you pick to get the best balance of performance and cost?
Just catching up on this thread, and honestly, running the full 671B DeepSeek-V3 is a monster of a task. Even though MoE only activates a few experts per token, the whole weight set has to sit in memory for it to function. For a 4-bit quantization, you're looking at somewhere around 400GB to 450GB of VRAM. That is about 18 NVIDIA GeForce RTX 3090 24GB cards. Pretty insane for a home rig, right? If you are building a workstation today and want the best bang for your buck, I've found that stacking used cards is the way to go. You can often find an RTX 3090 for a fraction of the price of a 4090, and they have the same 24GB of VRAM. But even with four of them, you're only at 96GB. Check out Ollama or LM Studio to see how they handle offloading layers to system RAM. Pro tip: you might find it way more practical to run DeepSeek-R1-Distill-Qwen-32B instead (the distilled models were released with R1, not V3). It is much more responsive for coding tasks on a consumer setup. If you are dead set on the massive version, you would probably need an Apple Mac Studio M2 Ultra with 192GB of RAM, and even then you would have to use a super aggressive 1.5-bit or 2-bit quantization, which kinda kills the logic sometimes, though it technically runs. TL;DR: The full 671B model is way too big for consumer GPUs. Best value is 2x or 4x RTX 3090 24GB running the distilled versions. Good luck!
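For anyone who wants to sanity-check the numbers above, here is a rough back-of-the-envelope sketch. It only counts the weights, so it ignores KV cache, activations, and runtime overhead. The 4.8 bits-per-weight figure is my own assumption for what a typical "4-bit" format averages out to once scales and higher-precision layers are included; it is not a measured value for any specific quant file.

```python
import math

TOTAL_PARAMS = 671e9  # total parameter count quoted in this thread


def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def cards_needed(required_gb: float, card_gb: float) -> int:
    """Smallest number of cards whose combined VRAM covers required_gb."""
    return math.ceil(required_gb / card_gb)


if __name__ == "__main__":
    # 4.8 is an assumed effective rate for a "4-bit" quant (hypothetical value).
    for bits in (4.8, 4.0, 2.0, 1.5):
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"{bits} bits/weight: ~{gb:.0f} GB of weights, "
              f"{cards_needed(gb, 24)} x 24GB cards minimum")
```

A pure 4.0 bits per weight comes out to roughly 335GB, so the 400GB to 450GB range only makes sense once you add the format overhead and some headroom for context. Treat the card counts as a floor, not a build plan.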
Basically, MoE models only use a few experts at a time, but you still have to fit the whole model in memory for it to even load. This matters because VRAM capacity is usually the biggest wall for home users. Just sharing my experience: I went through this last year building a budget multi-GPU rig, and I'm happy with how it works lol.
Look, running V3 at 4-bit is a beastly requirement. You basically need 400GB of VRAM, which is impossible on a standard consumer desktop without a mess of riser cables and three power supplies. Since you want it responsive for coding, you have to prioritize VRAM density over raw clock speed. If you want to stay somewhat cost-conscious but actually get it running, check out the used enterprise market. The NVIDIA RTX A6000 48GB (the older Ampere version) is a solid pick because you get double the VRAM of a 3090 on a single dual-slot card. You can fit four of these in a high-end workstation chassis with a Threadripper setup, getting you to 192GB. You still won't hit the 400GB mark for 4-bit, but it lets you run a quant at around 2 bits per weight (EXL2-style, if your backend supports this architecture). Keep in mind that at 2.25 bits the weights alone are about 189GB, which leaves almost no room for context, and 2.5 bits will not fit in 192GB at all. Another sleeper pick for memory capacity is the AMD Radeon Pro W7900 48GB. It is often significantly cheaper than the NVIDIA workstation equivalents for the same 48GB footprint. If you really need that responsive feel, honestly, trying to find a used NVIDIA A100 80GB PCIe might be better. The memory bandwidth on those is huge compared to consumer gear, so even if you have to run a smaller quant, generation won't lag out on you.
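To put the bandwidth point in numbers, here is a minimal sketch of the usual rule of thumb: during decoding, each token has to read the active weights from memory once, so bandwidth divided by bytes-per-token gives a ceiling on tokens per second. Assumptions that are mine and not from this thread: the roughly 37B active parameters per token for DeepSeek-V3, and the bandwidth figures, which are spec-sheet numbers from memory. It also ignores KV cache reads, interconnect hops between cards, and any offloading to system RAM, so real throughput will be lower.

```python
ACTIVE_PARAMS = 37e9  # assumed active parameters per token for DeepSeek-V3


def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float,
                                  active_params: float,
                                  bits_per_weight: float) -> float:
    """Bandwidth-bound upper limit on decode speed for one memory pool."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Spec-sheet memory bandwidth in GB/s (assumed values, double-check them).
    devices = {
        "RTX 3090": 936,
        "RTX A6000": 768,
        "A100 80GB PCIe": 1935,
        "M2 Ultra": 800,
    }
    for name, bw in devices.items():
        ceiling = decode_tokens_per_sec_ceiling(bw, ACTIVE_PARAMS, 4.0)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s at 4 bits/weight")
```

The absolute numbers are optimistic, but the ratios between cards are the useful part: a card with roughly twice the bandwidth has roughly twice the ceiling at the same quant.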