Hey everyone! I’ve been following the benchmarks for DeepSeek-V3, and the performance is honestly mind-blowing for an open-weight model. I really want to move away from using their API and get this running locally for my coding projects and data analysis, but I’m a bit overwhelmed by the actual hardware requirements.
Since V3 is a massive Mixture-of-Experts (MoE) model with 671B parameters, I know we’re talking about a huge amount of VRAM, even with heavy quantization in formats like GGUF or EXL2. I’ve read conflicting reports on what’s actually needed; some say you need over 300GB of VRAM to run it at a usable bit-width, while others are trying to squeeze it into smaller setups. I’m currently sitting on a single RTX 4090, but I know its 24GB is nowhere near enough for a model of this scale.
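For anyone wanting to sanity-check the conflicting numbers, here is a rough back-of-the-envelope sketch. It assumes weights dominate memory and ignores KV cache, activations, and the per-block overhead that real GGUF/EXL2 quants add, so treat the results as lower bounds. The function name `weight_gb` is just made up for illustration.

```python
# Rough weight-only memory estimate for a 671B-parameter model.
# Assumption: real quant formats store extra scales/metadata, so actual
# files come out somewhat larger than these figures.
TOTAL_PARAMS = 671e9  # parameter count quoted in the thread


def weight_gb(bits_per_weight: float, params: float = TOTAL_PARAMS) -> float:
    """Return weight storage in decimal GB for a given average bit-width."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (8, 4, 2, 1.5):
        print(f"{bits} bits/weight -> about {weight_gb(bits):.1f} GB of weights")
```

By this estimate the "over 300GB" reports line up with a roughly 4-bit quant, while the smaller setups people mention would have to be running something closer to 2 bits or below.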
I’m trying to decide if I should build a multi-GPU rig with several used 3090s linked together, or if I should just bite the bullet and look at a Mac Studio with 192GB of unified memory—though I'm worried about the tokens per second on Apple Silicon. Has anyone here actually successfully loaded the V3 weights locally? What’s the 'sweet spot' GPU setup you'd recommend to get decent inference speeds without spending a fortune on enterprise-grade H100s?
Just sharing my experience: I've spent way too much time over the years chasing the VRAM dragon. To get DeepSeek-V3 running, I actually tried building a rig with four NVIDIA GeForce RTX 3090 24GB cards. It's a total beast to manage because of the heat, and honestly, even with 96GB total you're looking at extreme quantization: a 4-bit quant of 671B parameters is roughly 335GB of weights on its own, so 96GB only works with a very low-bit quant and part of the model offloaded to system RAM.
Compare that to my old server rig with dual NVIDIA RTX 6000 Ada Generation 48GB: the bandwidth is insane, but the price tag is basically a down payment on a house. If you go the multi-GPU route, the traffic between cards over PCIe can really hurt your tokens per second unless you have NVLink, which the 40-series doesn't even support (and on the 3090 a bridge only links two cards)... so yeah, it's a tradeoff. Apple Silicon is easier for the massive memory pool, but those 3090s felt punchier for raw speed in my tests.
yo! i totally feel u, trying to run deepseek-v3 is like trying to park a tank in a garage lol. its a massive MoE model so honestly, 671B parameters is just huge for consumer hardware. been thinking about ur question and if ur worried about budget, i gotta say... be careful with the heat and power draw on those old cards!!
for a practical and budget-conscious setup, i would suggest looking at these specific options:
* NVIDIA Tesla P40 24GB - these are super cheap on the used market. you can chain like 8-10 of them together if you have the space and a beefy power supply. definitely the cheapest way to get high VRAM, though they're kinda slow.
* NVIDIA RTX 3090 24GB - still the king for budget speed. if you can find them used, they're amazing, but getting 4+ to play nice is a total headache with cooling.
* ASRock Rack ROMED8-2T - a single-socket EPYC server motherboard with seven PCIe 4.0 x16 slots, so you can plug in a bunch of GPUs without starving them of bandwidth.
but honestly, even with NVIDIA GeForce RTX 3090 cards, you're gonna need at least 6-8 of them (144-192GB) just to hold a super heavy 1.5-bit or 2-bit quant of V3, and the logic might get a bit wonky at that level. a proper 4-bit quant is around 335GB of weights, so more like 14+ cards. maybe start with two more used 3090s and see how it feels on smaller models first? gl! 👍
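If it helps, here is a quick sketch of the card-count math. The 10% overhead for KV cache and runtime buffers is purely an assumption (real overhead depends on context length and backend), and `cards_needed` is a made-up helper name, not something from any library.

```python
import math


def cards_needed(params: float, bits_per_weight: float,
                 vram_gb_per_card: float, overhead_frac: float = 0.10) -> int:
    """Estimate how many cards are needed to hold the weights fully in VRAM.

    overhead_frac is an assumed allowance for KV cache and buffers,
    not a measured value.
    """
    weights_gb = params * bits_per_weight / 8 / 1e9
    total_gb = weights_gb * (1 + overhead_frac)
    return math.ceil(total_gb / vram_gb_per_card)


if __name__ == "__main__":
    for bits in (1.5, 2, 4):
        n = cards_needed(671e9, bits, vram_gb_per_card=24)
        print(f"{bits} bits/weight -> {n} x 24GB cards")
```

Under these assumptions, 6-8 cards lines up with the 1.5-bit to 2-bit range, and a 4-bit quant lands well into double digits.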
Honestly, I feel u on the VRAM struggle!! DeepSeek-V3 is a total beast. I actually tried running a 4-bit quant of it and seriously, unless you have like 700GB+ of VRAM to run it at its native FP8, you gotta compromise (even 4-bit is ~335GB of weights). Here's what I recommend: for the best bang-for-your-buck, definitely go the multi-GPU route with a few NVIDIA GeForce RTX 3090 24GB cards. I built a rig with four of those, NVLinked in pairs since the 3090 bridge only joins two cards, and it’s honestly amazing for the price compared to enterprise gear. Each card gives u 24GB, so 4-6 cards gets u 96-144GB, which is the sweet spot on price but still means a very low-bit quant or offloading part of the model to system RAM.
I was looking at the Apple Mac Studio M2 Ultra with 192GB Unified Memory too, and while the memory pool is huge, the tokens per second are kinda mid for a model this size compared to the raw power of multiple GPUs. If u can find some used 3090s for like $700 each, it's a steal. Just make sure ur PSU can handle the juice lol. gl with the build!!
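To put a rough number on the tokens-per-second question, here is a sketch of the usual memory-bandwidth ceiling for decoding. Everything in it is an assumption not stated in the thread: that decode speed is bandwidth-bound, that only the roughly 37B parameters DeepSeek-V3 activates per token (per its model card) are read each step, and that the bandwidth figures are the spec-sheet peaks (800 GB/s for the M2 Ultra, 936 GB/s for a single RTX 3090). Real throughput will be well below this ceiling.

```python
ACTIVE_PARAMS = 37e9  # assumed: params activated per token, from the model card


def decode_ceiling_tps(mem_bandwidth_gbs: float, bits_per_weight: float,
                       active_params: float = ACTIVE_PARAMS) -> float:
    """Upper bound on tokens/sec if every token must stream the active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bandwidth_gbs * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Spec-sheet peak bandwidths, assumed rather than measured.
    for name, bw in (("M2 Ultra", 800), ("RTX 3090 (per card)", 936)):
        print(f"{name}: ceiling of about {decode_ceiling_tps(bw, 4):.0f} tok/s at 4-bit")
```

Note that with a layer-split multi-GPU setup the cards mostly take turns, so the per-card bandwidth is the relevant figure rather than the sum. This ceiling also says nothing about prompt processing, which is compute-bound.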
Saved for later, ty!
Yep, this is the way
Totally agree with the warnings about heat and power draw. Building a 4x 3090 rig sounds cool until you realize you're basically running a space heater in your office. For a more stable setup, I'd suggest looking at workstation cards instead of gaming ones. Quick tip: compare the AMD Radeon Pro W7900 48GB against NVIDIA. AMD is often more budget-friendly per GB of VRAM now, and ROCm support in tools like llama.cpp has improved a lot, though I'd check that your inference stack supports it before buying. If you must have CUDA, a used NVIDIA RTX A6000 48GB is a much safer bet than a bunch of 3090s. The ECC memory and blower design are built for these heavy loads... gaming cards just aren't.
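For the space-heater point, a rough PSU sizing sketch. The 350W figure is the RTX 3090's rated board power and 300W is the RTX A6000's. The 300W allowance for CPU, drives, and fans and the 20% headroom are assumptions picked for illustration, and `psu_watts` is a made-up helper. Transient spikes on gaming cards can exceed rated power, so this is a floor rather than a recommendation.

```python
def psu_watts(num_gpus: int, gpu_board_power_w: float,
              rest_of_system_w: float = 300.0, headroom: float = 1.2) -> float:
    """Estimate PSU capacity: sustained draw times an assumed headroom factor."""
    sustained = num_gpus * gpu_board_power_w + rest_of_system_w
    return sustained * headroom


if __name__ == "__main__":
    print(f"4x RTX 3090:  about {psu_watts(4, 350):.0f} W")
    print(f"4x RTX A6000: about {psu_watts(4, 300):.0f} W")
```

Either way, nearly all of that power ends up as heat in the room, which is why power-limiting the cards is a common trick on these builds.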
To add to the point above, I have to politely disagree with the 3090 or 4090 recommendations if you're planning on long inference sessions. Over the years, I've seen so many DIY rigs fail because people ignore the thermal density. A stack of four gaming cards is basically a space heater that can throttle itself within minutes. In my experience, the real sweet spot for a model like DeepSeek-V3 is hunting for used NVIDIA RTX A6000 48GB cards (you'd still need several of them for a 671B model). Since it’s the Ampere architecture, you avoid the massive price tag of the newer chips while getting professional-grade stability.
🙌