What is the best GPU for running DeepSeek models locally?

Question

What GPU setup is actually hitting the sweet spot for the DeepSeek-V3 or R1 models right now without breaking the bank? I've been running local LLMs since the early Llama days and have my workflow down pretty well, but these newer DeepSeek weights are a different beast entirely. The MoE architecture is really throwing me for a loop with how it uses VRAM compared to the standard dense models I'm used to.
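A rough way to see why MoE models eat VRAM differently: the runtime has to keep every expert loaded, because any of them might be routed to for the next token, but each token only reads a small subset. This is a back-of-the-envelope sketch, assuming DeepSeek-V3's published figures (671B total parameters, ~37B activated per token) and approximate average bits-per-weight for each quant level. It ignores the KV cache and runtime overhead.

```python
# Rough weight-memory estimate for an MoE model.
# Assumptions: 671B total / ~37B activated parameters (DeepSeek-V3's published
# figures); the bits-per-weight values are approximate averages, not exact sizes.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (no KV cache, no overhead)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

TOTAL_PARAMS_B = 671   # every expert: this is what has to fit in memory
ACTIVE_PARAMS_B = 37   # parameters used per token: this roughly drives per-token reads

for label, bpw in [("~Q4", 4.5), ("~Q5", 5.5), ("FP8", 8.0)]:
    resident = weight_gib(TOTAL_PARAMS_B, bpw)
    touched = weight_gib(ACTIVE_PARAMS_B, bpw)
    print(f"{label}: {resident:6.0f} GiB resident, ~{touched:5.1f} GiB read per token")
```

Under these assumptions, even a ~Q4 copy of the full V3/R1 needs on the order of 350 GiB just for weights, while the per-token read is about what a ~37B dense model would need. That is the MoE trade-off: relatively cheap to compute, expensive to fit.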

Right now I'm on a single 3060, which is obviously a joke for this. I tried running the 67B version and it was basically a slideshow, maybe 0.5 tokens per second if I was lucky. I'm looking to spend around $1,800, maybe $2,000 if I can stretch it, but I'm really torn between a few options:

single RTX 4090 for the raw speed

dual used RTX 3090s for the 48GB VRAM pool

maybe a Mac Studio with the M2 Ultra, but that's pushing my budget
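The 3060 slideshow is mostly a memory-bandwidth problem: generating each token streams roughly all of a dense model's weights through memory once, so whatever share of the model sits in system RAM sets the pace. Here is a minimal upper-bound estimate, assuming ~360 GB/s for the RTX 3060 12GB, ~936 GB/s for an RTX 3090 (both spec figures), and a guessed ~50 GB/s for dual-channel DDR4. The ~38 GB model size and ~10 GB of weights on the 3060 are also assumptions.

```python
# Bandwidth-bound ceiling on decode speed for a dense model split between
# GPU memory and system RAM. All bandwidths and sizes are assumptions.

def max_tokens_per_sec(model_gb: float, gpu_gb: float,
                       gpu_bw_gbs: float, cpu_bw_gbs: float) -> float:
    """Upper bound on tokens/s when part of the weights spill to system RAM."""
    on_gpu = min(model_gb, gpu_gb)      # weights that fit in VRAM
    on_cpu = model_gb - on_gpu          # weights left in system RAM
    seconds_per_token = on_gpu / gpu_bw_gbs + on_cpu / cpu_bw_gbs
    return 1.0 / seconds_per_token

# A 67B dense model at ~4.5 bits/weight is ~38 GB; assume ~10 GB of it fits
# on a 12 GB card after the KV cache and runtime overhead.
spilled = max_tokens_per_sec(model_gb=38, gpu_gb=10, gpu_bw_gbs=360, cpu_bw_gbs=50)
all_vram = max_tokens_per_sec(model_gb=38, gpu_gb=48, gpu_bw_gbs=936, cpu_bw_gbs=50)
print(f"3060 + system RAM:        <= {spilled:.1f} tok/s")
print(f"fully in 3090-class VRAM: <= {all_vram:.1f} tok/s")
```

With ~28 GB spilled, the ceiling comes out under 2 tok/s, and real runs land below the ceiling, which is consistent with the ~0.5 tok/s above. For the options listed, the RTX 4090's spec bandwidth (~1,008 GB/s) is only a little higher than the 3090's. For single-user token generation, getting the whole model into VRAM likely matters more than which of those two cards you pick.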

I'm based in Oregon, so I can hit up some local shops, but I'm mostly looking online. My main use cases are local coding assistance and some massive PDF analysis for work, so speed is kinda important, but I need enough VRAM to actually fit the model at a decent quant like Q4 or Q5. I've heard the unified memory on a Mac is better for the huge DeepSeek models, but I'm worried about its actual compute speed on the MoE layers compared to NVIDIA. Is anyone running R1 70B or the full V3 locally on a setup that doesn't cost as much as a used car?
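To check whether a given quant actually fits, add the KV cache for your context length to the weight size. This is a minimal sketch under several assumptions: the Llama-3 70B layout that DeepSeek-R1-Distill-Llama-70B is built on (80 layers, 8 KV heads, head dim 128, FP16 cache), approximate average bits-per-weight of 4.8 for a Q4-style quant and 5.7 for a Q5-style quant, and a guessed 1.5 GiB of runtime overhead. The Mac is left out because not all unified memory is available to the GPU by default.

```python
# Fit check for a quantized 70B dense model: weights + KV cache + overhead.
# Architecture, bits-per-weight, and overhead values are assumptions.

GIB = 2**30

def weights_gib(params_b: float, bpw: float) -> float:
    """Approximate weight size in GiB for params_b billion parameters."""
    return params_b * 1e9 * bpw / 8 / GIB

def kv_cache_gib(context: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: keys + values, per layer, per KV head, per token (FP16)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context / GIB

def fits(pool_gib: float, params_b: float, bpw: float, context: int,
         overhead_gib: float = 1.5) -> bool:
    need = weights_gib(params_b, bpw) + kv_cache_gib(context) + overhead_gib
    return need <= pool_gib

pools = {"1x RTX 4090 (24 GB)": 24, "2x RTX 3090 (48 GB)": 48}
for quant, bpw in [("~Q4", 4.8), ("~Q5", 5.7)]:
    for name, size in pools.items():
        ok = fits(size, params_b=70, bpw=bpw, context=8192)
        print(f"70B {quant} @ 8k ctx on {name}: {'fits' if ok else 'does not fit'}")
```

Under these assumptions, a ~Q4 70B with an 8k context fits in 48 GB but not in 24 GB. A ~Q5 lands just over 48 GB, so it would likely need a shorter context, a quantized KV cache, or a few layers left in system RAM. The full 671B V3/R1 is out of reach for any of these pools.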

Spravkiqgd · Accepted Answer

Unfortunately, dual 3090s were a disappointment due to heat crashes. I'd suggest an Apple Mac Studio M2 Max 64GB for reliability. Unified memory is safer and avoids driver issues.

remont_etol · Answer

tl;dr: Dual NVIDIA GeForce RTX 3090 24GB GDDR6X cards are the play. 48GB of VRAM is the sweet spot for the 70B models at decent quants without spending a fortune.

I went through this exact headache a few months back. I started with a single NVIDIA GeForce RTX 4090 24GB GDDR6X, thinking the raw speed would save me, but man, those MoE models are VRAM hogs. Once you spill over into system memory, it doesn't matter how fast your GPU is... it just crawls. I ended up hunting for deals and grabbed two used 3090s for about $700-800 each, and it changed everything.

With 48GB total, I can run DeepSeek-R1-Distill-Llama-70B at Q4 or Q5 quants comfortably. For coding and PDF work, you really want a higher-bit quant so the logic doesn't fall apart. On my dual setup, I get around 12-15 tokens per second. It's not lightning fast like a 4090 on a small model, but it's consistent and actually usable for real work.

The Mac route is cool for the unified memory, but honestly, unless you spend way over $2,000 on an Ultra with 128GB+ of RAM, you won't be running the full 671B model at a decent speed anyway. The dual 3090s fit your budget perfectly, even with a beefy PSU like the EVGA SuperNOVA 1600 G+, which you are definitely gonna need. Just watch the thermals, because those cards get pretty toasty when they're sandwiched together in a mid-tower.
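It helps to know how a "48GB pool" works in practice. With the common layer-split approach, the two 3090s are not merged into one memory space: each card holds a share of the transformer layers, and activations hop from one card to the other. The helper below is a hypothetical illustration that splits layers in proportion to each card's free VRAM. It is similar in spirit to the proportions you pass to llama.cpp's --tensor-split option, but it is not that tool's actual algorithm. The free-VRAM numbers are assumptions.

```python
# Hypothetical helper: assign transformer layers to GPUs in proportion to
# each card's free VRAM, using largest-remainder rounding so counts sum exactly.

def split_layers(n_layers: int, free_vram_gib: list[float]) -> list[int]:
    """Return how many layers each GPU gets, proportional to its free VRAM."""
    total = sum(free_vram_gib)
    exact = [n_layers * v / total for v in free_vram_gib]
    counts = [int(x) for x in exact]
    # Give leftover layers to the GPUs with the largest fractional remainders.
    leftovers = n_layers - sum(counts)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts

# Two 3090s, where the first also drives the display and has ~1.5 GiB less free.
print(split_layers(80, [22.5, 24.0]))  # → [39, 41]
```

Because each card processes its layers one after the other, a layer split mostly buys capacity rather than double the speed.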