What is the best GPU for running DeepSeek models locally?

Question

I’ve been following the DeepSeek releases lately, and I’m honestly blown away by the performance of models like DeepSeek-V3 and the R1 reasoning models. I really want to move away from relying on their API and get a solid local setup running so I can experiment privately and without worrying about per-token costs.

Currently, I’m rocking an older RTX 3060 with 12GB of VRAM, but it’s clearly hitting its limit. While it handles the smaller distilled versions okay, it completely chokes when I try to run the 32B or the 70B quantized versions at any usable speed. I’ve been debating whether to shell out for a single RTX 4090 for the speed or if I should look for a used dual RTX 3090 setup to maximize VRAM for the price. I’ve also heard mixed things about using Mac Studio setups with unified memory for these massive parameter counts, which has me a bit confused about the best path forward.

My main concern is the VRAM bottleneck. Since DeepSeek models are quite beefy, I don’t want to invest a ton of money only to realize I can’t fit the necessary layers into memory for a smooth experience. Given the current hardware landscape, what would you say is the best 'sweet spot' GPU—or multi-GPU configuration—specifically for running DeepSeek's larger models locally right now?

sqmfhxqzzi · Accepted Answer

So I've been thinking about your question and honestly, if you're looking at those beefy DeepSeek models, VRAM capacity is the first thing that matters (memory bandwidth comes second, since it mostly decides your tokens per second once the model fits). You're 100% right that the 12GB on your current card is hitting a wall. I've been there, and it's super frustrating when things just choke mid-inference or give you like 1 token per second lol.

For your situation, here's what I recommend:
- Dual NVIDIA GeForce RTX 3090 24GB GDDR6X setup: This is the absolute sweet spot for value right now. 48GB of VRAM lets you run the 70B R1 distill at decent quants (like Q4_K_M) with a few GB left over for context. You can usually find these used for around $700-850 each on eBay. Heads up that the full DeepSeek-V3 and R1 are 671B-parameter MoE models, so they're out of reach for any of the setups here; the 32B and 70B you mentioned are the distilled versions.
- Mac Studio with M2 Ultra 128GB Unified Memory: If you want to avoid the headache of multi-GPU driver stuff and crazy power consumption, this is highkey the simplest way to run huge models. It's slower than NVIDIA cards for raw token throughput, but that unified memory is a lifesaver for massive parameter counts.
- Skip a single NVIDIA GeForce RTX 4090 24GB GDDR6X: I mean, it's a beast for speed, but 24GB is still gonna bottleneck you on 70B+ DeepSeek versions unless you squash them down to 2-bit or 3-bit quants which... honestly kinda sucks for reasoning quality.
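If you want to sanity-check the 24GB vs 48GB math yourself, here's a rough back-of-envelope sketch. It's an estimate, not a measurement. The assumptions are mine: roughly 4.85 bits per weight for Q4_K_M, a Llama-3-70B-style layout for the 70B distill (80 layers, 8 KV heads, head dim 128), an fp16 KV cache, and a guessed 1.5 GiB of runtime overhead. The function names are just made up for illustration.

```python
GIB = 1024 ** 3

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB (factor 2 = keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / GIB

def estimate_total_gib(n_params: float, bits_per_weight: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       n_ctx: int, overhead_gib: float = 1.5) -> float:
    """Weights + KV cache + a guessed allowance for buffers and the runtime."""
    return (weights_gib(n_params, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx)
            + overhead_gib)

if __name__ == "__main__":
    # Assumed: 70e9 params, ~4.85 bits/weight (Q4_K_M), 8K context.
    total = estimate_total_gib(70e9, 4.85, n_layers=80, n_kv_heads=8,
                               head_dim=128, n_ctx=8192)
    for label, vram_gib in [("single 24GB card", 24), ("dual 3090s", 48)]:
        verdict = "fits" if total <= vram_gib else "does not fit"
        print(f"{label}: need ~{total:.1f} GiB, {verdict}")
```

Under those assumptions the 70B distill at Q4_K_M lands in the low-to-mid 40s of GiB with an 8K context, which is why it's comfortable on 48GB and hopeless on 24GB without dropping to much lower quants.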

Basically, if you can handle the heat and space, go dual 3090s. It's the most practical way to get that 48GB buffer without spending enterprise money. Just make sure your PSU is like 1000W+ or you're gonna have a bad time. Anyway, hope that helps! Peace.
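For the PSU number, here's the quick arithmetic as a sketch. The 350W figure is the 3090's rated board power; the 250W for the rest of the system and the 20% headroom are my own assumptions, so swap in your real CPU and drive numbers. Transient spikes on these cards can go above rated power, which is part of why headroom matters.

```python
def recommended_psu_watts(gpu_watts: float, n_gpus: int,
                          rest_of_system_watts: float = 250.0,
                          headroom: float = 1.2) -> float:
    """Rough PSU sizing: sustained draw times a safety margin.

    rest_of_system_watts and headroom are assumptions, not measurements.
    """
    sustained = gpu_watts * n_gpus + rest_of_system_watts
    return sustained * headroom

if __name__ == "__main__":
    watts = recommended_psu_watts(gpu_watts=350, n_gpus=2)
    print(f"Suggested PSU for dual 3090s: ~{watts:.0f} W")
```

If that comes out higher than you like, power-limiting the cards (for example with `nvidia-smi -pl`) is a common way to pull the sustained draw down at a small speed cost.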

Dizaynersk_xnsl · Answer

In my experience, I would suggest going for a used dual NVIDIA GeForce RTX 3090 24GB GDDR6X setup. Honestly, 48GB of VRAM is the real sweet spot for running those beefy 70B DeepSeek models smoothly. A single NVIDIA GeForce RTX 4090 24GB GDDR6X is faster for smaller stuff, but you'll just hit that VRAM wall again. Just make sure your PSU can handle the draw... it's a lot!! gl

zmjlonosye · Answer

Stumbled upon this discussion and wanted to add a DIY enthusiast angle. Before you drop a few thousand on hardware, my advice is to rent a rig for an hour on a site like RunPod or Vast.ai. It is basically the best way to test how DeepSeek-V3 or R1 quants actually feel on different VRAM setups without committing. If you are okay with a project, look into the NVIDIA Tesla P40. They are old enterprise cards with 24GB of VRAM that you can find for a fraction of the cost of a 3090. They are passively cooled server cards, so you'll have to DIY some cooling with 3D printed shrouds and blowers, but it is a solid way to stack VRAM for cheap. Also, keep an eye on the GGUF model cards on Hugging Face; they usually have a breakdown of exactly how much memory you need for each quant level so you don't overspend. TL;DR: Use cloud rentals to find your preferred quantization first, then consider older enterprise cards for a budget-friendly VRAM build.

Rent an A6000 on RunPod to test quants

Look into DIY cooling for older server GPUs

Use LM Studio for easy memory estimation
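If you'd rather script the "which quant fits my VRAM" check than eyeball model cards, here is a minimal sketch. The bits-per-weight values are rough approximations I'm assuming for llama.cpp-style quants, not figures taken from any particular model card, and the 4 GiB reserved for context and overhead is a guess. Real file sizes on the model card should always win over this estimate.

```python
from typing import Optional

GIB = 1024 ** 3

# Approximate bits per weight, ordered from smallest to largest (assumed values).
APPROX_BPW = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def largest_quant_that_fits(n_params: float, vram_gib: float,
                            reserve_gib: float = 4.0) -> Optional[str]:
    """Return the highest-precision quant whose weights fit in
    vram_gib minus reserve_gib, or None if nothing in the table fits."""
    budget = vram_gib - reserve_gib
    best = None
    for name, bpw in APPROX_BPW.items():
        size_gib = n_params * bpw / 8 / GIB
        if size_gib <= budget:
            best = name  # later entries are larger, so keep overwriting
    return best

if __name__ == "__main__":
    for params, vram in [(32e9, 24), (70e9, 24), (70e9, 48)]:
        pick = largest_quant_that_fits(params, vram)
        print(f"{params / 1e9:.0f}B on {vram} GiB: {pick}")
```

With these assumed numbers, a 32B model tops out around Q4_K_M on a single 24GB card, a 70B model has nothing in the table that fits on 24GB, and 48GB gets the 70B to Q4_K_M, which lines up with the advice in the other answers.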