
What is the best GPU for running DeepSeek models locally?


I’ve been following the DeepSeek releases lately, and I’m honestly blown away by the performance of models like DeepSeek-V3 and the R1 reasoning models. I really want to move away from relying on their API and get a solid local setup running so I can experiment privately and without worrying about per-token costs.

Currently, I’m rocking an older RTX 3060 with 12GB of VRAM, but it’s clearly hitting its limit. While it handles the smaller distilled versions okay, it completely chokes when I try to run the 32B or the 70B quantized versions at any usable speed. I’ve been debating whether to shell out for a single RTX 4090 for the speed or if I should look for a used dual RTX 3090 setup to maximize VRAM for the price. I’ve also heard mixed things about using Mac Studio setups with unified memory for these massive parameter counts, which has me a bit confused about the best path forward.

My main concern is the VRAM bottleneck. Since DeepSeek models are quite beefy, I don’t want to invest a ton of money only to realize I can’t fit the necessary layers into memory for a smooth experience. Given the current hardware landscape, what would you say is the best 'sweet spot' GPU—or multi-GPU configuration—specifically for running DeepSeek's larger models locally right now?


6 Answers
12

sooo i've been thinking about your question and honestly, if you're looking at those beefy DeepSeek models, VRAM is literally the only thing that matters. you're 100% right that the 12gb on your current card is hitting a wall. i've been there and it's super frustrating when things just choke mid-inference or give you like 1 token per second lol.

For your situation, here's what I recommend:
- Dual NVIDIA GeForce RTX 3090 24GB GDDR6X setup: This is the absolute sweet spot for value right now. 48GB of VRAM lets you run the 70B models at decent quants (like Q4_K_M) with a solid context window. You can usually find these used for around $700-850 each on eBay.
- Mac Studio with M2 Ultra 128GB Unified Memory: if you want to avoid the headache of multi-GPU driver stuff and crazy power consumption, this is highkey the best way to run huge models. It's slower than NVIDIA for raw speed, but that unified memory is a lifesaver for massive parameter counts.
- Skip a single NVIDIA GeForce RTX 4090 24GB GDDR6X: I mean, it's a beast for speed, but 24GB is still gonna bottleneck you on 70B+ DeepSeek versions unless you squash them down to 2-bit or 3-bit quants which... honestly kinda sucks for reasoning quality.

Basically, if you can handle the heat and space, go dual 3090s. It's the most practical way to get that 48GB buffer without spending enterprise money. Just make sure ur PSU is like 1000W+ or you're gonna have a bad time. anyway, hope that helps! peace.
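
For anyone who wants to sanity-check the 48GB vs 24GB numbers above, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: roughly 4.85 bits per weight for Q4_K_M, a Llama-style 70B layout (80 layers, 8 KV heads, head dim 128) with an fp16 KV cache, and a guessed 1.5 GiB of runtime overhead. The function names are made up for this example, and real usage depends on the runtime and the exact GGUF file.

```python
# Rough VRAM estimate for a quantized dense model: weights + KV cache + overhead.
# Every constant here is an assumption for illustration, not a measured value.

GIB = 1024 ** 3


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: one K and one V vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem
    return total_bytes / GIB


def estimate_total_gib(params_billion: float, bits_per_weight: float,
                       context_tokens: int, n_layers: int = 80,
                       n_kv_heads: int = 8, head_dim: int = 128,
                       overhead_gib: float = 1.5) -> float:
    """Weights + KV cache + a guessed fixed overhead for buffers and the runtime."""
    return (weights_gib(params_billion, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens)
            + overhead_gib)


def fits(total_vram_gib: float, params_billion: float,
         bits_per_weight: float, context_tokens: int) -> bool:
    """True if the estimate fits in the given VRAM (treating 24GB cards as 24 GiB)."""
    return estimate_total_gib(params_billion, bits_per_weight, context_tokens) <= total_vram_gib


if __name__ == "__main__":
    need = estimate_total_gib(70, 4.85, 8192)
    print(f"70B @ ~4.85 bpw, 8K context: ~{need:.1f} GiB")
    print("fits on 2x 24GB:", fits(48, 70, 4.85, 8192))
    print("fits on 1x 24GB:", fits(24, 70, 4.85, 8192))
```

Under these assumptions a 70B model at Q4_K_M with 8K context lands around 43-44 GiB, which is why it fits across two 24GB cards but not on one.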


10

In my experience, I would suggest going for a used dual NVIDIA GeForce RTX 3090 24GB GDDR6X setup. Honestly, 48GB of VRAM is the real sweet spot for running those beefy 70B DeepSeek models smoothly. A single NVIDIA GeForce RTX 4090 24GB GDDR6X is faster for smaller stuff, but you'll just hit that VRAM wall again. Just make sure your PSU can handle the draw... it's a lot!! gl
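
A quick way to ballpark the PSU side of this, as a sketch only. It assumes a rated board power of about 350W per 3090 (450W for a 4090), a guessed 250W for the rest of the system, and 20% headroom. None of these figures come from the thread, and transient spikes on these cards can exceed rated power, so treat the result as a floor rather than a spec. The function name is hypothetical.

```python
import math


def suggested_psu_watts(gpu_watts: int, n_gpus: int,
                        rest_of_system_watts: int = 250,
                        headroom: float = 0.20, step: int = 50) -> int:
    """Sustained load plus headroom, rounded up to the next `step` watts."""
    load = gpu_watts * n_gpus + rest_of_system_watts
    return math.ceil(load * (1 + headroom) / step) * step


if __name__ == "__main__":
    print("dual 3090:", suggested_psu_watts(350, 2), "W")
    print("single 4090:", suggested_psu_watts(450, 1), "W")
```

With these defaults the dual-3090 case comes out at 1150W, which lines up with the "1000W+" advice given earlier in the thread. Power-limiting the cards would lower the number.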


3

Stumbled upon this discussion and wanted to add a DIY enthusiast angle. Before you drop a few thousand on hardware, my advice is to rent a rig for an hour on a site like RunPod or Vast.ai. It is basically the best way to test how DeepSeek-V3 or R1 quants actually feel on different VRAM setups without committing. If you are okay with a project, look into the NVIDIA Tesla P40. They are old enterprise cards with 24GB of VRAM that you can find for a fraction of the cost of a 3090. You'll have to DIY some cooling with 3D printed shrouds and blowers, but it is a solid way to stack VRAM for cheap. Also, keep an eye on the GGUF model cards on Hugging Face; they usually have a breakdown of exactly how much memory you need for each quant level so you don't overspend. TL;DR: Use cloud rentals to find your preferred quantization first, then consider older enterprise cards for a budget-friendly VRAM build.

  • Rent an A6000 on RunPod to test quants
  • Look into DIY cooling for older server GPUs
  • Use LM Studio for easy memory estimation


3

No way, I literally just dealt with this yesterday. Small world.


2

Great info, saved!


2

> I’ve also heard mixed things about using Mac Studio setups with unified memory for these massive parameter counts, which has me a bit confused about the best path forward.

Sooo, it’s basically a toss-up between raw speed and sheer memory capacity right now. Honestly, the market for DeepSeek-capable hardware is kinda wild because Nvidia has the "speed" crown with CUDA, but the Apple ecosystem lets you access way more memory for those huge parameter counts without the headache of a massive power supply. Like, it’s much easier to fit a 70B model into unified memory, but the tokens per second might feel sluggish compared to a multi-GPU PC setup.

But before you dive in, what’s your actual budget? And are you just looking for fast chat responses (inference), or are you planning to do any fine-tuning or training? You should definitely check out some of the community-run performance databases like the ones on the LocalLlama sub or the "LLM-Viewer" site. They’re reallyyy helpful for seeing how different brands handle the DeepSeek architecture specifically. It’s honestly eye-opening to see how much the memory bandwidth varies between a custom workstation build and a high-end consumer setup!
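
To put rough numbers on the speed vs. capacity tradeoff, here is a small sketch of the usual bandwidth-bound estimate for token generation on a dense model: each new token has to read roughly all the weights once, so tokens per second cannot exceed memory bandwidth divided by model size. The bandwidth figures are published peak specs (about 936 GB/s for a 3090, 800 GB/s for an M2 Ultra), the model size assumes a 70B model at roughly 4.85 bits per weight, and the function name is just for this example. Real throughput will be lower, sometimes a lot lower.

```python
# Upper bound on decode speed for a dense model: bandwidth / bytes read per token.
# Peak bandwidth figures are vendor specs; actual achieved bandwidth is lower.

PEAK_BANDWIDTH_GB_S = {
    # With a layer split across two 3090s, the cards work one after another,
    # so the effective bandwidth per token is roughly that of a single card.
    "2x RTX 3090 (layer split)": 936.0,
    "Mac Studio M2 Ultra": 800.0,
}

# Assumed model: 70B parameters at ~4.85 bits per weight, in decimal GB.
MODEL_SIZE_GB = 70e9 * 4.85 / 8 / 1e9


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling: every generated token streams all weights once."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    print(f"model size: ~{MODEL_SIZE_GB:.1f} GB")
    for name, bw in PEAK_BANDWIDTH_GB_S.items():
        ceiling = max_tokens_per_second(bw, MODEL_SIZE_GB)
        print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

This only bounds generation speed. Prompt processing is compute-bound, which is where NVIDIA cards tend to pull further ahead of Apple silicon.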

