
What is the best GPU for running DeepSeek-V3 locally?

0
Topic starter

I'm really hyped about DeepSeek-V3, but with 671B parameters, I’m scratching my head over the hardware. I want to run it locally with decent speed. Should I aim for multiple RTX 4090s, or is Mac Studio's unified memory the better play here? What’s the sweet spot for VRAM and quantization to keep it snappy?


9 Answers
11

Ok so, seconding the recommendation above! Those VRAM bottlenecks are NO JOKE. From my experience, building rigs since the GTX days, the *best* bang-for-buck move right now is actually chaining multiple NVIDIA GeForce RTX 3090 24GB cards.

They're way cheaper than 4090s and still have that sweet 24GB of VRAM. You'll need like 16 of them (16 x 24GB = 384GB) to run DeepSeek-V3 properly at 4-bit, so look at server-grade stuff like the ASUS ESC8000 G4 GPU Server. Keep in mind that chassis holds 8 GPUs, so you'd need two of them networked together. Seriously, it's a massive project! TL;DR: Go used 3090s or rent cloud H100s cuz 671B is absolute madness for home gear lol.


10

> What is the best GPU for running DeepSeek-V3 locally? ... Should I aim for multiple RTX 4090s, or is Mac Studio's unified memory the better play here?

For your situation, man, honestly... running a beast like V3 locally is a massive challenge. Basically, even with heavy quantization, you're looking at needing way over 350GB of VRAM to run the full 671B params at 4-bit (the weights alone are about 335GB before KV cache and runtime overhead). Since DeepSeek-V3 is a Mixture-of-Experts (MoE) model, only about 37B params are active per token, but you still gotta keep the whole thing in memory to get decent speed.
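To make the memory figures above concrete, here is a minimal sketch of the arithmetic. It assumes every parameter is stored at the same bit width, which is a simplification: real quant formats mix precisions, and KV cache and runtime buffers come on top. The function name `weight_gb` is just made up for illustration.

```python
# Rough weight-memory estimate for a model at a given quantization level.
# Assumption: uniform bit width for all parameters; actual quantized files
# mix precisions, so real sizes will differ somewhat.

TOTAL_PARAMS = 671e9   # DeepSeek-V3 total parameter count (from the thread)
ACTIVE_PARAMS = 37e9   # parameters active per token in the MoE (from the thread)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4, 2):
    stored = weight_gb(TOTAL_PARAMS, bits)
    active = weight_gb(ACTIVE_PARAMS, bits)
    print(f"{bits:>2}-bit: {stored:7.1f} GB stored, {active:5.1f} GB active per token")
```

At 4-bit this gives about 335.5 GB for the full model, so once you add cache and overhead the "way over 350GB" figure is in the right range. The active slice per token is much smaller (about 18.5 GB at 4-bit), which is why the MoE design helps speed but not storage.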

Here's what I recommend based on what I've seen:

**Option A: The Multi-GPU PC Build**
If you go with the NVIDIA GeForce RTX 4090 24GB, you'd literally need like 16 cards to fit the model at a decent bit width. Even with the NVIDIA RTX 6000 Ada Generation 48GB, you're still looking at around 8 cards and a massive server rack setup.
- **Pros:** Seriously fast tokens per second (TPS) if you have the PCIe bandwidth.
- **Cons:** The power bill and heat will be insane. Plus, the cost of 8-16 cards is eye-watering.
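The card counts for Option A can be sanity-checked with a quick sketch like the one below. The 10% overhead factor for KV cache and buffers is an assumption picked for illustration, not a measured number, and `cards_needed` is a hypothetical helper name.

```python
import math

# 4-bit weights for 671B params, in GB (1 GB = 1e9 bytes)
WEIGHTS_4BIT_GB = 671e9 * 4 / 8 / 1e9  # 335.5


def cards_needed(model_gb: float, vram_per_card_gb: float,
                 overhead: float = 1.10) -> int:
    """Smallest number of cards whose combined VRAM covers weights plus overhead.

    overhead=1.10 is an assumed 10% allowance for KV cache and buffers;
    long contexts may need noticeably more.
    """
    return math.ceil(model_gb * overhead / vram_per_card_gb)


for name, vram in (("RTX 4090 24GB", 24), ("RTX 6000 Ada 48GB", 48)):
    print(f"{name}: {cards_needed(WEIGHTS_4BIT_GB, vram)} cards")
```

Under those assumptions it works out to 16 of the 24GB cards or 8 of the 48GB cards, with very little headroom in either case.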

**Option B: The Mac Studio Route**
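I think the Apple Mac Studio M2 Ultra with 192GB Unified Memory is actually the more "practical" play here, even tho it's still pricey. You'd have to use a very aggressive quant to fit it into 192GB, realistically around 2 bits per weight or lower (IQ2_XXS territory). IQ2_XS and IQ2_M nominally land around 194GB and 226GB for 671B params, which is too tight once the OS takes its share. But it'll actually run on a single desktop footprint.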
- **Pros:** Huge unified memory is the only way to avoid buying a server.
- **Cons:** It's gonna be slower than an NVIDIA cluster. Maybe 1-3 tokens per second?

Personally, I guess the sweet spot for a "sane" home user is probably the Mac, but don't expect it to be snappy like a smaller model. It's more for deep thinking tasks where you don't mind waiting a bit. gl! 👍


3

Yeah, I am still kind of a beginner at this, but I have been running my own setup for a while now and the long term stuff is what really matters. Honestly, if you want to avoid a massive headache, I would say just go with NVIDIA. You can't really go wrong with their stuff because the software is just so much more beginner friendly than anything else out there right now. I would look at getting a setup with their professional line of cards instead of the gaming ones. They are built to run for a long time and they don't seem to have as many weird driver issues when you start plugging a bunch of them together. I have spent way too many nights trying to fix software bugs, and sticking to that one ecosystem has made my life way easier. Just get whatever high memory cards you can find from that brand and you should be good to go. It is way less stressful than trying to mix different brands or weird setups that might break with the next update. Plus, if you ever decide to quit, the resale value for those specific cards is usually really solid.


3

@Reply #8 - good point! Reading through this, everyone is talking about the cards but people keep forgetting the motherboard bottlenecks. I totally agree that used cards are the budget play here, but you gotta be real careful about your PCIe lanes. If you try to stick four NVIDIA GeForce RTX 3090 24GB cards into a standard consumer board, you are gonna hit a massive wall with bandwidth. I would suggest looking for a used Supermicro H11DSi or similar EPYC board on eBay. They have the lanes you need and they're usually pretty cheap if they're server pulls. Just watch out for the proprietary power connectors tho.

Also, make sure to check your physical clearance. Most 3090s are 3-slot monsters. Unless you find the blower-style ASUS Turbo GeForce RTX 3090, you are basically forced into an open-air mining frame or a really expensive rackmount case. Don't forget you'll need at least two EVGA SuperNova 1600 G+ units to handle the transient spikes without tripping. It's a lot of moving parts to keep compatible, so just double check everything before buying.
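The lane and PSU concerns in this post can be roughed out in a few lines. This is only a sketch: the 24-lane consumer platform, the 128-lane EPYC platform, the 1.5x transient spike factor, and the 400W for the rest of the system are all illustrative assumptions, so check the actual spec sheets. The 350W figure is the 3090's rated board power. Both function names are hypothetical.

```python
# Back-of-envelope build check for a four-GPU box.
# All platform numbers are assumptions for illustration only.

def lanes_short(num_gpus: int, lanes_per_gpu: int, cpu_lanes: int) -> int:
    """Return how many PCIe lanes are missing (0 if the platform has enough)."""
    return max(0, num_gpus * lanes_per_gpu - cpu_lanes)


def psu_headroom_w(num_gpus: int, gpu_w: float, spike_factor: float,
                   rest_of_system_w: float, psu_w: float, num_psus: int) -> float:
    """Return spare watts at assumed worst-case transient load (negative = overloaded)."""
    peak = num_gpus * gpu_w * spike_factor + rest_of_system_w
    return num_psus * psu_w - peak


# Four cards at x16 each on an assumed 24-lane consumer CPU vs. a 128-lane EPYC
print("lanes short (consumer):", lanes_short(4, 16, 24))
print("lanes short (EPYC):    ", lanes_short(4, 16, 128))

# Four 350W cards with an assumed 1.5x transient spike and 400W for everything else
print("headroom with one 1600W PSU: ", psu_headroom_w(4, 350, 1.5, 400, 1600, 1))
print("headroom with two 1600W PSUs:", psu_headroom_w(4, 350, 1.5, 400, 1600, 2))
```

With these assumed numbers, the consumer board comes up 40 lanes short for full x16 on every card, and a single 1600W unit is overloaded at peak while two leave some margin, which lines up with the advice above.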


2

Yeah, I totally agree that the multi-GPU route is basically the only way, but the idea of sixteen consumer cards in one room makes me a bit nervous about the power bill and heat. Tbh, if you're looking at this from a market perspective, there are some enterprise brands that might be more reliable if you have the budget. I've been doing some research into a few alternatives that are a bit more 'stable' than a home-brew rig:

1. AMD Instinct MI210: These are starting to pop up more on the secondary market. With 64GB of VRAM, you don't need nearly as many cards to hit that 400GB+ requirement. I'm a little unsure if the ROCm support is 100% there for DeepSeek-V3 yet, but it's a beefy option.
2. NVIDIA L40S: These are specifically built for AI inference. You get 48GB VRAM per card and much better thermal management than a bunch of gaming GPUs.
3. NVIDIA A100 80GB PCIe: If you can find these refurbished, it's the safest bet for uptime. Using 80GB modules means fewer points of failure in the chain.

So basically, you gotta weigh the 'cheap but risky' DIY approach against the enterprise gear. Anyway, it's a massive investment regardless, so good luck with whatever you go with!
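For a quick side-by-side of the three enterprise options against the 400GB figure mentioned in this post, something like the sketch below works. It only counts raw VRAM and ignores interconnect, software support, and price; `count_for_target` is a hypothetical helper name.

```python
import math

TARGET_GB = 400  # the "400GB+" requirement quoted in the post

# Per-card memory for the three cards listed in the post
CARDS = {
    "AMD Instinct MI210": 64,
    "NVIDIA L40S": 48,
    "NVIDIA A100 80GB PCIe": 80,
}


def count_for_target(vram_gb: int, target_gb: int = TARGET_GB) -> int:
    """Smallest card count whose combined VRAM reaches the target."""
    return math.ceil(target_gb / vram_gb)


for name, vram in CARDS.items():
    n = count_for_target(vram)
    print(f"{name}: {n} cards, {n * vram} GB total")
```

That comes out to 7 MI210s, 9 L40S cards, or 5 A100s, with the A100 setup landing exactly on 400GB and so leaving no spare room.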


2

Honestly, that point about the heat and the power bill is spot on. I've been a DIY guy for a long time, and I've learned the hard way that once you start chaining a massive amount of hardware together, it stops being a fun project and starts being a second job. I love the control of a self-built rig, but the hidden cost is the sheer amount of time you spend on maintenance. If you go the home-brew route for something as massive as DeepSeek-V3, be prepared to spend a lot of that time just keeping the nodes talking to each other. You'll probably spend more time troubleshooting your CUDA environment or PCIe bandwidth issues than actually running the model.


2

This thread is gold. Bookmarking for future reference 🔖


1

Honestly, you're gonna have a bad time trying to fit 671B params on a home setup without a massive budget. For your situation, I'd watch out for:

- VRAM Bottlenecks: Even at 4-bit, you need like 400GB+ VRAM.
- Power Draw: Multiple high-end cards will literally trip your breakers.
- Slow Inference: Spreading this across consumer GPUs usually kills the speed.

I tried a multi-GPU rig recently and it was way noisier than expected, plus the setup was a total headache. Maybe just stick to the API for now? lol
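On the breaker point, here is a rough sketch of how many household circuits the GPUs alone would need. It assumes a North American 120V/15A circuit, the common 80% rule of thumb for continuous loads, and 350W per card (the 3090's rated board power); CPUs, fans, and PSU losses are ignored, so real draw is higher. `circuits_needed` is a hypothetical helper name.

```python
import math


def circuits_needed(total_watts: float, volts: float = 120,
                    breaker_amps: float = 15,
                    continuous_derate: float = 0.8) -> int:
    """Number of separate circuits needed to carry a continuous load.

    Defaults are assumptions: 120V/15A circuit, derated to 80% for
    continuous use (1440W usable per circuit).
    """
    usable_w = volts * breaker_amps * continuous_derate
    return math.ceil(total_watts / usable_w)


for cards in (4, 8, 16):
    gpu_watts = cards * 350  # GPUs only, assumed 350W each
    print(f"{cards} cards: {gpu_watts} W, {circuits_needed(gpu_watts)} circuit(s)")
```

Under those assumptions, four cards already use nearly a full circuit on their own and sixteen need four, before counting anything else in the system.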


1

Yep been there done that. Can confirm everything said above is spot on.

