
What is the best GPU for running DeepSeek-V3 locally?

Topic starter

Hey everyone! I’ve been blown away by the benchmarks for DeepSeek-V3 lately, and I’m really itching to get it running locally on my own machine. I’m currently trying to move away from relying on cloud APIs for my coding assistant and research tasks, but I’m hitting a bit of a wall when it comes to hardware requirements.

Since DeepSeek-V3 is a massive 671B parameter Mixture-of-Experts (MoE) model, I know it’s a total beast compared to smaller models I've used before. I’ve been looking at my current setup—a single RTX 4090 with 24GB of VRAM—and I’m starting to realize that’s probably not going to cut it, even with heavy 4-bit quantization or GGUF offloading to system RAM. I'm curious if anyone has successfully run this on a multi-GPU setup, like dual 3090s/4090s or maybe even a Mac Studio with high unified memory.

I really want to achieve decent tokens-per-second without having to sell a kidney for enterprise-grade H100s. Has anyone found a 'sweet spot' for consumer or prosumer hardware that can actually handle this model efficiently? If you’re running it locally, what’s your specific GPU setup and what kind of inference speeds are you seeing?


8 Answers
12

Honestly, if you wanna run V3 without breaking the bank, skip the Mac Studio talk and look at a multi-GPU rig built from used cards. I'd suggest grabbing a few NVIDIA RTX 6000 Ada Generation cards if you can swing it, but for a real budget win, go with used NVIDIA RTX 3090 24GB cards (NVLink only bridges them in pairs). Keep in mind four of them is only 96GB of VRAM, so you'd need eight to reach 192GB, and even that falls short of a decent IQ4_XS quant, which at roughly 4.25 bits per weight is around 350GB for a 671B model. Whatever doesn't fit has to be offloaded to system RAM. Used 3090s are still the sweet spot for cost-per-GB right now. gl!


11

ngl trying to run DeepSeek-V3 on a single card is basically impossible unless you wanna wait minutes for one sentence. Since it's a 671B MoE model, you really need like 700GB+ of VRAM to run it at FP8, or at least 400GB+ for decent quantization.
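For anyone who wants to sanity-check those numbers, here is a minimal sketch of the arithmetic: weight storage is roughly parameters times bits per weight divided by 8. The 671B figure is from this thread; the 4.5 bits-per-weight value for a "4-bit" quant is an assumption (real quant formats carry some per-block overhead), and the helper name `weight_gb` is made up for illustration. It ignores KV cache and runtime overhead, so real requirements are higher.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# Ignores KV cache, activations, and runtime overhead, so real needs are higher.

TOTAL_PARAMS = 671e9  # DeepSeek-V3 total parameter count (from the thread)

# Approximate bits per weight. The 4-bit entry is an assumption:
# real quant formats store scales per block and land a bit above 4.0.
BITS_PER_WEIGHT = {
    "FP8": 8.0,
    "4-bit (approx.)": 4.5,
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
```

Under these assumptions the weights alone come to about 671GB at FP8 and about 377GB at 4.5 bits, which lines up with the 700GB+ and 400GB+ ballparks once you leave headroom for context.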

Here's what I recommend based on what I've seen working for others:

1. **The Mac Studio Route:** Honestly, an Apple Mac Studio M2 Ultra with 192GB Unified Memory is probably the most "cost-effective" way to get it running without a server rack. It won't be blazing fast, and at 192GB you're limited to very aggressive quants (well under 4-bit), but the unified memory lets you keep the whole model in one pool.
2. **Multi-GPU PC:** If you wanna stick to PC, you're looking at a 4x or 8x GPU setup. People are using NVIDIA GeForce RTX 3090 24GB cards in NVLink pairs because they're cheaper than NVIDIA GeForce RTX 4090 24GB and have the same VRAM. Four of them is only 96GB though, so for the 4-bit version you'd still be offloading most of the weights to system RAM.

Personally, I think the Mac Studio is the "sweet spot" for prosumers cuz you don't have to deal with massive power supplies and heat, but yeah, it's still pretty pricey!
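To compare the two routes, a quick sketch of how many cards it takes to cover a given memory target. The 400GB target and the 24GB and 48GB card sizes come from this thread; the `usable_fraction` headroom of 0.9 (space left for KV cache and buffers) and the function name `cards_needed` are assumptions for illustration, not measured values.

```python
import math


def cards_needed(model_gb: float, card_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest number of cards whose usable VRAM covers model_gb.

    usable_fraction is an assumed headroom factor, not a measured value.
    """
    usable_per_card = card_gb * usable_fraction
    return math.ceil(model_gb / usable_per_card)


if __name__ == "__main__":
    target_gb = 400  # the "400GB+" quantized ballpark mentioned above
    for card_gb in (24, 48):
        n = cards_needed(target_gb, card_gb)
        print(f"{card_gb}GB cards needed for {target_gb}GB: {n}")
```

With these assumptions a fully in-VRAM 4-bit setup needs far more than four 24GB cards, which is why most consumer builds end up offloading part of the model to system RAM.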


3

Regarding what #2 said about "ngl trying to run DeepSeek-V3 on a single..." - he's totally right. I tried making it work on my old rig and unfortunately it was a mess. Had issues with memory leaks and the heat was honestly terrifying. It just wasn't as good as expected for real work and I ended up pretty disappointed with the whole experience... If you're worried about reliability like I am, just stick with NVIDIA. You really can't go wrong with their professional line tho. Just get any of those workstation-grade cards because the cooling and driver support are actually built for this kind of constant load. It's way better than worrying about your PC crashing in the middle of a deep research session. Reach out if you need more info on how to set it up, I can probably walk you through the driver stuff.


3

Just catching up on this thread. I tried setting up a multi-card rig a few weeks ago for these giant models and it was a bit of a nightmare tbh. I thought having enough total memory was the only thing that mattered but I was wrong. Here is what I learned from my struggle:

  • Be careful with how you space the cards because the heat was insane.
  • You might want to consider the power draw; I actually tripped a breaker twice before upgrading my wall outlet.
  • Make sure to look at the motherboard lanes because my second card was running at x4 speed and it was painfully slow.

Honestly, I ended up with something that barely gave me a few tokens a second. I would suggest really looking into the bandwidth between cards before buying anything because that was the biggest bottleneck for my performance.
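To put the x4 problem in numbers, here is a rough sketch of theoretical PCIe link bandwidth by generation and lane count. The per-lane transfer rates and 128b/130b encoding are the standard figures for PCIe 3.0 through 5.0; assuming a PCIe 4.0 slot is my guess, since the post does not say which generation the board was. Real-world throughput is lower than these theoretical peaks.

```python
# Theoretical one-direction PCIe bandwidth. Real throughput is lower
# because of protocol overhead.

GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # transfer rate per lane, by generation
ENCODING = 128 / 130                   # 128b/130b line encoding (gen 3 and later)


def link_gb_per_s(gen: int, lanes: int) -> float:
    """Peak bandwidth in GB/s for a link of the given generation and width."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


def transfer_seconds(gigabytes: float, gen: int, lanes: int) -> float:
    """Best-case time to move `gigabytes` of data across the link."""
    return gigabytes / link_gb_per_s(gen, lanes)


if __name__ == "__main__":
    for lanes in (16, 4):
        bw = link_gb_per_s(4, lanes)  # assumed PCIe 4.0
        secs = transfer_seconds(24, 4, lanes)
        print(f"x{lanes}: {bw:.1f} GB/s, {secs:.2f} s to move 24 GB")
```

An x4 link has a quarter of the bandwidth of x16 at the same generation, so anything that streams weights or activations over that slot takes about four times as long in the best case.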


2

sooo i tried running it on my current setup and honestly... it was pretty brutal lol. unfortunately i had issues with speed and it just wasn't as good as expected on a single card. tbh for something this massive, you basically gotta look at Apple. i mean just get any high-memory Mac with unified RAM cuz it's way easier to deal with than stacking cards. gl!


1

Subbing for updates


1

tbh i tried a few different builds for these massive models recently and it was unfortunately way more frustrating than i thought it would be. had issues with bandwidth bottlenecks and the speed just wasn't as good as expected for real-time use. since you're looking to diy this without going broke, here is the direction i would head in:

  • go with older professional cards from NVIDIA, you really can't go wrong with the ones that have high vram
  • just get any server chassis that supports multiple double-width cards
  • stick to enterprise-grade stuff from NVIDIA if you want to avoid the memory errors i kept hitting

it sucks that consumer stuff is falling behind so fast tho. if you run into driver issues or need help with the power supply side of things just hit me up... i've spent way too long troubleshooting these things myself.
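On the power supply side, a quick budget check before buying cards can save a tripped breaker. This is only a sketch under stated assumptions: a 120V/15A household circuit, the common rule of thumb of loading a circuit to 80% for sustained draw, 350W per GPU, and 300W for the rest of the system. All of those numbers and the function names are illustrative, so swap in your own.

```python
def continuous_limit_watts(volts: float, amps: float, derate: float = 0.8) -> float:
    """Sustained load a circuit can carry, using an assumed 80% rule of thumb."""
    return volts * amps * derate


def system_watts(gpu_watts: float, gpu_count: int, rest_of_system_watts: float) -> float:
    """Total draw: all GPUs at full board power plus everything else."""
    return gpu_watts * gpu_count + rest_of_system_watts


def fits_circuit(gpu_count: int, gpu_watts: float = 350.0,
                 rest_of_system_watts: float = 300.0,
                 volts: float = 120.0, amps: float = 15.0) -> bool:
    """True if the estimated draw stays within the circuit's sustained limit."""
    total = system_watts(gpu_watts, gpu_count, rest_of_system_watts)
    return total <= continuous_limit_watts(volts, amps)


if __name__ == "__main__":
    for count in (2, 4):
        total = system_watts(350.0, count, 300.0)
        print(f"{count} GPUs: ~{total:.0f} W, fits 15A/120V circuit: {fits_circuit(count)}")
```

With these example numbers a four-card build lands above what a single 15A circuit should carry continuously, so power-limiting the cards or using a dedicated higher-capacity circuit is worth planning for.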


1

Quick reply while I have a sec. @Reply #6 - good point! Bandwidth and heat are the real killers when you scale up these builds. I have been running a multi-GPU workstation for a few years now and honestly, I am super satisfied with the workstation route over gaming cards.

I ended up snagging four NVIDIA RTX A6000 48GB units used. Since they are blower style cards, I don't have to worry about them choking each other out like open-air 4090s do in a tight case. With 192GB of VRAM, you can run DeepSeek-V3 at an IQ2_M quantization. It is not full precision, sure, but for coding and logic tasks it works well and the perplexity loss is basically unnoticeable in daily use. No complaints so far, it has been rock solid for months.

The key is getting a motherboard that actually has the lanes. I use the ASRock Rack ROMED8-2T which handles the PCIe traffic properly. If you try to run these massive models on a standard consumer board with split lanes, you are gonna hit a massive bottleneck on the MoE switching.

TL;DR: Grab used A6000 48GB cards for the VRAM density. You will need at least four to even load the model at a heavy quant, but it is the most stable way to do it without buying an actual server rack.
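The lane point can be sketched as a simple budget: take the PCIe lanes the CPU exposes and see what link width each card gets if they are shared evenly. The lane counts below (24 for a typical consumer platform, 128 for a server-class one) are illustrative assumptions, not specs for any particular board, and real boards wire slots in fixed ways that this ignores. The function name `lanes_per_card` is made up for the example.

```python
def lanes_per_card(cpu_lanes: int, cards: int, reserved: int = 0,
                   allowed_widths: tuple = (16, 8, 4, 1)) -> int:
    """Widest standard link width every card can get with an even split.

    Returns 0 if even x1 per card does not fit in the available lanes.
    """
    available = cpu_lanes - reserved
    for width in allowed_widths:
        if width * cards <= available:
            return width
    return 0


if __name__ == "__main__":
    # Illustrative lane counts, not specs for a specific CPU or board.
    for label, lanes in (("consumer platform", 24), ("server platform", 128)):
        width = lanes_per_card(lanes, cards=4)
        print(f"{label} ({lanes} lanes), 4 cards: x{width} each")
```

Under these assumptions four cards on a consumer platform drop to x4 each, while a server-class platform can keep all four at x16.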

