What is the best GPU to run DeepSeek 67B locally?

Question

I've been spending way too much money on API calls for my coding side projects lately and finally decided it's time to just build a dedicated rig for local LLMs. I've been looking specifically at DeepSeek 67B because the coding performance looks insane compared to Llama 3 for my specific workflow. I did some digging on Reddit: some people say a single RTX 4090 is enough if you use 4-bit quantization, but others claim you really need dual 3090s to get decent speed, or to run it at higher precision without it crawling at 1 token per second. I'm really torn because my budget is around 2300 euros and I'm based in Berlin, so electricity is pretty pricey here... running two older 3090s sounds like a space heater and a power hog. I also saw some folks mentioning the Mac Studio with M2 Ultra, but that's way out of my price range for the RAM I would need. If I go the 4090 route, will I regret the 24GB VRAM limit almost immediately? Or should I look into those used Tesla cards? It's all a bit overwhelming trying to figure out the actual VRAM math for DeepSeek specifically when you factor in the context window. What are you guys actually using to get smooth performance out of this model?

yqdtzzltpm · Accepted Answer

Just saw this. Over the years I've realized the back-of-envelope VRAM math stops holding once context kicks in, because the KV cache grows with every token on top of the weights. Quick question tho: what's your target context length? I tried a single 4090 and it crawled once I fed it a large file.

Hunt for a used NVIDIA RTX A6000 48GB. It's rated at 300W versus 2x 350W for dual 3090s, so it's the more efficient way to get 48GB, which matters at Berlin electricity prices.

Pair it with an efficient PSU like the Seasonic PRIME TX-1000 (1000W, 80+ Titanium) to keep your power bill down.
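For anyone who wants to redo the VRAM math with their own context length, here is a rough sketch. It assumes a dense model where memory is roughly weights plus KV cache plus a fixed overhead. The architecture defaults (95 layers, 8 KV heads, head dimension 128) are my assumption for DeepSeek LLM 67B and should be checked against the model's config.json. The 1.5 GB overhead and the ~4.5 bits per weight for a typical "4-bit" quant are also placeholders, not measured values.

```python
# Rough VRAM estimate for a dense transformer: weights + KV cache + overhead.
# Architecture defaults are ASSUMED values for DeepSeek LLM 67B; verify them
# against the model's config.json before relying on the numbers.

def weight_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens, n_layers=95, n_kv_heads=8,
                head_dim=128, bytes_per_value=2):
    """KV cache in GB: one key and one value vector per layer per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


def total_gb(params_billion, bits_per_weight, context_tokens, overhead_gb=1.5):
    """Weights + KV cache + a flat allowance for buffers (overhead is a guess)."""
    return (weight_gb(params_billion, bits_per_weight)
            + kv_cache_gb(context_tokens)
            + overhead_gb)


def fits(vram_gb, params_billion, bits_per_weight, context_tokens):
    """True if the estimate fits entirely in the given amount of VRAM."""
    return total_gb(params_billion, bits_per_weight, context_tokens) <= vram_gb


if __name__ == "__main__":
    for ctx in (2048, 4096, 8192):
        need = total_gb(67, 4.5, ctx)  # 4.5 bits/weight is an assumed average
        print(f"ctx={ctx}: ~{need:.1f} GB needed, "
              f"24GB ok: {fits(24, 67, 4.5, ctx)}, "
              f"48GB ok: {fits(48, 67, 4.5, ctx)}")
```

Even at a flat 4 bits per weight, 67B parameters come to about 33.5 GB before any context, so a 24GB card has to offload part of the model to system RAM. That would be consistent with the crawl on a single 4090.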

eqoyxignqj · Answer

Definitely grab two used NVIDIA GeForce RTX 3090 24GB GDDR6X cards! That 48GB total VRAM is amazing for 4-bit quants and fits your budget perfectly, unlike a single 4090!

3www3 · Answer

I went through this exact same headache last year when I started hosting my own coding assistants. I started with just one high-end card thinking 24GB would be plenty, but I hit a wall fast once my context window grew. To run this model comfortably, you're likely gonna need more than one card.

A low-bit quant fits, but I noticed it lost nuance on logic-heavy tasks.

Speed was snappy at first, but dropped off when my context hit 8k.

One card is okay on power, but the room definitely got warmer.

tbh I eventually caved and grabbed a second used unit to pool the memory because that VRAM limit is very real. It definitely bumped up my electric bill tho, which is something to watch out for if you're in a place with high rates. It gets pretty loud and the heat is honestly no joke.
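If you want to put a number on the electric bill before buying, a quick sketch like this helps. Everything passed in is an assumption you should replace with your own values: the rated board power (350W per 3090, 300W for an A6000) is an upper bound rather than typical inference draw, and the 4 hours per day, 0.35 EUR/kWh, and 92% PSU efficiency are placeholders. Idle draw and the rest of the system are ignored.

```python
# Upper-bound yearly electricity cost for GPUs under load.
# All inputs are placeholders; plug in your own tariff and usage.

def yearly_cost_eur(gpu_watts, n_gpus, hours_per_day, eur_per_kwh,
                    psu_efficiency=0.92):
    """Cost per year in EUR, counting only GPU load time at the wall."""
    wall_watts = gpu_watts * n_gpus / psu_efficiency  # PSU losses add to the draw
    kwh_per_day = wall_watts / 1000 * hours_per_day
    return kwh_per_day * 365 * eur_per_kwh


if __name__ == "__main__":
    setups = {
        "2x RTX 3090 (350W each)": (350, 2),
        "1x RTX A6000 (300W)": (300, 1),
    }
    for name, (watts, count) in setups.items():
        cost = yearly_cost_eur(watts, count, hours_per_day=4, eur_per_kwh=0.35)
        print(f"{name}: up to ~{cost:.0f} EUR/year")
```

With those placeholder numbers, two 3090s at full rated power work out to about 390 EUR a year as a ceiling. Real usage is usually lower, since a model split across two cards rarely keeps both pinned at their limit.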