I've been spending way too much money on API calls for my coding side projects lately and finally decided it's time to just build a dedicated rig for local LLMs. I've been looking specifically at DeepSeek 67B because the coding performance looks insane compared to Llama 3 for my specific workflow. I did some digging on Reddit, and some people say a single RTX 4090 is enough if you use 4-bit quantization, but others claim you really need dual 3090s to get any decent speed, or to run it at higher precision without it crawling at 1 token per second. I'm really torn because my budget is around 2,300 euros and I'm based in Berlin, so electricity is pretty pricey here. Running two older 3090s sounds like a space heater and a power hog. I also saw some folks mentioning the Mac Studio with M2 Ultra, but that's way out of my price range for the RAM I would need. If I go the 4090 route, will I regret the 24GB VRAM limit almost immediately? Or should I look into those used Tesla cards? It's all a bit overwhelming trying to figure out the actual VRAM math for DeepSeek specifically when you factor in the context window. What are you guys actually using to get smooth performance out of this model?
Just saw this. Over the years I've realized VRAM math is a lie once context kicks in. Quick question tho: what's your target context length? I tried a single 4090 and it crawled once I fed it a large file.
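For anyone who wants to sanity-check the VRAM numbers themselves, here's a rough back-of-the-envelope sketch. It only counts the quantized weights plus the KV cache. The architecture values (95 layers, 8 KV heads, head dim 128) are my assumption for DeepSeek 67B, so check them against the model's config before trusting the output. The function names are made up for illustration, and real usage will be higher because of activation buffers, framework overhead, and the extra bits most quant formats carry for scales.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB for one sequence.

    The leading 2 accounts for storing both a key and a value
    vector per layer, per KV head, per token. bytes_per_elem=2
    assumes an fp16 cache.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elems * bytes_per_elem / GIB


# ASSUMED values for DeepSeek 67B; verify against the model config.
N_PARAMS = 67e9
N_LAYERS = 95
N_KV_HEADS = 8
HEAD_DIM = 128

for bits in (4, 8, 16):
    w = weights_gib(N_PARAMS, bits)
    kv = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, context_tokens=4096)
    print(f"{bits:>2}-bit weights: {w:6.1f} GiB + KV cache @4096 tokens: "
          f"{kv:.2f} GiB = {w + kv:6.1f} GiB")
```

Under those assumptions, the 4-bit weights alone come to roughly 31 GiB, which is already over the 24GB on a single 4090 before any context is loaded. That would mean offloading layers to system RAM, which would explain the crawling. The same estimate lands under 48GB for two 3090s at 4-bit, but not at 8-bit.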
Definitely grab two used NVIDIA GeForce RTX 3090 24GB GDDR6X cards! That 48GB total VRAM is amazing for 4-bit quants and fits your budget perfectly, unlike a single 4090!
I went through this exact same headache last year when I started hosting my own coding assistants. I started with just one high-end card thinking 24GB would be plenty, but I hit a wall fast once my context window grew. To run this model comfortably, you're likely gonna need more than one card.
Same here!
I've spent several months testing hardware configurations for DeepSeek 67B and unfortunately, the reality of local hosting is quite frustrating. My experience with a single card was particularly underwhelming once the context grew. It just wasn't as good as I expected.