Hey everyone! I’ve been diving deep into the world of local LLMs lately, and I’m finally ready to tackle the DeepSeek-67B model. I’ve heard amazing things about its coding and reasoning capabilities, but as we all know, a 67B parameter model is a bit of a beast when it comes to hardware requirements. I’m currently planning a dedicated build for inference and I’m feeling a little stuck on the memory side of things.
Since I’m planning to run this at 4-bit or 5-bit quantization (likely GGUF or EXL2), I know I need a significant amount of VRAM, or high-speed system RAM if I go the shared route. My main concern is memory-bandwidth bottlenecking. I’ve seen some people swear by multi-GPU setups (like dual 3090s or 4090s) to keep everything in VRAM, while others are experimenting with Apple Silicon's unified memory or high-bandwidth DDR5 setups on workstation boards.
My current budget for the memory/GPU component is around $2,500, and I’m trying to figure out the sweet spot for tokens per second. If I go the PC route, should I prioritize a motherboard that supports 128GB of high-speed DDR5, or is it better to scrounge for used 24GB cards to keep the model entirely on the GPU? I’m also curious about the impact of memory channels—does moving from dual-channel to quad-channel make a noticeable difference for a model of this scale when offloading?
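For reference, here’s the napkin math I’ve been using for the quantized footprint — the bits-per-weight figures are rough numbers I’ve seen quoted for K-quants, so treat them as assumptions:

```python
# Rough model footprint: parameters (in billions) * effective bits-per-weight / 8
# gives GB on disk/VRAM. KV cache and runtime buffers come on top of this.
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

# ~4.8 and ~5.7 bpw are ballpark effective rates I've seen for Q4_K_M / Q5_K_M.
for bpw in (4.8, 5.7):
    print(f"{bpw} bpw -> ~{quant_size_gb(67, bpw):.0f} GB")
```

By that math, 4-bit lands around 40 GB (squeaks onto 2x24 GB with some room for context), while 5-bit pushes toward 48 GB and gets very tight.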
I really want to avoid a situation where I spend a fortune only to get 1-2 tokens per second because of a bandwidth bottleneck. Has anyone here successfully optimized a rig specifically for DeepSeek-67B? What’s the best RAM/VRAM configuration you’ve found to get smooth, usable inference speeds without breaking the bank on enterprise-grade hardware?
For your situation, I gotta be honest... I've tried the system RAM route and it's a total bottleneck. I'd definitely suggest keeping that beast entirely in VRAM if you want it to be actually usable. I personally went with dual NVIDIA GeForce RTX 3090 24GB cards and it's the only way I found to get decent speeds. Be careful with DDR5 though; even quad-channel tops out around 190 GB/s, a fraction of a single 3090's ~936 GB/s. Seriously, just scrounge for the 3090s and you'll be way happier lol. gl!
Just catching up on this thread, and honestly these guys are spot on about the bandwidth: the consensus is that VRAM is king if you actually want to use this model for more than a science experiment. System RAM is just too slow for a beast like DeepSeek-67B. Tbh, I went through a similar phase where I thought I could save cash with a high-end CPU and tons of DDR5, but the 1-2 tokens per second drove me crazy.
I ended up grabbing two used NVIDIA GeForce RTX 3090 24GB cards for around $750 each and paired them on a motherboard with decent spacing. The 4-bit GGUF version fits comfortably across that 48GB of VRAM and runs super smooth. Ngl, it's the best $1,500 I've spent on my rig. If you want to stay under your $2,500 budget, I'd definitely grab two of those and maybe a solid power supply like the Corsair RM1000x 1000 Watt 80 Plus Gold to handle the spikes. My lesson learned? Don't even bother with system RAM offloading for 67B models... it's just gonna frustrate you. Keep it all on the GPUs and you'll be way happier with the speed! 👍
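If it helps, this is roughly how I reasoned about splitting the layers between the two cards. Hypothetical numbers here — I believe the 67B has around 95 layers, but check the model's config yourself, and leave headroom on one card for the KV cache:

```python
# Split a model's layers across GPUs in proportion to the VRAM each has free,
# while making the per-GPU counts add up to the total exactly.
def split_layers(n_layers: int, free_gb: list) -> list:
    total = sum(free_gb)
    layers = [round(n_layers * g / total) for g in free_gb]
    layers[-1] = n_layers - sum(layers[:-1])  # absorb rounding on the last GPU
    return layers

# ~95 layers (my assumption for DeepSeek-67B); GPU0 keeps ~4 GB back for
# KV cache and scratch buffers, GPU1 is fully available.
print(split_layers(95, [20.0, 24.0]))
```

That ratio is basically what you'd feed into your loader's tensor-split option, tweaked until neither card OOMs.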
yo, jumping in here! basically, LLM speeds are all about memory bandwidth. over the years, I've tried many setups and learned that system RAM is like... painfully slow compared to VRAM. honestly, moving to a quad-channel board helps, but it still feels like a slog. you might find this useful—check out the 'LLM-Perf' leaderboard or the 'Can I Run It' calculator on Hugging Face to see how bandwidth affects tokens. ngl, my current setup relies on high-speed cards because the bottlenecking on DDR5 is really real. gl!!
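to actually put numbers on the bandwidth thing: batch-1 decode has to stream basically the whole model once per token, so bandwidth divided by model size is a hard ceiling on tokens/sec. napkin math below — the bandwidth figures are nominal spec numbers, real-world is lower:

```python
# Upper bound on single-stream decode speed: memory bandwidth / bytes read
# per token, which is roughly the quantized model size.
def max_tps(bw_gb_per_s: float, model_gb: float) -> float:
    return bw_gb_per_s / model_gb

model_gb = 40  # ~4-bit 67B, rough figure
setups = {
    "DDR5-6000 dual-channel (~96 GB/s)": 96,
    "DDR5-6000 quad-channel (~192 GB/s)": 192,
    "RTX 3090 GDDR6X (936 GB/s)": 936,
}
for name, bw in setups.items():
    print(f"{name}: <= {max_tps(bw, model_gb):.1f} t/s ceiling")
```

which lines up with the 1-2 t/s people report on dual-channel DDR5, and why the 3090s feel like a different universe.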
Saw this earlier but just getting a chance to reply now. If you are going the multi-GPU route for DeepSeek-67B, you really need to prioritize your power supply and cooling or you will get random crashes. Those used 3090s can spike way past their rated TDP during heavy inference. I would go with the be quiet! Dark Power Pro 13 1600W since it is ATX 3.0 compliant and handles those transient spikes much better than older units. For the actual cards, if you can find the MSI GeForce RTX 3090 Gaming X Trio 24GB, those tend to have slightly better stock cooling than the basic blower models. Pair that with a Fractal Design Torrent case because it has the best airflow for multi-GPU setups without going for an open-air rack. Even if you are doing 4-bit entirely on VRAM, grab a Kingston FURY Beast 64GB DDR5 6000MHz kit just so your system does not crawl when you are swapping models or running context-heavy tasks. Just watch out for the sag on those heavy cards, you will definitely need a support bracket.
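For sizing the PSU, here is the rough budget I use. The transient multiplier is my own rule of thumb from reviews of Ampere cards, not an official spec:

```python
# Rough worst-case PSU budget: GPU rated power times a transient multiplier
# (Ampere cards spike well past TDP for milliseconds), plus the rest of the
# system (CPU, board, drives, fans).
def psu_budget_w(gpu_tdp_w: float, n_gpus: int, rest_w: float,
                 transient: float = 1.7) -> float:
    return gpu_tdp_w * n_gpus * transient + rest_w

# Two 350 W 3090s plus ~250 W for everything else.
print(f"~{psu_budget_w(350, 2, 250):.0f} W worst-case")
```

That lands around 1,400-1,500 W worst-case for a dual-3090 rig, which is exactly why I'd rather have a 1600 W ATX 3.0 unit than a 1000 W one running at its limit.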
Caught this thread a bit late but yeah, Bob_Am is spot on about needing multiple cards for this model. Unfortunately, even once you get the hardware set up, the software side can be a real letdown. I had issues with P2P communication between cards on a consumer board and it totally tanked the speeds I was expecting. It's just not as good as it looks on paper when the driver overhead starts eating into your performance... definitely check if your board handles P2P traffic well or you'll be disappointed with the results.
I'm still pretty new to the hardware side of LLMs, but I’ve been doing some market research because I’m honestly TERRIFIED of buying used parts that might break. Since you have a $2,500 budget, I’ve been looking at how different brands compare for stability and ease of use. Here’s what I found from a more cautious, "buy-it-new" perspective:
• Apple Mac Studio with M2 Max: I’ve seen a lot of beginners lean towards this because of the unified memory. If you get the 96GB version, it handles a 4-bit 67B pretty easily without needing multiple GPUs (64GB looks tight once you account for the OS). It’s basically plug-and-play, which seems way safer than building a custom water-cooled rig.
• ASUS TUF Gaming GeForce RTX 4090: If you prefer a PC, one of these is around $1,800 new. You’d still need a second card for a 67B model, but maybe a PNY GeForce RTX 4070 Ti Super could work as a secondary? I’m not sure if mixing brands or models like that causes issues though.
• Corsair Vengeance DDR5 128GB RAM kit: If you do go the system RAM route, these kits are pretty affordable now, but like others said, I worry about the speed.
Is it really hard to set up two cards from different brands? I’m just worried about things not being compatible if I don't buy everything from the same manufacturer.
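I did try to sanity-check whether that mixed 4090 + 4070 Ti Super pair would even hold the model. I used ~40 GB for the 4-bit file (a number I saw thrown around in other threads, so an assumption) and assumed you keep a couple GB free per card for context:

```python
# Does a quantized model fit across a set of GPUs, after reserving a little
# VRAM on each card for context and runtime buffers?
def fits(model_gb: float, vram_gb: list, reserve_gb: float = 2.0) -> bool:
    return model_gb <= sum(v - reserve_gb for v in vram_gb)

print(fits(40, [24.0, 16.0]))  # RTX 4090 + 4070 Ti Super (40 GB total)
print(fits(40, [24.0, 24.0]))  # dual RTX 3090 (48 GB total)
```

If my 40 GB figure is right, the 24+16 combo comes up short once you reserve anything for context, while two 24GB cards clear it. Someone correct me if the math is off!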
Stumbled on this topic and honestly, if you're going the DIY route with a $2,500 budget, you can actually build a dedicated rig that crushes this. Most people stop at two cards, but for DeepSeek-67B, you really want that extra headroom for higher quants or longer context windows. Basically, I’d skip the fancy workstation boards and go for a "server-lite" DIY approach to keep everything on the GPUs:
• Aim for three used NVIDIA GeForce RTX 3090 cards. If you hunt around, you can get them for about $750 each. That 72GB of total VRAM is the sweet spot for running Q5 or Q6 quants with a massive context window without any slowdowns.
• Use an open-air frame like the Veddha 6-GPU Mining Case. It sounds janky, but keeping those cards cool is the biggest challenge for long inference sessions. Standard cases just can't breathe with that much heat.
• Don't cheap out on the power supply. Something like the EVGA SuperNOVA 1600 G+ is basically mandatory to handle the transient spikes from multiple high-end cards. It takes a bit more work to set up compared to a pre-built, but staying 100% in VRAM is the only way to avoid that 1-2 t/s crawl. Works great tho!!!
Adding my two cents since I've spent way too much time benchmarking these 60B+ models lately. Everyone focuses on the total VRAM but the real killer for DeepSeek-67B is the prompt ingestion time if your PCIe lanes are gimped. If you're building from scratch with 2500 bucks, you gotta look at the bandwidth between the cards or you'll be waiting ten seconds for the model to even start talking.
I have been obsessing over the hardware compatibility side of this lately because I am kinda terrified of buying parts that won't actually fit together physically. I mean, even if you get the VRAM right, there are so many weird technical snags, like card thickness and power cable clearance, that nobody mentions until your parts are sitting on the floor. Measure your PCIe slot spacing and case clearance against the actual card dimensions before you order anything; that is the main thing I wish someone had told me.