What is the best GPU to run DeepSeek v4 Flash locally?

Question

I need to get DeepSeek v4 Flash running locally by Friday for a client demo here in Seattle, and honestly I'm a bit stuck on the VRAM requirements. My logic was that a single 4090 with 24GB would handle the 4-bit quantization no problem since it's the 'Flash' version, but I'm getting OOM errors as soon as I try to push the context window. I've been running LLMs for years, but this v4 architecture seems to be eating more memory than I anticipated. Should I just bite the bullet and go for dual 3090s for the extra memory pool, or is there a way to optimize this? I'm on a tight $1800 budget and need to know what's actually gonna work without lagging...

rgtuiiutvp · Accepted Answer

Just seeing this now and man, the demo effect is real... I've been building these rigs since the early days of the original Titan cards, and there's nothing worse than an OOM error while a client is watching. Everyone focuses on the weights, but the KV cache is what bites you in the end. For DeepSeek v4 Flash, even at 4-bit, the cache keeps growing with every token of context, and that gets brutal once you start a real convo.

I actually faced this exact choice for a project last fall. Instead of the dual-card route, which can be a headache with heat and power, I hunted for a used NVIDIA RTX A6000 48GB GDDR6. It basically puts that 48GB pool on a single card. I've seen them go for around $1700 on the used market lately if you're lucky. The big plus over your NVIDIA GeForce RTX 4090 24GB GDDR6X isn't the raw clock speed; it's that massive VRAM buffer for long-form inference. In my experience, the 4090 is faster for short bursts, but the A6000 lets you breathe when the conversation goes long and the context starts filling up.

Another thing I've tried is keeping the main GPU and adding an NVIDIA GeForce RTX 3060 12GB GDDR6 just to offload the context cache, but honestly? For a high-stakes demo in Seattle, you want the simplest chain possible. Go for the 48GB single card and don't look back. It'll save your sanity when the client starts asking long-winded questions and the memory pressure spikes.
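To put rough numbers on the KV cache point above, here is a minimal sketch of the usual back-of-the-envelope estimate for a standard attention layout (one key and one value tensor per layer). The layer count, KV head count, and head size below are hypothetical placeholders, not the real DeepSeek v4 Flash values; read the actual numbers from the model's config, and note that models with compressed or latent attention caches will need less than this formula suggests.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size for a plain multi-head / grouped-query layout.

    The leading 2 accounts for storing both a key and a value tensor
    in every layer. bytes_per_value=2 assumes an fp16/bf16 cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Hypothetical placeholder config, NOT the real DeepSeek v4 Flash architecture.
example_cfg = dict(num_layers=48, num_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(context_tokens=ctx, **example_cfg) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache")
```

The takeaway is that the cache scales linearly with context length, so a setup that fits comfortably at a few thousand tokens can still run out of memory once a conversation gets long.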

RichardGrags · Answer

Dude, I totally feel your pain with those OOM errors. v4 Flash is a beast! Honestly, I love the 4090, but if you need that context window for a demo, VRAM is king.

Dual NVIDIA GeForce RTX 3090 24GB GDDR6X: 48GB is a lifesaver for context. Used ones fit your budget!

Single NVIDIA GeForce RTX 4090 24GB GDDR6X: Fast, but 24GB is a wall. Go dual 3090s, it's gonna work like a charm!
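For a quick sanity check of the card options mentioned in this thread, a rough fit test like the sketch below can help. The weight size, KV cache size, and per-card overhead are made-up placeholders, not measured DeepSeek v4 Flash figures, and the helper `fits` is hypothetical. It also assumes a multi-GPU split can use the summed VRAM, which is optimistic, since real splits are rarely perfectly even.

```python
def fits(cards_gib, weights_gib, kv_cache_gib, overhead_per_card_gib=1.5):
    """Return True if weights plus KV cache fit in the usable VRAM.

    overhead_per_card_gib is an assumed reserve for the CUDA context,
    activations, and fragmentation on each card.
    """
    usable = sum(max(card - overhead_per_card_gib, 0) for card in cards_gib)
    return weights_gib + kv_cache_gib <= usable


# VRAM sizes are the cards named in this thread.
options = {
    "1x RTX 4090": [24],
    "2x RTX 3090": [24, 24],
    "1x RTX A6000": [48],
    "RTX 4090 + RTX 3060": [24, 12],
}

# Placeholder workload: substitute your own measured numbers.
weights_gib, kv_cache_gib = 18.0, 6.0

for name, cards in options.items():
    verdict = "fits" if fits(cards, weights_gib, kv_cache_gib) else "does NOT fit"
    print(f"{name:<20} {verdict}")
```

Plugging in the real quantized weight size and the KV cache needed for the longest context planned for the demo gives a more honest answer than looking at the weights alone.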