I need to get DeepSeek v4 Flash running locally by Friday for a client demo here in Seattle, and honestly I'm a bit stuck on the VRAM requirements. My logic was that a single 4090 with 24GB would handle the 4-bit quantization no problem since it's the 'Flash' version, but I'm getting OOM errors when I try to push the context window. I've been running LLMs for years, but this v4 architecture seems to be eating more memory than I anticipated. Should I just bite the bullet and go for dual 3090s for the extra memory pool, or is there a way to optimize this? I'm on a tight $1800 budget and need to know what's actually gonna work without lagging...
Dude, I totally feel your pain with those OOM errors. v4 Flash is a beast! Honestly, I love the 4090, but if you need that context window for a demo, VRAM is king.
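If it helps to see why the context window is what tips you into OOM, here's a rough back-of-the-envelope sketch. Big caveat: every number in it is a made-up placeholder, NOT the real DeepSeek v4 Flash specs (I don't have those), and the KV cache formula is the generic one for standard grouped-query attention. If the model uses some compressed KV scheme, the cache math will be different, so pull the real layer/head counts from the model's config before trusting any of this.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2,
                 batch: int = 1) -> float:
    """Generic KV cache size for standard (grouped-query) attention.

    The leading 2 accounts for one K and one V tensor per layer.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / GIB


def estimate_total_gib(n_params, bits_per_weight, n_layers, n_kv_heads,
                       head_dim, context_len, overhead_gib=1.5):
    """Weights + KV cache + a guessed fixed overhead (activations, runtime)."""
    return (weights_gib(n_params, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len)
            + overhead_gib)


if __name__ == "__main__":
    # HYPOTHETICAL model shape, for illustration only.
    shape = dict(n_params=8 * GIB, bits_per_weight=4,
                 n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (8192, 32768, 131072):
        need = estimate_total_gib(context_len=ctx, **shape)
        verdict = "fits" if need <= 24 else "does NOT fit"
        print(f"context {ctx:>6}: ~{need:.1f} GiB, {verdict} in 24 GiB")
```

The takeaway is that the weights are a fixed cost, but the cache grows linearly with context length, so a model that loads fine can still blow up once you push the window. Plug in the real config values and your target context and you'll see pretty quickly whether one 24GB card is enough or whether you need the bigger pool.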