Which cloud provider actually has the best GPU availability and pricing for training a custom LLM right now? Honestly, I'm losing my mind trying to find a single H100 that isn't booked for the next six months. I've got a freelance gig for a small law firm here in Chicago and I'm supposed to have their fine-tuned model ready by the 25th of this month, but my local rig just fried its power supply and I'm basically stuck.
I've been looking at AWS because I already have an account there, but man, the prices for their P4 instances are just insane, and I'm pretty sure I'd blow through my $600 project budget in about three days if I'm not careful. Plus, their interface is such a mess sometimes. Then there's Lambda Labs, which everyone says is the go-to for price, but every time I check their cloud console everything is "out of stock" or "reserved", and I can't wait around for a week just to get access to an A100.
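To put a number on that, here's a quick sketch of the budget math. The only inputs taken from my situation are the $600 budget and the three-day burn estimate; the function names are made up for illustration, and no real provider prices are baked in.

```python
def max_hourly_rate(budget_usd: float, hours: float) -> float:
    """Highest hourly rate that keeps `hours` of runtime within the budget."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return budget_usd / hours


def affordable_hours(budget_usd: float, hourly_rate_usd: float) -> float:
    """How many instance-hours the budget buys at a given hourly rate."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly_rate_usd must be positive")
    return budget_usd / hourly_rate_usd


BUDGET_USD = 600.0
THREE_DAYS_HOURS = 3 * 24  # 72 hours of continuous runtime

# Burning $600 in three days works out to roughly this much per hour.
print(f"{max_hourly_rate(BUDGET_USD, THREE_DAYS_HOURS):.2f}")  # 8.33
```

So anything much above about $8.33/hour running around the clock eats the whole budget in three days. Plugging a provider's actual quoted rate into `affordable_hours` shows how long a run I can really afford there.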
I'm also considering CoreWeave or maybe even Paperspace, but I've never used either before and I'm worried about the learning curve on such a tight timeline. Is the performance on CoreWeave actually better for heavy training runs, or is it just hype? I really need something where I can just spin up a machine, dump my data, run the fine-tuning script, and be done, without jumping through ten hoops of "vCPU limits" and "quota requests" like you have to do with Google Cloud or Azure.
Seriously, if I can't get this started by Monday I'm going to have to tell the client I'm behind, and that's not gonna look good. Does anyone know if Lambda's availability has gotten better recently, or if there's some smaller provider I'm missing that has decent 80GB A100s for a reasonable hourly rate? I'm leaning toward just biting the bullet on AWS spot instances, but I'm terrified of getting kicked off mid-training and losing my progress. What would you guys do in this spot?
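For the spot-interruption worry, the usual mitigation is to checkpoint regularly and resume from the last save. Below is a minimal, framework-agnostic sketch of that pattern, not a real fine-tuning script: `CHECKPOINT_PATH`, `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names, the training step is a placeholder, and `stop_at` exists only to simulate getting kicked off. A real run would also need to save model and optimizer state to storage that outlives the instance.

```python
import json
import os
import tempfile

# Hypothetical path; in practice this should live on storage that survives the instance.
CHECKPOINT_PATH = "checkpoint.json"


def save_checkpoint(state: dict, path: str = CHECKPOINT_PATH) -> None:
    """Write the state to a temp file, then rename it into place atomically,
    so an interruption mid-write cannot leave a corrupt checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path: str = CHECKPOINT_PATH) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, save_every: int,
          path: str = CHECKPOINT_PATH, stop_at=None) -> int:
    """Run from the last checkpointed step; `stop_at` simulates an interruption."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return step  # simulated interruption: work since the last save is lost
        # Placeholder: one real training step would go here.
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["step"]
```

Calling `train(10, 3, stop_at=7)` and then `train(10, 3)` again picks up from step 6, the last save, instead of step 0. The worst case is losing `save_every` steps of work, so that interval is the knob to trade checkpoint overhead against lost progress.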
Honestly, you should check out RunPod! I had the same nightmare last month and was totally losing it. Their community cloud is seriously amazing for quick setups without the corporate headache of AWS. I managed to snag an NVIDIA A100 80GB SXM4 for way cheaper than the big guys. It's super fast to get going, basically just click and run. It totally saved my skin!