So I’ve been messing around with LLM agents for a few years now, mostly standard RAG and support bots, but I just hit a massive wall trying to build something way more advanced. I’m working on a B2B supply chain tool for a pilot launch in October, and I need these agents to handle actual complex negotiations. It’s not just about asking for a price; it’s about trading off variables like lead times vs. bulk discounts vs. payment terms.
Right now the agent is either just too nice, or it totally loses the plot after three turns and forgets its walk-away price. I’ve tried some basic chain-of-thought and I even did a bit of fine-tuning on Llama 3 with some old sales logs, but it still feels like it’s just roleplaying a pushover. I really need it to be strategic and actually push back when the 'human' side gets aggressive or tries to lowball.
Here’s what I’m looking at right now:
Should I be looking into reinforcement learning, or maybe some kind of game-theoretic search like Monte Carlo tree search (MCTS) over the decision tree? I’m worried MCTS might be overkill, but honestly regular prompting isn’t cutting it for these multi-step trade-offs. Has anyone actually successfully built a 'hard' negotiator that doesn’t just hallucinate concessions? I feel like I’m missing something obvious in how to model the state...
I've been through this exact nightmare with procurement bots. The niceness is baked in by the RLHF, so you really have to fight it. Skip the MCTS for now... it's a massive time sink for a 6-week deadline. Instead, use a dual-agent architecture in LangGraph: one agent drafts the response, and a Hard-Nosed Auditor agent reviews it strictly against your walk-away price and variables before anything is sent. In my experience, you need better reasoning than base Llama 3 for these trade-offs. Here is what I would do:
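For anyone who wants to see the draft-then-audit loop in code, a rough sketch in plain Python might look like the following. It deliberately avoids LangGraph so that no library calls are guessed at, and it swaps the LLM auditor for a deterministic check, which is an assumption on my part and not necessarily how the setup above was wired. `Offer`, `Limits`, `audit`, `negotiate_turn`, and `draft_fn` are all hypothetical names; `draft_fn` stands in for whatever LLM call produces a structured offer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Offer:
    unit_price: float     # price per unit we are offering
    lead_time_days: int   # delivery lead time we are promising
    payment_days: int     # net payment terms, e.g. 30 for net-30


@dataclass(frozen=True)
class Limits:
    min_unit_price: float      # the walk-away price
    min_lead_time_days: int    # fastest delivery we can actually commit to
    max_payment_days: int      # longest payment terms we will accept


def audit(offer: Offer, limits: Limits) -> list[str]:
    """Return a list of violations; an empty list means the draft may be sent."""
    problems = []
    if offer.unit_price < limits.min_unit_price:
        problems.append(
            f"unit_price {offer.unit_price} is below the floor {limits.min_unit_price}"
        )
    if offer.lead_time_days < limits.min_lead_time_days:
        problems.append(
            f"lead_time_days {offer.lead_time_days} is shorter than {limits.min_lead_time_days}"
        )
    if offer.payment_days > limits.max_payment_days:
        problems.append(
            f"payment_days {offer.payment_days} exceeds {limits.max_payment_days}"
        )
    return problems


def negotiate_turn(
    draft_fn: Callable[[list[str]], Offer],  # hypothetical LLM drafter
    limits: Limits,
    max_attempts: int = 3,
) -> Optional[Offer]:
    """Ask the drafter for an offer, audit it, and feed violations back on retry."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        offer = draft_fn(feedback)
        feedback = audit(offer, limits)
        if not feedback:
            return offer
    # No compliant draft: the caller should hold position or escalate to a human.
    return None
```

The point of the design is that nothing reaches the counterparty unless `audit` returns an empty list, so a drafter that caves under pressure gets bounced back with the specific violations instead of being trusted. An LLM reviewer for tone and strategy could sit on top of this, but the numeric limits never depend on it.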
Saw this late but I've been down this road. In my experience, letting an LLM handle the final price is a huge liability. I once saw a pilot fail because a bot hallucinated a massive discount on an Intel Xeon Platinum 8480+ 56-Core 2.0GHz. Honestly, just keep your walk-away constraints in a hard-coded lookup table outside the prompt. It's way safer and saves your budget from token bloat.
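A minimal sketch of that lookup-table idea, assuming the model only ever proposes a price and plain code decides what actually goes out. The SKU, the numbers, and the names `PRICE_TABLE`, `final_price`, and `can_accept` are all made up for illustration.

```python
from decimal import Decimal

# Hypothetical table; in practice this would be loaded from a config file or DB,
# and it never appears in the prompt.
PRICE_TABLE = {
    "SKU-EXAMPLE-001": {
        "list_price": Decimal("100.00"),
        "floor_price": Decimal("88.00"),  # walk-away price
    },
}


def _floor(sku: str) -> Decimal:
    """Look up the walk-away price, failing loudly on an unknown SKU."""
    if sku not in PRICE_TABLE:
        raise ValueError(f"no pricing constraints for SKU {sku!r}")
    return PRICE_TABLE[sku]["floor_price"]


def final_price(sku: str, proposed: Decimal) -> Decimal:
    """Raise a model-proposed price to the floor if it undercuts it."""
    return max(proposed, _floor(sku))


def can_accept(sku: str, buyer_offer: Decimal) -> bool:
    """Decide in code, not in the prompt, whether a buyer's offer clears the floor."""
    return buyer_offer >= _floor(sku)
```

With this in place, a hallucinated `Decimal("70.00")` for the example SKU comes back as the floor price, and an unknown SKU raises instead of letting the model improvise a number. `Decimal` is used here to avoid float rounding on money.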