What are the best practices for teaching agents new tool-use skills?

Question

I've been building agents for a while now, mostly simple stuff with LangChain or hitting the OpenAI API directly, and usually it's fine. But lately I've been trying to teach a new agent to use a very specific legacy database tool for a logistics firm I'm consulting for in Chicago. We have a deadline on Friday and I'm honestly panicking a bit because the model just won't follow the specs. My logic was that if I gave it a really clean JSON schema in the system prompt it would just 'get' it, you know? It knows how to call a weather API, so why is it struggling with a custom SQL wrapper?

So I was thinking maybe the documentation I'm feeding it is too dense. I tried stripping it down to just the essential parameters, but then it started hallucinating arguments that don't even exist in my codebase. It's super frustrating because I thought I had the basics down. I even tried few-shot examples in the prompt, literally walking it through three different scenarios of how to query the shipment IDs, but it still trips up when the user input gets even a little bit ambiguous.

Then I started wondering whether I should fine-tune a smaller model specifically for this tool, or whether that's overkill for a single project. I really don't want to spend the time or the compute budget on fine-tuning if there's a better way to structure the 'teaching' process via the prompt or a middleware layer. Has anyone found a sweet spot for how much context to provide vs. how much to rely on the model's inherent reasoning? I'm worried that if I give it too much info it gets 'lost in the middle', but if I give it too little it just guesses. My brain is kinda fried trying to figure out if there's a standardized 'curriculum' for teaching these models proprietary tools without them losing their minds. Should I be using a separate 'planner' agent just to handle tool selection, or does that just add latency and more points of failure? I'm just looking for what actually works in production when the stakes are high...

PorkPiePassion · Accepted Answer

Saw this a bit late, but I've been down this road way too many times with old SQL wrappers. Honestly, dumping the raw schema into the prompt is usually a recipe for disaster, because legacy DBs have weird naming conventions that throw off the model's reasoning. In my experience, building a shim layer is way more effective than fine-tuning. I usually write high-level Python functions that act as a simplified interface, and then I expose only those to the agent. It keeps the context window clean and cuts down on the hallucinated arguments you're seeing. Lately I've switched to Claude 3.5 Sonnet through the Anthropic API for this stuff, and its tool use is just way more robust. It handles ambiguity way better than GPT-4o in my testing. Also, definitely try an error-correction loop where you feed the tool's error message back to the agent... it usually fixes itself on the second try without you needing a separate planner agent.
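To make the shim and the error-correction loop a bit more concrete, here's a rough sketch of what I mean. Everything in it is made up for illustration: the table `SHPMT_HDR_T`, the columns `SHP_ID` and `STS_CD`, the `S-` ID format, and the function names are all hypothetical, and `fake_model` is just a stand-in for a real LLM call so the example runs on its own with an in-memory SQLite database. Treat it as a starting point, not a drop-in implementation.

```python
import sqlite3
from typing import Callable, Optional


def make_connection() -> sqlite3.Connection:
    """In-memory stand-in for the legacy DB (hypothetical table/column names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE SHPMT_HDR_T (SHP_ID TEXT PRIMARY KEY, STS_CD TEXT)")
    conn.execute("INSERT INTO SHPMT_HDR_T VALUES ('S-1001', 'DLV')")
    conn.commit()
    return conn


# Hypothetical mapping from legacy status codes to readable names.
STATUS_NAMES = {"DLV": "delivered", "TRN": "in_transit"}


def get_shipment_status(conn: sqlite3.Connection, shipment_id: str) -> dict:
    """The only function the agent sees; the legacy names stay hidden in here."""
    if not isinstance(shipment_id, str) or not shipment_id.startswith("S-"):
        raise ValueError("shipment_id must be a string like 'S-1001'")
    row = conn.execute(
        "SELECT STS_CD FROM SHPMT_HDR_T WHERE SHP_ID = ?", (shipment_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no shipment with id {shipment_id!r}")
    return {"shipment_id": shipment_id, "status": STATUS_NAMES.get(row[0], row[0])}


def run_with_retries(
    propose_call: Callable[[Optional[str]], dict],
    conn: sqlite3.Connection,
    max_attempts: int = 2,
) -> dict:
    """Ask the model for arguments; on failure, feed the error text back and retry."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        args = propose_call(feedback)
        try:
            return get_shipment_status(conn, **args)
        except (ValueError, LookupError, TypeError) as exc:
            # TypeError also covers invented keyword arguments.
            feedback = f"Tool call failed: {exc}. Fix the arguments and try again."
    raise RuntimeError(f"gave up after {max_attempts} attempts; last feedback: {feedback}")


def fake_model(feedback: Optional[str]) -> dict:
    """Stand-in for the LLM: wrong ID format first, corrected after seeing feedback."""
    return {"shipment_id": "1001"} if feedback is None else {"shipment_id": "S-1001"}


if __name__ == "__main__":
    print(run_with_retries(fake_model, make_connection()))
```

The important design choice is that the error messages are written for the model to read: they say what was wrong and what a valid value looks like, so the second attempt has something to work with. In a real setup, `propose_call` would wrap your model call and pass the feedback string back as the tool result.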

ypdlggvvrv · Answer

Man, I totally feel your pain. I tried building a tool-use agent for a local warehouse last month, and unfortunately just dumping a schema into the prompt didn't work nearly as well as I expected. The model kept hallucinating table names that weren't even there... it was super frustrating. I really wanted it to work out of the box, but it just didn't. Here's what I found actually helps:

Upgrade to OpenAI's GPT-4o, because its native tool calling is way more robust for legacy stuff than the older models.

Use a middleware layer like LangGraph (from the LangChain team) to force a validation step before the tool actually executes.

Give the model a tiny dictionary of what the legacy terms actually mean in plain English so it stops guessing.

Don't give up tho! It's a steep learning curve, but you will get it. If you need more help with the specific prompt structure or how to set up the validation, just let me know!
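In case it helps, here's a minimal sketch of the validation step and the glossary idea in plain Python. To be clear, this is not LangGraph code; it only shows the check you would run before executing a tool call, whichever framework ends up hosting it. The tool name `get_shipment_status`, the spec format, and the glossary entries are all assumptions I made up for the example.

```python
from __future__ import annotations

# Hypothetical plain-English glossary for legacy terms.
GLOSSARY = {
    "SHP_ID": "shipment ID",
    "STS_CD": "shipment status code",
}

# Hypothetical tool specs: argument name -> expected Python type.
TOOL_SPECS = {
    "get_shipment_status": {"required": {"shipment_id": str}, "optional": {}},
}


def glossary_prompt(glossary: dict[str, str]) -> str:
    """Render the glossary as a short block to paste into the system prompt."""
    lines = [f"- {term}: {meaning}" for term, meaning in sorted(glossary.items())]
    return "Legacy terms:\n" + "\n".join(lines)


def validate_call(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to run."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool {name!r}"]
    allowed = {**spec["required"], **spec["optional"]}
    problems = []
    # Reject arguments the model invented.
    for key in args:
        if key not in allowed:
            problems.append(f"unexpected argument {key!r}")
    # Make sure nothing required was left out.
    for key in spec["required"]:
        if key not in args:
            problems.append(f"missing required argument {key!r}")
    # Check the types of the arguments we do recognise.
    for key, value in args.items():
        if key in allowed and not isinstance(value, allowed[key]):
            problems.append(f"argument {key!r} should be {allowed[key].__name__}")
    return problems


if __name__ == "__main__":
    print(glossary_prompt(GLOSSARY))
    print(validate_call("get_shipment_status", {"shipment_id": "S-1001", "table": "x"}))
```

If `validate_call` returns anything, skip execution and send the problem list back to the model as the tool result so it can correct itself.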