How do you evaluate problem-solving skills in LLM agents?

Question

I'm trying to build a helper for my plant shop here in Seattle, but I have no clue if it's actually solving anything or just yapping. Sorry if this is a dumb question.

no budget for paid stuff

easy for a non-tech person

need it by Friday

How do I actually measure if it's smart?

fdvfxwisgw · Accepted Answer

Just saw this today. What specific tasks are you actually automating? Diagnosis, or just inventory? In my experience, you should test these two free options side-by-side:

Mistral AI Mistral 7B v0.1: in my experience it handles the more complex reasoning in care guides better than Phi-3 Mini does.

Microsoft Phi-3 Mini 3.8B: incredibly fast for basic logic, but it lacks depth.

Mistral is usually more reliable for actual problem-solving, tbh.

zdqqufodst · Answer

> I'm trying to build a helper for my plant shop here in Seattle, but I have no clue if it's actually solving anything or just yapping.

I had a similar scare when I first tried automating my hobby garden log. I was using Meta Llama 3 8B Instruct on my local machine and it kept telling people that yellow leaves always meant more water... which is a total death sentence for most indoor plants. You really gotta be careful with these things. The yapping is a real problem because LLMs tend to sound confident even when they are totally hallucinating.

I would suggest setting up a simple vignette test. I used to keep a note of 10 specific plant disasters, things like "my monstera has brown crispy edges" or "there are tiny webs on my ivy". If the bot doesn't mention spider mites for the webs, it fails. You basically act like the most difficult customer ever.

Check the reasoning steps too. When you ask it to solve a problem, tell it to think step by step out loud. If the logic is wonky in the middle, the final answer is usually garbage.

Make sure you test it with contradictory info as well. Like, tell it you have a low-light apartment but want a Bird of Paradise. If it says "yes, go for it", then it's just yapping to please you.

I tried Google Gemini 1.5 Flash for some quick testing since it is free, and it caught some of those logic traps better than the smaller models did. Just watch out for generic advice tho... it can get lazy if you don't poke it.
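If you (or a techy friend) want to automate the vignette test, here is a minimal Python sketch of the idea. Everything in it is an assumption for illustration: `VIGNETTES`, `check_reply`, `run_vignettes` and `yappy_bot` are made-up names, and `ask` stands in for whatever function sends a prompt to your bot and returns its reply as text. The check is plain keyword matching, so it is crude and you should still read the replies yourself.

```python
from typing import Callable, Dict, List

# Hypothetical test cases built from the examples in this answer.
VIGNETTES: List[Dict] = [
    {
        "prompt": "There are tiny webs on my ivy. What is going on?",
        "must_mention": ["spider mite"],
        "must_not_say": [],
    },
    {
        # Contradictory request: low light plus a light-hungry plant.
        "prompt": "My apartment is low-light but I want a Bird of Paradise. Should I buy one?",
        "must_mention": ["light"],
        "must_not_say": ["go for it"],
    },
]


def check_reply(reply: str, must_mention: List[str], must_not_say: List[str]) -> bool:
    """Pass only if every required phrase appears and no forbidden phrase does."""
    text = reply.lower()
    has_required = all(phrase in text for phrase in must_mention)
    has_forbidden = any(phrase in text for phrase in must_not_say)
    return has_required and not has_forbidden


def run_vignettes(ask: Callable[[str], str], vignettes: List[Dict]) -> List[str]:
    """Send each vignette to the bot and return the prompts it failed."""
    failures = []
    for case in vignettes:
        reply = ask(case["prompt"])
        if not check_reply(reply, case["must_mention"], case["must_not_say"]):
            failures.append(case["prompt"])
    return failures


def yappy_bot(prompt: str) -> str:
    """Stand-in for a real bot: agrees with everything (assumption, for demo only)."""
    return "Yes, go for it! Plants love attention."


if __name__ == "__main__":
    for prompt in run_vignettes(yappy_bot, VIGNETTES):
        print("FAILED:", prompt)
```

To use it for real, swap `yappy_bot` for a function that calls your own bot, and grow the list to your 10 disasters.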

TrafalgalSquare · Answer

Adding my two cents here because trying some of the big names was a letdown. They just hallucinated too much for my comfort... it's actually pretty dangerous when they give confident advice that might kill a plant.

To see if your agent is smart, just make a small golden set of 10 trick questions. Ask it what to do if a succulent is mushy, for example. If it doesn't say to stop watering immediately, it's a failure.

Smaller models caused me a lot of issues by making stuff up just to sound helpful. For a free setup that actually thinks, check out Qwen 2.5 7B Instruct or Mistral NeMo 12B Instruct. Qwen has been way more reliable for logic in my experience. Tiny models just aren't as good as people say... they usually just yap.
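A rough sketch of how the golden set could be turned into a single score, so you can compare two candidate models on the same questions. Treat it as an illustration under assumptions: `GOLDEN_SET`, `pass_rate`, `careful_bot` and `yappy_bot` are hypothetical names, the two bots are hard-coded stand-ins rather than real models, and only the succulent question comes from this post (you would fill in the other nine yourself).

```python
from typing import Callable, List, Tuple

# Each entry is (trick question, phrase a safe answer must contain).
# Only the first entry comes from this answer; add your own up to 10.
GOLDEN_SET: List[Tuple[str, str]] = [
    ("My succulent is mushy. What should I do?", "stop watering"),
]


def pass_rate(ask: Callable[[str], str], golden_set: List[Tuple[str, str]]) -> float:
    """Fraction of golden questions whose reply contains the required phrase."""
    passed = sum(
        1 for question, required in golden_set
        if required in ask(question).lower()
    )
    return passed / len(golden_set)


def careful_bot(question: str) -> str:
    """Stand-in for a model that answers safely (assumption, for demo only)."""
    return "Stop watering immediately and let the soil dry out."


def yappy_bot(question: str) -> str:
    """Stand-in for a model that just sounds helpful (assumption, for demo only)."""
    return "Give it some extra water and lots of love!"


if __name__ == "__main__":
    for name, bot in [("careful_bot", careful_bot), ("yappy_bot", yappy_bot)]:
        print(f"{name}: {pass_rate(bot, GOLDEN_SET):.0%} of golden questions passed")
```

Run the same set against each model you are considering and keep the one with the higher score, then spot-check its actual replies, since a keyword match can pass an answer that is still bad advice.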