How do you evaluate problem-solving skills in LLM agents?

Question

I'm trying to build a helper for my plant shop here in Seattle, but I have no clue if it's actually solving anything or just yapping. Sorry if this is a dumb question.

no budget for paid stuff

easy for a non-tech person

need it by Friday

How do I actually measure if it's smart?

fdvfxwisgw · Accepted Answer

Just saw this today. What specific tasks are you actually automating? Diagnosis, or just inventory? In my experience, you should test these two free options side-by-side:

Mistral AI Mistral 7B v0.1: in my experience it handles multi-step reasoning better, which matters for care guides.

Microsoft Phi-3 Mini 3.8B: incredibly fast for basic logic, but its answers lack depth.

Mistral is usually the more reliable of the two for actual problem-solving, tbh, but run both on your own questions before you pick.

zdqqufodst · Answer

> I'm trying to build a helper for my plant shop here in Seattle but I have no clue if it's actually solving anything or just yapping.

I had a similar scare when I first tried automating my hobby garden log. I was using Meta Llama 3 8B Instruct on my local machine and it kept telling people that yellow leaves always meant more water... which is a total death sentence for most indoor plants. You really gotta be careful with these things. The yapping is a real problem because LLMs tend to sound confident even when they are totally hallucinating.

I would suggest setting up a simple vignette test. I used to keep a note of 10 specific plant disasters, things like "my monstera has brown crispy edges" or "there are tiny webs on my ivy". If the bot doesn't mention spider mites for the webs, it fails. You basically act like the most difficult customer ever.

Check the reasoning steps too. If you ask it to solve a problem, tell it to think step by step out loud. If the logic is wonky in the middle, the final answer is usually garbage.

Make sure you test it with contradictory info as well. Like, tell it you have a low-light apartment but want a Bird of Paradise. If it says "yes, go for it", then it's just yapping to please you.

I tried Google Gemini 1.5 Flash for some quick testing since it is free, and it caught some of those logic traps better than the smaller models did. Just watch out for it giving generic advice though... it can get lazy if you don't poke it.
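If you or a friend can run a bit of Python, the vignette test above can be sketched roughly like this. Everything here is made up for illustration: `ask_bot` stands for whatever function sends a message to your model and returns its reply, `yappy_bot` is a fake bot so the script runs on its own, and the keyword lists are just my guesses at what a good answer should contain.

```python
from typing import Callable

# Each vignette: a customer message, phrases a good answer must contain,
# and phrases that count as an automatic fail. All entries are illustrative.
VIGNETTES = [
    {
        "prompt": "There are tiny webs on my ivy. What is going on?",
        "must_mention": ["spider mite"],
        "must_not_say": [],
    },
    {
        # Contradiction trap: the bot should push back, not agree.
        "prompt": "I have a low-light apartment but want a Bird of Paradise. Good idea?",
        "must_mention": ["light"],
        "must_not_say": ["go for it"],
    },
]


def grade(reply: str, case: dict) -> bool:
    """Pass only if every required phrase appears and no forbidden one does."""
    text = reply.lower()
    has_required = all(phrase in text for phrase in case["must_mention"])
    has_forbidden = any(phrase in text for phrase in case["must_not_say"])
    return has_required and not has_forbidden


def run_vignettes(ask_bot: Callable[[str], str], cases=VIGNETTES) -> float:
    """Send every vignette to the bot, print PASS/FAIL, return the pass rate."""
    passed = 0
    for case in cases:
        ok = grade(ask_bot(case["prompt"]), case)
        passed += ok
        print("PASS" if ok else "FAIL", "-", case["prompt"])
    return passed / len(cases)


def yappy_bot(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; it agrees with everything.
    return "Great choice, go for it! Just water it more."


if __name__ == "__main__":
    print("pass rate:", run_vignettes(yappy_bot))
```

Keyword matching is crude, so treat a PASS as "worth reading" rather than "correct", and still skim the replies yourself for wonky reasoning.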

efuvtixwmh · Answer

👆 this