I'm trying to build a helper for my plant shop here in Seattle, but I have no clue if it's actually solving anything or just yapping. Sorry if this is a dumb question.
How do I actually measure if it's smart?
Just saw this today. What specific tasks are you actually automating? Diagnosis, or just inventory? In my experience, you should test these two free options side-by-side:
> I'm trying to build a helper for my plant shop here in Seattle, but I have no clue if it's actually solving anything or just yapping.

I had a similar scare when I first tried automating my hobby garden log. I was using Meta Llama 3 8B Instruct on my local machine and it kept telling people that yellow leaves always meant more water... which is a total death sentence for most indoor plants. You really gotta be careful with these things. The yapping is a real problem because LLMs are basically designed to sound confident even when they are totally hallucinating.

I would suggest setting up a simple vignette test. I used to keep a note of 10 specific plant disasters - things like "my monstera has brown crispy edges" or "there are tiny webs on my ivy". If the bot doesn't mention spider mites for the webs, it fails. You basically act like the most difficult customer ever.

Check the reasoning steps too. If you ask it to solve a problem, tell it to think step by step out loud. If the logic is wonky in the middle, the final answer is usually garbage.

Make sure you test it with contradictory info as well. Like, tell it you have a low-light apartment but want a Bird of Paradise. If it says "yes, go for it", then it's just yapping to please you.

I tried Google Gemini 1.5 Flash for some quick testing since it is free, and it caught some of those logic traps better than smaller models. Just watch out for it giving generic advice though... it can get lazy if you don't poke it.
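If it helps, here is a rough Python sketch of that vignette test. Everything in it is made up for illustration: `Vignette`, `passes`, `run_vignettes`, and the `yes_bot` stand-in are hypothetical names, and `ask_bot` is just whatever function sends a prompt to your own helper and returns its text. It only does plain substring matching, so treat it as a first filter and not a real grader.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Vignette:
    prompt: str
    # At least one of these must show up in the reply (case-insensitive).
    must_mention: List[str] = field(default_factory=list)
    # Any of these in the reply is an automatic fail (people-pleasing answers).
    must_not_say: List[str] = field(default_factory=list)


# Two examples taken from the post above; extend to your own 10 disasters.
VIGNETTES = [
    Vignette(
        "There are tiny webs on my ivy. What is going on?",
        must_mention=["spider mite"],
    ),
    Vignette(
        "I have a low-light apartment but I want a Bird of Paradise. Good idea?",
        must_not_say=["go for it"],
    ),
]


def passes(reply: str, v: Vignette) -> bool:
    """Return True if the reply satisfies the vignette's keyword rules."""
    text = reply.lower()
    if any(bad.lower() in text for bad in v.must_not_say):
        return False
    if v.must_mention and not any(good.lower() in text for good in v.must_mention):
        return False
    return True


def run_vignettes(
    ask_bot: Callable[[str], str],
    vignettes: List[Vignette] = VIGNETTES,
) -> List[Tuple[str, bool]]:
    """Ask every vignette and pair each prompt with its pass/fail result."""
    return [(v.prompt, passes(ask_bot(v.prompt), v)) for v in vignettes]


def yes_bot(prompt: str) -> str:
    """Hypothetical stand-in for a bot that agrees with everything."""
    return "Yes, go for it! Just water it more."


if __name__ == "__main__":
    for prompt, ok in run_vignettes(yes_bot):
        print("PASS" if ok else "FAIL", "-", prompt)
```

Swap `yes_bot` for a function that calls your actual model. A reply can pass the keyword check and still be bad advice, so it is worth reading the failures and a few of the passes by hand.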
Adding my two cents here because trying some of the big names was a letdown. They just hallucinated too much for my comfort... it's actually pretty dangerous when they give confident advice that might kill a plant. To see if your agent is smart, just make a small golden set of 10 trick questions. Ask it what to do if a succulent is mushy, for example. If it doesn't say "stop watering immediately", it's a failure. Smaller models caused me a lot of issues by making stuff up just to sound helpful. For a free setup that actually thinks, check out Qwen 2.5 7B Instruct or Mistral NeMo 12B Instruct. Qwen has been way more reliable for logic in my experience. Tiny models just aren't as good as people say... they usually just yap.
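One way to run a golden set against two candidates side by side is sketched below in Python. This is only an assumption about how you might wire it up: `GOLDEN_SET`, `pass_rate`, `compare`, and the two stub functions are hypothetical, and the stubs return canned text instead of calling a real model. You would replace them with functions that query whichever models you are testing.

```python
from typing import Callable, Dict, List, Tuple

# Each entry: (trick question, phrases where at least one must appear in the reply).
# Only the succulent example comes from the post; add your own up to 10 or so.
GOLDEN_SET: List[Tuple[str, List[str]]] = [
    ("My succulent is mushy. What should I do?", ["stop watering"]),
]


def pass_rate(
    ask: Callable[[str], str],
    golden: List[Tuple[str, List[str]]] = GOLDEN_SET,
) -> float:
    """Fraction of golden questions whose reply contains a required phrase."""
    hits = 0
    for question, required in golden:
        reply = ask(question).lower()
        if any(phrase.lower() in reply for phrase in required):
            hits += 1
    return hits / len(golden)


def compare(models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Score every model on the same golden set so results are comparable."""
    return {name: pass_rate(ask) for name, ask in models.items()}


def careful_stub(question: str) -> str:
    """Fake model reply, for demonstration only."""
    return "Stop watering immediately and check the roots."


def yappy_stub(question: str) -> str:
    """Fake model reply that sounds friendly but dodges the question."""
    return "Succulents are wonderful! Try giving it more love."


if __name__ == "__main__":
    scores = compare({"careful_stub": careful_stub, "yappy_stub": yappy_stub})
    for name, rate in scores.items():
        print(f"{name}: {rate:.0%}")
```

With only 10 questions the pass rate is a rough signal, so a one-question gap between two models probably does not mean much.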
👆 this