How do I know if a change in my AI app improves or hurts quality?
Without an eval suite, you change a prompt, it 'feels' better, and weeks later you discover that 30 percent of edge cases now fail. An eval suite is a list of input-output pairs you run on every change, like unit tests for LLM output. Start small: twenty to fifty cases already make a world of difference.
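As a sketch, the suite itself can start as nothing more than a list of input-expected pairs (the questions and answers below are invented examples, not real data):

```python
# A minimal eval suite: input-output pairs, like unit tests for LLM output.
# The questions and expected answers are invented placeholders.
EVAL_CASES = [
    {"input": "What is our refund window?", "expected": "30 days"},
    {"input": "Do you ship to Belgium?", "expected": "Yes"},
    # Add a new case every time a production failure is fixed.
]
```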
Try this first
1. Collect real questions and the correct answer (or acceptable variants). Start with the top live questions plus ten edge cases that historically failed.
2. Pick an eval tool: Promptfoo, Braintrust, LangSmith, or DIY with a list and assertions. For an SMB, Promptfoo has the lowest barrier to entry.
3. Define scoring: exact match works for lists or structured output; 'LLM-as-judge' (a stronger model deciding correctness) works for open answers. Combine both where possible.
4. Run the suite on every prompt change and every model swap. Pin the baseline in commit history so regressions show up.
5. Add a case whenever you see a production failure. A bug you fix today must never come back without the eval suite catching it.
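The DIY route from the steps above can be sketched in a few lines. This is a minimal example, not a finished harness: `call_model` is a stand-in stub you would replace with your actual model call, and the single test case is invented.

```python
# Minimal DIY eval runner: exact-match scoring for structured answers.
# For open-ended answers you would swap exact_match for an LLM-as-judge call.

def call_model(question: str) -> str:
    # Stub: replace with a real API call to your deployed prompt/model.
    canned = {"What is our refund window?": "30 days"}
    return canned.get(question, "")

CASES = [
    {"input": "What is our refund window?", "expected": "30 days"},
]

def exact_match(got: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing.
    return got.strip().lower() == expected.strip().lower()

def run_suite(cases):
    failures = []
    for case in cases:
        got = call_model(case["input"])
        if not exact_match(got, case["expected"]):
            failures.append((case["input"], got))
    return failures

if __name__ == "__main__":
    failed = run_suite(CASES)
    print(f"{len(CASES) - len(failed)}/{len(CASES)} passed")
```

Run it on every prompt change; a non-empty failure list is your regression signal, and committing the case file keeps the baseline pinned in history.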
When to bring us in
Want us to set up the first eval set and pipeline with your own prompts and known failure cases? We can have it running in a day.
See also
- Can I paste a customer file or email into ChatGPT? Depends on the account and settings. Free ChatGPT and a Team tenant behave very differently from what most people assume.
- I want a one-page AI policy for my team. A real one-pager beats a thick document nobody reads. Four headers and concrete examples.
- How do I tell if an AI answer is made up? Models sound confident even when they are wrong. A few habits catch most mistakes.
None of the above fits?
Describe your situation below. We pass your input plus the steps you already saw to our AI and return tailored next-step advice. If it's too risky to DIY, we'll say so.
Or skip the DIY entirely
Our Managed IT clients do not look these things up. One point of contact, a fixed monthly price, and issues resolved within working hours.