How do I know if a change in my AI app improves or hurts quality?

Without an eval suite, you change a prompt, it 'feels' better, and weeks later you discover that 30 percent of your edge cases now fail. An eval suite is a list of input-output pairs you run on every change, like unit tests for LLM output. Start small: twenty to fifty cases already make a world of difference.
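In its simplest form that is just a list of cases and a loop. A minimal sketch in Python, assuming a hypothetical ask_model function that wraps your LLM call; the substring check is a placeholder for the real scoring discussed below:

    # A list of input-output pairs, replayed on every change.
    CASES = [
        {"input": "What is your return window?", "expected": "30 days"},
        {"input": "Do you ship to Belgium?", "expected": "yes"},
        # Grow this to twenty to fifty cases: top live questions plus past failures.
    ]

    def run_suite(ask_model):
        # ask_model is whatever function calls your LLM and returns a string.
        failures = []
        for case in CASES:
            answer = ask_model(case["input"])
            if case["expected"].lower() not in answer.lower():
                failures.append((case["input"], answer))
        return failures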

Try this first

  1. Collect real questions and their correct answers (or acceptable variants). Start with the top live questions plus ten edge cases that historically failed.
  2. Pick an eval tool: Promptfoo, Braintrust, LangSmith, or DIY with a list and assertions. For SMBs, Promptfoo has the lowest barrier to entry.
  3. Define scoring: exact match works for lists or structured output; 'LLM-as-judge' (a stronger model deciding correctness) works for open answers. Combine where possible, as sketched after this list.
  4. Run the suite on every prompt change and every model swap. Pin the baseline in commit history so regressions show (see the CI sketch below).
  5. Add a case whenever you see a production failure. A bug you fix today must never come back without the eval suite catching it.
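To make step 3 concrete, here is a sketch of combined scoring in Python. It assumes the OpenAI Python SDK as the judge backend and a hypothetical "structured" flag on each case; any stronger model you trust can play the judge.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def exact_match(expected: str, answer: str) -> bool:
        # Deterministic check: right for lists, IDs, and structured output.
        return expected.strip().lower() == answer.strip().lower()

    def llm_judge(question: str, expected: str, answer: str) -> bool:
        # LLM-as-judge: a stronger model decides whether an open answer is correct.
        verdict = client.chat.completions.create(
            model="gpt-4o",  # assumption: use the strongest model available to you
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Reference answer: {expected}\n"
                    f"Candidate answer: {answer}\n"
                    "Does the candidate match the reference in substance? "
                    "Reply with exactly PASS or FAIL."
                ),
            }],
        )
        return verdict.choices[0].message.content.strip().upper().startswith("PASS")

    def score(case: dict, answer: str) -> bool:
        # Combine where possible: cheap exact match first, judge as fallback.
        if exact_match(case["expected"], answer):
            return True
        if case.get("structured"):  # assumption: structured cases never go to the judge
            return False
        return llm_judge(case["input"], case["expected"], answer)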
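For steps 4 and 5, the cheapest way to run the suite on every change is to make it a test file your CI already executes. A sketch with pytest, assuming the pieces above live in a module called eval_suite (the module name is an assumption):

    import pytest

    from eval_suite import CASES, ask_model, score  # the sketches above

    @pytest.mark.parametrize("case", CASES, ids=lambda c: c["input"][:40])
    def test_eval_case(case):
        # Runs on every prompt change and model swap; a failing case is a regression.
        answer = ask_model(case["input"])
        assert score(case, answer), f"Regression on {case['input']!r}: got {answer!r}"

Commit the cases file next to your prompts so the baseline is pinned in history, and add a new case for every production failure before you fix it.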

When to bring us in

If you want us to set up the first eval set and pipeline with your own prompts and known failure cases, we can have it running in a day.

None of the above fits?

Describe your situation below. We pass your input plus the steps you already saw to our AI and return tailored next-step advice. If it's too risky to DIY, we'll say so.

Who are you?

For the AI question we need your email and company so we can follow up if the AI gets stuck, and to prevent abuse.

Limited to 2 questions per hour and 5 per day; we keep it lean so the AI stays useful. For more, contacting us directly works better for you and us.

Or skip the DIY entirely

Our Managed IT clients do not look these things up. One point of contact, a fixed monthly price, and issues resolved within working hours.