assert response == expected — doesn’t work with LLMs. The answer is different every time, phrasing varies, but the meaning should stay the same. Classic unit tests fail on non-deterministic software. We need a new testing paradigm that validates properties and output quality instead of exact matches. This is a fundamental shift in testing approach that demands new tools, metrics, and processes.
New Approaches¶
Property-based testing: Test properties, not exact output — the response must contain key facts, must not hallucinate, must be in the required language and format. Metamorphic testing: A small change in input (rephrasing a question) must not change the facts in the response. LLM-as-judge: GPT-4 or Claude evaluates responses based on a rubric — assessing relevance, accuracy, completeness, and toxicity. An automated evaluator replaces human assessment for most iterations.
Evaluation Pipeline¶
- Golden dataset: 100+ question/answer pairs covering key scenarios and edge cases
- Automatic run: Evaluation on every PR or nightly build, results in CI dashboard
- Metrics: faithfulness (matches sources), relevance (answers the question), toxicity (safety)
- Regression detection: Alert on score drops greater than 5% — prevents silent degradation
Integrate the pipeline into CI/CD — a merge request with a new prompt or configuration goes through evaluation just like code goes through tests. Ragas, DeepEval, and TruLens are open-source frameworks for automated evaluation.
Red Teaming¶
Automated adversarial testing reveals vulnerabilities: prompt injection (attacker manipulates system prompt), jailbreak (bypassing safety restrictions), PII leakage (model reveals personal data from training data). Run in CI regularly, not as a one-off — new model versions can introduce new vulnerabilities.
AI Testing Is Software Testing 2.0¶
Property-based tests + LLM-as-judge + automated evaluation pipeline = production-ready AI system. Investment in testing infrastructure pays off in AI feature quality and reliability.
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us