AI Testing — How to Test Non-Deterministic Software

assert response == expected — doesn’t work with LLMs. The answer is different every time, phrasing varies, but the meaning should stay the same. Classic unit tests fail on non-deterministic software. We need a new testing paradigm that validates properties and output quality instead of exact matches. This is a fundamental shift in testing approach that demands new tools, metrics, and processes.

New Approaches¶

Property-based testing: Test properties, not exact output — the response must contain key facts, must not hallucinate, must be in the required language and format. Metamorphic testing: A small change in input (rephrasing a question) must not change the facts in the response. LLM-as-judge: GPT-4 or Claude evaluates responses based on a rubric — assessing relevance, accuracy, completeness, and toxicity. An automated evaluator replaces human assessment for most iterations.

Evaluation Pipeline¶

Golden dataset: 100+ question/answer pairs covering key scenarios and edge cases
Automatic run: Evaluation on every PR or nightly build, results in CI dashboard
Metrics: faithfulness (matches sources), relevance (answers the question), toxicity (safety)
Regression detection: Alert on score drops greater than 5% — prevents silent degradation

Integrate the pipeline into CI/CD — a merge request with a new prompt or configuration goes through evaluation just like code goes through tests. Ragas, DeepEval, and TruLens are open-source frameworks for automated evaluation.

Red Teaming¶

Automated adversarial testing reveals vulnerabilities: prompt injection (attacker manipulates system prompt), jailbreak (bypassing safety restrictions), PII leakage (model reveals personal data from training data). Run in CI regularly, not as a one-off — new model versions can introduce new vulnerabilities.

AI Testing Is Software Testing 2.0¶

Property-based tests + LLM-as-judge + automated evaluation pipeline = production-ready AI system. Investment in testing infrastructure pays off in AI feature quality and reliability.

ai testingqualitytestingautomation

CORE SYSTEMS

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.

Need help with implementation?

Our experts can help with design, implementation, and operations. From architecture to production.

Need help with implementation? Schedule a meeting

AI Testing — How to Test Non-Deterministic Software

New Approaches¶

Evaluation Pipeline¶

Red Teaming¶

AI Testing Is Software Testing 2.0¶

CORE SYSTEMS

Need help with implementation?

Related articles

LLM Evaluation — How to Measure the Quality of Text-Generating AI

AI Test Generation — From Unit Tests to E2E Automation

Data Governance — Managing Data Assets in Organization

Great Expectations — Automated Data Quality Validation