Skip to content
_CORE
AI & Agentic Systems Core Information Systems Cloud & Platform Engineering Data Platform & Integration Security & Compliance QA, Testing & Observability IoT, Automation & Robotics Mobile & Digital Banking & Finance Insurance Public Administration Defense & Security Healthcare Energy & Utilities Telco & Media Manufacturing Logistics & E-commerce Retail & Loyalty
References Technologies Blog Know-how Tools
About Collaboration Careers
CS EN DE
Let's talk

AI Testing — How to Test Non-Deterministic Software

02. 04. 2025 Updated: 27. 03. 2026 1 min read CORE SYSTEMSai
AI Testing — How to Test Non-Deterministic Software

assert response == expected — doesn’t work with LLMs. The answer is different every time, phrasing varies, but the meaning should stay the same. Classic unit tests fail on non-deterministic software. We need a new testing paradigm that validates properties and output quality instead of exact matches. This is a fundamental shift in testing approach that demands new tools, metrics, and processes.

New Approaches

Property-based testing: Test properties, not exact output — the response must contain key facts, must not hallucinate, must be in the required language and format. Metamorphic testing: A small change in input (rephrasing a question) must not change the facts in the response. LLM-as-judge: GPT-4 or Claude evaluates responses based on a rubric — assessing relevance, accuracy, completeness, and toxicity. An automated evaluator replaces human assessment for most iterations.

Evaluation Pipeline

  • Golden dataset: 100+ question/answer pairs covering key scenarios and edge cases
  • Automatic run: Evaluation on every PR or nightly build, results in CI dashboard
  • Metrics: faithfulness (matches sources), relevance (answers the question), toxicity (safety)
  • Regression detection: Alert on score drops greater than 5% — prevents silent degradation

Integrate the pipeline into CI/CD — a merge request with a new prompt or configuration goes through evaluation just like code goes through tests. Ragas, DeepEval, and TruLens are open-source frameworks for automated evaluation.

Red Teaming

Automated adversarial testing reveals vulnerabilities: prompt injection (attacker manipulates system prompt), jailbreak (bypassing safety restrictions), PII leakage (model reveals personal data from training data). Run in CI regularly, not as a one-off — new model versions can introduce new vulnerabilities.

AI Testing Is Software Testing 2.0

Property-based tests + LLM-as-judge + automated evaluation pipeline = production-ready AI system. Investment in testing infrastructure pays off in AI feature quality and reliability.

ai testingqualitytestingautomation
Share:

CORE SYSTEMS

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.

Need help with implementation?

Our experts can help with design, implementation, and operations. From architecture to production.

Contact us
Need help with implementation? Schedule a meeting