If you've worked in software testing, you're used to crisp boundaries: you give an input, you expect a deterministic output, and you can assert that expected === actual. But what happens when the system under test isn't a deterministic program at all, but another AI model?
Welcome to agentic testing, where one AI acts as a tester, prompting and validating the responses of another AI. The challenge is dealing with non-determinism, subjective correctness, and that inception-like feeling of agents testing agents.
The Problem: Determinism Meets Probabilism
Traditional QA works because software is deterministic. A login function that checks username === "admin" will either pass or fail. No ambiguity. No uncertainty.
Ask an AI a simple question like "What do you get when you multiply 6 x 7?" and you might get "42", "6 x 7 = 42", or "the answer is forty-two." All are correct, none match exactly, and strict string assertions fail. Outputs are probabilistic, so the tester has to decide what "correct enough" means.
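To make that concrete, here's a minimal TypeScript sketch (the helper is illustrative, not from any particular framework): strict equality accepts only one of the three valid phrasings, while a normalizing check accepts all of them.

```ts
// Three valid answers to "What do you get when you multiply 6 x 7?"
const responses = ["42", "6 x 7 = 42", "the answer is forty-two"];

// Strict string assertion: only the first response passes.
console.log(responses.map((r) => r === "42")); // [ true, false, false ]

// Tolerant check: normalize whitespace and hyphens, then accept the
// expected value either as digits or spelled out. The "forty two"
// fallback is hard-coded for this one case; a real harness would use
// a number-to-words step instead.
function containsAnswer(response: string, expected: number): boolean {
  const normalized = response.toLowerCase().replace(/[-\s]+/g, " ");
  return new RegExp(`\\b${expected}\\b`).test(normalized) ||
    normalized.includes("forty two");
}

console.log(responses.map((r) => containsAnswer(r, 42))); // [ true, true, true ]
```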
The Core Challenge of Agentic Testing
- Prompting correctly: the tester must craft instructions so the AI under test responds in a shape that can be validated (see the sketch after this list).
- Handling non-determinism: use fuzzy logic, normalization, or semantic matching rather than exact string equality.
- Validating without hallucinating: the tester AI can be wrong too, so anchor results with ground truth to avoid false positives and negatives.
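For the first point, a lot of the leverage is in the prompt itself: constrain the shape of the answer before you ever try to validate it. A sketch, where `askModel` is a hypothetical stand-in for whatever model client you use:

```ts
// Hypothetical client; swap in your model SDK of choice.
declare function askModel(prompt: string): Promise<string>;

// The tester agent wraps the real question in formatting instructions
// so the reply arrives in a shape it already knows how to check.
async function askConstrained(question: string): Promise<string> {
  const prompt = [
    "Answer the question below.",
    "Respond with ONLY the final answer as digits. No words, no punctuation.",
    `Question: ${question}`,
  ].join("\n");
  return askModel(prompt);
}

// askConstrained("What is 6 x 7?") should come back as "42",
// which a plain numeric parse can validate deterministically.
```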
A Simple Example: Math Chatbot
Imagine testing a math-focused chatbot. The tester agent asks "What is 6 x 7?", the AI under test responds "six times seven equals forty-two", and the tester normalizes the response into a number before comparing against ground truth. Even trivial tests require interpretation layers.
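Here is what that interpretation layer might look like, assuming a tiny hand-rolled word-to-number table (a real harness would reach for a library such as words-to-numbers, or sidestep the problem with the structured-output technique below):

```ts
// The handful of number words this toy test needs.
const WORDS: Record<string, number> = {
  zero: 0, one: 1, two: 2, three: 3, four: 4, five: 5,
  six: 6, seven: 7, eight: 8, nine: 9,
  twenty: 20, thirty: 30, forty: 40, fifty: 50,
};

// Extract a numeric answer from free text. Digits win; otherwise we
// combine the last contiguous run of number words, so that in
// "six times seven equals forty-two" the run "forty two" yields 42
// rather than the operands 6 or 7.
function extractNumber(text: string): number | null {
  const digits = text.match(/-?\d+(\.\d+)?/g);
  if (digits) return Number(digits[digits.length - 1]);

  const words = text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  let run: number | null = null;
  let last: number | null = null;
  for (const w of words) {
    if (w in WORDS) {
      run = (run ?? 0) + WORDS[w];
      last = run;
    } else {
      run = null;
    }
  }
  return last;
}

// The tester's check: normalize first, then compare to ground truth.
console.assert(
  extractNumber("six times seven equals forty-two") === 42,
  "math chatbot failed 6 x 7",
);
```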
The Inception Problem
If both the tester and the subject are AIs, you can quickly end up in a loop of misunderstandings. One misreading introduces false negatives; one overly charitable judgment introduces false positives. You cannot blindly trust either side, which is why agentic testing needs anchors to reality.
Techniques for Reliable Agentic Testing
- Use structured outputs: ask the AI under test to respond in JSON to reduce ambiguity (first sketch below).
- Semantic similarity: when free text is unavoidable, use embeddings or similarity scoring so "the answer is forty-two" counts the same as "42" (second sketch below).
- External oracles: ground truth via deterministic tools (calculators, date libraries, reference datasets).
- Multiple verifiers: have more than one agent verify the same answer to reduce single-agent error risk; the third sketch below pairs this with an oracle.
- Tolerant assertions: replace strict equality with looser checks like regex matches or tolerance ranges.
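For structured outputs, here's a sketch of the validating side, assuming the model mostly honors the format instruction (it occasionally won't, which is why the parse is defensive):

```ts
// Instruction the tester prepends to every math question.
const FORMAT_INSTRUCTION =
  'Respond with JSON only, in the shape {"answer": <number>}.';

interface MathAnswer {
  answer: number;
}

// Parse defensively: pull out the first {...} block in case the model
// wrapped the JSON in commentary, then type-check the field.
function parseMathAnswer(raw: string): MathAnswer | null {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    return typeof parsed.answer === "number" ? { answer: parsed.answer } : null;
  } catch {
    return null;
  }
}

// Once the reply is structured, ordinary deterministic assertions work.
const parsed = parseMathAnswer('Sure! {"answer": 42}');
console.assert(parsed !== null && parsed.answer === 42);
```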
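For semantic similarity, the usual move is to embed both the reply and a reference answer and compare with cosine similarity. Here `embed` is a placeholder for whatever embedding endpoint you use, and the 0.85 threshold is a starting point to tune, not a standard:

```ts
// Placeholder: point this at your embedding provider.
declare function embed(text: string): Promise<number[]>;

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// "the answer is forty-two" and "42" should land close together in
// embedding space even though they share almost no characters.
async function semanticallyEquivalent(
  actual: string,
  expected: string,
  threshold = 0.85, // tune per model and domain
): Promise<boolean> {
  const [va, vb] = await Promise.all([embed(actual), embed(expected)]);
  return cosine(va, vb) >= threshold;
}
```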
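External oracles and multiple verifiers compose nicely: a deterministic oracle supplies ground truth, and several verifier agents vote on whether the free-text reply expresses it. `verify` below is a stand-in for a judging agent; the majority vote is the point of the sketch:

```ts
// Deterministic oracle: for arithmetic, plain code IS the ground truth.
function oracle(a: number, b: number): number {
  return a * b;
}

// Stand-in for a verifier agent that judges whether a free-text reply
// expresses the expected value. Each call could be a different model,
// or a differently prompted instance of the same one.
declare function verify(reply: string, expected: number): Promise<boolean>;

// Majority vote across verifiers, so one judge's misreading can't
// single-handedly sink (or rescue) the test.
async function consensus(
  reply: string,
  expected: number,
  verifiers = 3,
): Promise<boolean> {
  const votes = await Promise.all(
    Array.from({ length: verifiers }, () => verify(reply, expected)),
  );
  return votes.filter(Boolean).length > verifiers / 2;
}

// Usage: consensus("six times seven equals forty-two", oracle(6, 7));
```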
Why This Matters
AI-powered apps are everywhere, but traditional QA struggles with probabilistic outputs. Agentic testing scales validation across prompts, scenarios, and workflows while handling messy outputs and balancing AI judgment with external truth.
Future Directions: Layers Within Layers
The frontier is multi-step workflows: an agent testing an agent performing tasks powered by other services. Each layer adds noise and ambiguity, but also an opportunity to scale testing far beyond what humans alone can cover.
Closing Thoughts
Testing AI with AI is like navigating dreams within dreams. Anchor to reality with structured outputs, ground truth oracles, and multi-agent consensus, and agentic testing becomes the compass for the layered landscape of AI-powered systems.