AI Agents Testing Engineering

May 23, 2026

Setting up a validation and testing harness for agent-led software development

Agent-led development moves fast — but speed without a harness is how you ship confident demos that fail under real load, real data, and real adversaries. Here's what to test beyond "it feels like it works."

Teams shipping with coding agents often celebrate the same milestone: the demo runs, the UI looks right, and nobody filed a bug in the first hour. That moment is useful — and dangerously incomplete.

Agent-led software is non-deterministic, tool-heavy, and often distributed by default (model APIs, vector stores, queues, human-in-the-loop steps). A validation harness for this world has to answer a different question than traditional CI:

Not just "did the code compile?" but "did the system behave acceptably across cost, latency, safety, and failure — every time we ship?"

Why vibe coding and "it works" feel tests are not enough

Vibe coding is the practice of iterating until the output feels correct: you prompt, skim, tweak, and ship when the agent's answer matches your intuition. "It works" feel tests are the manual version — click through once, see a plausible result, merge.

Both fail for predictable reasons in agent systems:

Non-repeatability: The same prompt can produce different tool calls, different token usage, and different answers on the next run. A green manual pass is not a regression signal.
Happy-path blindness: Agents excel at plausible narratives. They can "sound right" while calling the wrong API, skipping a policy check, or hallucinating a field that doesn't exist in your schema.
Hidden coupling: A demo that works with one user, one document, and one model version may break when context length grows, retrieval returns no chunks, or rate limits throttle tool calls.
No failure budget: Feel tests rarely exercise timeouts, partial tool failures, retries, or human escalation — yet production hits all of these weekly.
Cost blindness: What feels fast in a dev session may be economically impossible at 10× traffic because nobody measured tokens per successful task.
Security theater: "I tried prompt injection and it didn't work" is not a test suite. Attackers don't stop at your first three examples.

The goal of a harness is not to eliminate agents' variability — it's to bound it: define acceptable envelopes for quality, cost, latency, and risk, and fail CI when you drift outside them.

What a modern agent testing harness includes

Think in layers. Unit and end-to-end tests still matter, but they're insufficient alone.

Layer	What you're validating	Example signals
Correctness	Outputs, tool traces, structured data	Golden tasks, schema validation, rubric scores
Behavioral / eval	Multi-step reasoning and policy adherence	LLM-as-judge (with human calibration), trajectory checks
Performance	Latency and throughput	p50/p95 time-to-first-token, end-to-end task time
Cost / token budget	Economic viability per outcome	Tokens per successful task, tool-call count caps
Load & capacity	Behavior under concurrency	Queue depth, worker saturation, provider rate limits
Security	Injection, exfiltration, privilege abuse	Red-team suites, tool allowlists, data boundary tests
Chaos & resilience	Degraded dependencies	Killed workers, slow RAG, model 429s, stale caches

1. Correctness beyond unit tests

Unit tests should cover deterministic pieces: parsers, policy functions, idempotency keys, state machines. For the agent itself, add task-level golden sets: fixed inputs with expected properties (not always exact strings).

Validate structured outputs against JSON Schema or protobuf contracts.
Assert on tool trajectories: "must call search before update_ticket," "must never call delete_* without approval."
Use reference answers for high-stakes domains, with tolerances for phrasing but hard constraints on facts and actions.
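A minimal sketch of the two trajectory rules quoted above, assuming the harness records each run as an ordered list of tool names. The function name check_trajectory and the approved flag are hypothetical and not part of any framework.

```python
from fnmatch import fnmatchcase


def check_trajectory(calls, approved=False):
    """Return a list of rule violations for one recorded run.

    `calls` is the ordered list of tool names the agent invoked (assumed format).
    """
    violations = []

    # Rule 1: update_ticket must be preceded by at least one search call.
    if "update_ticket" in calls:
        first_update = calls.index("update_ticket")
        if "search" not in calls[:first_update]:
            violations.append("update_ticket called before search")

    # Rule 2: any delete_* tool requires explicit approval.
    if not approved:
        for name in calls:
            if fnmatchcase(name, "delete_*"):
                violations.append(f"{name} called without approval")

    return violations
```

A golden task then asserts that check_trajectory(recorded_calls) returns an empty list, which keeps the check independent of the exact wording of the agent's answer.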

2. End-to-end tests — with statistical discipline

E2E tests in agent systems should run batches, not single shots. Run-to-run variation is expected, so a single pass or fail tells you little; track pass rate, median quality score, and worst-case failures over N runs. Gate releases on trends, not on one lucky run.
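One way to make "not one lucky run" concrete is to gate on a conservative estimate of the pass rate rather than the raw ratio. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the approach above prescribes; the function names and the 95% level are assumptions.

```python
import math


def wilson_lower_bound(passes, runs, z=1.96):
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is ~95%)."""
    if runs == 0:
        return 0.0
    p = passes / runs
    denom = 1 + z * z / runs
    centre = p + z * z / (2 * runs)
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / denom


def gate(passes, runs, min_pass_rate):
    """Pass the gate only if the conservative estimate clears the floor."""
    return wilson_lower_bound(passes, runs) >= min_pass_rate
```

With this bound, 48 passes out of 50 yields about 0.87, while a perfect 5 out of 5 yields only about 0.57, so a small batch cannot clear a 0.9 floor no matter how clean it looks.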

3. Performance testing

Measure latency at the user task boundary, not just the model call:

Time to first useful token (streaming UX)
End-to-end completion time including retrieval and tool round-trips
Tail latency (p95/p99) — agents with multiple tool hops have heavy tails

Set budgets per workflow class. A research agent can be slow; a checkout support agent cannot.
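A small sketch of per-workflow latency budgets, assuming you already collect end-to-end task durations in milliseconds. The workflow names and budget numbers are illustrative placeholders, not recommendations.

```python
from statistics import quantiles

# Illustrative budgets in milliseconds (assumed values).
BUDGETS_MS = {
    "checkout_support": {"p50": 2000, "p95": 6000},
    "research": {"p50": 30000, "p95": 120000},
}


def latency_report(samples_ms):
    """Compute p50 and p95 from end-to-end task durations."""
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def over_budget(workflow, samples_ms):
    """Return the percentile names that exceed the workflow's budget."""
    report = latency_report(samples_ms)
    budget = BUDGETS_MS[workflow]
    return [name for name in budget if report[name] > budget[name]]
```

A batch where 10% of tasks take 9 seconds passes the median check for checkout_support but fails on p95, which is exactly the heavy-tail case that averages hide.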

4. Token bloat and cost regression tests

Token usage is a regression surface like CPU or memory. Track per task:

Input + output tokens (and cached-token hits if your provider supports them)
Number of model round-trips and tool calls
Retrieved context size (chunks × average tokens)

Fail CI when a change increases median tokens per successful outcome beyond a threshold — otherwise every "small prompt tweak" silently doubles your COGS.
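As a rough sketch, the regression check can be a comparison of medians between a stored baseline batch and the candidate batch. The record fields (input_tokens, output_tokens, success) and the 10% threshold are assumptions for illustration.

```python
from statistics import median


def tokens_per_success(runs):
    """Median total tokens across successful runs only."""
    totals = [
        r["input_tokens"] + r["output_tokens"] for r in runs if r["success"]
    ]
    if not totals:
        raise ValueError("no successful runs to measure")
    return median(totals)


def token_regression(baseline_runs, candidate_runs, max_increase=0.10):
    """True when the candidate's median grows by more than max_increase."""
    base = tokens_per_success(baseline_runs)
    cand = tokens_per_success(candidate_runs)
    return (cand - base) / base > max_increase
```

In CI, a True result fails the build; the baseline batch is refreshed only when a token increase is reviewed and accepted on purpose.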

5. Load and capacity testing

Load tests answer: what breaks first? Often it's not your app server — it's the embedding API, the vector DB, or provider rate limits.

Simulate concurrent sessions with realistic think-time and tool patterns.
Stress the orchestration layer (queues, workers, workflow engines).
Validate backpressure: shedding load gracefully beats unbounded retries that amplify outages.
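To illustrate the backpressure point, here is a toy admission controller that sheds work once a fixed number of tasks are in flight instead of queueing without bound. The class name, the response shape, and the retry_after_s value are hypothetical.

```python
import threading


class AdmissionController:
    """Admit at most max_in_flight tasks; reject the rest immediately."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self):
        # Non-blocking acquire: returns False instead of waiting for a slot.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle(controller, task):
    """Run task if a slot is free, otherwise shed it with a retry hint."""
    if not controller.try_admit():
        return {"status": "shed", "retry_after_s": 5}
    try:
        return {"status": "ok", "result": task()}
    finally:
        controller.release()
```

A load test can then assert that, at saturation, extra requests come back as "shed" quickly rather than piling up behind retries.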

6. Security testing

Agent security testing should be continuous, not a one-off pen test:

Prompt injection via untrusted documents, emails, and web pages the agent reads
Tool abuse: can indirect instructions trigger privileged actions?
Data exfiltration: can the agent be steered to send internal context to external endpoints?
Supply chain: pinned tool schemas, verified MCP servers, secrets scoping per environment

Maintain an evolving attack corpus, because new jailbreak and injection patterns keep appearing and last quarter's suite will not cover them.
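One hedged way to turn an attack corpus into assertions: plant a canary string in the internal context, run each attack document through the agent, and audit the recorded tool calls. The record format ({"tool": ..., "args": {...}}), the canary value, and the allowlisted host are all assumptions for this sketch.

```python
import json
from urllib.parse import urlparse

CANARY = "CANARY-7f3a91"  # planted in internal context before the run
ALLOWED_HOSTS = {"tickets.internal.example"}  # hypothetical allowlist


def audit_tool_calls(tool_calls):
    """Return findings for egress outside the allowlist or a leaked canary."""
    findings = []
    for call in tool_calls:
        args = call.get("args", {})
        url = args.get("url")
        if url is not None and urlparse(url).hostname not in ALLOWED_HOSTS:
            findings.append(f"{call['tool']}: egress to non-allowlisted host")
        # Conservative: flag the canary anywhere in the arguments.
        if CANARY in json.dumps(args):
            findings.append(f"{call['tool']}: canary leaked into tool arguments")
    return findings
```

Each corpus entry passes only when audit_tool_calls returns an empty list, so adding a new injection pattern is just adding a new document to the suite.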

7. Chaos engineering for agent pipelines

Chaos tests prove your system degrades safely:

Model API returns 429/503: do you retry with backoff and jitter, then escalate to a human once retries are exhausted?
RAG returns empty — does the agent admit uncertainty or invent citations?
Tool timeout mid-workflow — is state recoverable? Are partial side effects rolled back?
Worker crash — does another worker resume from durable checkpoint?

The pass criterion is not "no errors"; it is controlled failure modes with audit logs and user-safe messaging.
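For the provider-failure case, a chaos test needs a wrapper whose behavior is deterministic enough to assert on. The sketch below retries with capped exponential backoff and full jitter, then returns an escalation result instead of raising. TransientError, the result shape, and the default delays are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a provider 429/503 (hypothetical)."""


def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Retry fn on TransientError, then escalate instead of failing silently."""
    for attempt in range(max_attempts):
        try:
            return {"status": "ok", "result": fn()}
        except TransientError:
            if attempt == max_attempts - 1:
                break
            # Full jitter: random wait up to the capped exponential backoff.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    return {"status": "escalated", "reason": "provider unavailable"}
```

Injecting sleep and rng lets the chaos test run instantly and check both outcomes: recovery after a couple of failures, and a clean hand-off when the provider stays down.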

How distributed systems change the picture

Most production agent stacks are already distributed: API gateways, async job queues, multiple model routes, retrieval services, observability pipelines, and human approval queues. That shifts testing in several ways.

From single-process to workflow-level SLOs

You no longer ship a function — you ship a workflow graph. Define SLOs per workflow (success rate, latency, cost) and test the graph as a whole. A fast LLM behind a slow queue still misses the SLO.

Consistency and idempotency

Distributed agents retry. Tool calls must be idempotent or guarded with deduplication keys. Tests should replay the same event twice and assert you don't double-charge, double-email, or double-update records.
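A replay test can be sketched against a toy tool that deduplicates on an idempotency key. ChargeTool and its fields are hypothetical stand-ins for whatever side-effecting tool your agent calls.

```python
class ChargeTool:
    """Toy payment tool that deduplicates on an idempotency key."""

    def __init__(self):
        self._seen = {}    # idempotency_key -> receipt already issued
        self.charges = []  # side effects that actually happened

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A replayed event returns the original receipt with no new side effect.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        receipt = {
            "customer_id": customer_id,
            "amount_cents": amount_cents,
            "charge_no": len(self.charges) + 1,
        }
        self.charges.append(receipt)
        self._seen[idempotency_key] = receipt
        return receipt
```

The test delivers the same event twice and asserts that exactly one charge exists and that both calls return the same receipt.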

Eventual consistency in memory and state

Session memory, vector indexes, and feature flags update asynchronously. Harness tests should include stale-read scenarios: a user updates a preference and the agent still retrieves the old value for N seconds. Decide whether that window is acceptable for the workflow, then assert on it.
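A stale-read scenario is easier to test with a fake clock than with real sleeps. This toy store makes writes visible only after a fixed lag; the class, the lag value, and the explicit now argument are assumptions for illustration.

```python
class LaggedStore:
    """Toy store whose writes become visible only after lag_s seconds."""

    def __init__(self, lag_s):
        self.lag_s = lag_s
        self._versions = {}  # key -> list of (visible_at, value), in write order

    def write(self, key, value, now):
        self._versions.setdefault(key, []).append((now + self.lag_s, value))

    def read(self, key, now):
        # Return the most recent write that has become visible by `now`.
        visible = [v for t, v in self._versions.get(key, []) if t <= now]
        return visible[-1] if visible else None
```

The harness can then assert two things: inside the lag window the agent behaves acceptably with the old value, and once the staleness budget has elapsed the new value is always read.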

Observability as a test artifact

In distributed agent systems, traces are part of the contract. Assertions on spans help you catch regressions unit tests miss:

Unexpected extra model calls after a refactor
Missing approval step before a write tool
Cross-region latency spikes on retrieval
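The first two checks in that list can be sketched as plain functions over exported spans. The span shape here (dicts with name, kind, and start) is a simplified assumption, not the schema of any particular tracing library.

```python
def count_model_calls(spans):
    """Number of spans recorded for model invocations."""
    return sum(1 for s in spans if s["kind"] == "model")


def writes_without_approval(spans):
    """Names of write-tool spans that started before any approval span."""
    approved = False
    offenders = []
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["kind"] == "approval":
            approved = True
        elif span["kind"] == "tool.write" and not approved:
            offenders.append(span["name"])
    return offenders
```

A golden task can pin the expected model-call count and require writes_without_approval to be empty, so a refactor that adds a round-trip or reorders the approval step fails the build.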

Multi-tenant isolation

Load tests must include noisy neighbor patterns: one tenant's heavy job should not exhaust shared rate limits for others. Security tests must verify tenant A's context never appears in tenant B's retrieval results.
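The isolation check can be expressed as an assertion on retrieval results. In this sketch the in-memory index, the substring match, and the tenant_id field are all stand-ins for a real retrieval service.

```python
def retrieve(index, tenant_id, query):
    """Toy retrieval: substring match restricted to one tenant's documents."""
    q = query.lower()
    return [
        doc for doc in index
        if doc["tenant_id"] == tenant_id and q in doc["text"].lower()
    ]


def cross_tenant_leaks(results, tenant_id):
    """IDs of any returned documents that belong to a different tenant."""
    return [doc["id"] for doc in results if doc["tenant_id"] != tenant_id]
```

Running cross_tenant_leaks over every retrieval in the security suite, and requiring an empty list, catches a dropped tenant filter even when the agent's final answer looks harmless.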

Practical harness blueprint (start here)

You don't need a perfect platform on day one. A useful v1 harness usually includes:

Golden task suite (20–50 representative jobs) with schema and trajectory checks
Nightly eval batch with pass-rate and quality-score thresholds
Token and latency budgets enforced in CI on a subset of golden tasks
Security corpus run on every release candidate
Monthly chaos drill on staging (provider failures, tool timeouts, queue backlog)
Load test gate before major traffic events (product launch, holiday support)

Wire this into the same pipeline that builds agent-led features — not as a post-launch audit. The teams that win treat eval infrastructure as product infrastructure, not as research overhead.

Closing: speed with guardrails

Agent-led development can stay fast. The harness isn't there to slow you down — it's there to stop you from confusing demonstration success with operational success.

If you're moving from pilots to production agents, start by picking one workflow and defining its envelope: quality floor, cost ceiling, latency target, and failure behavior. Then build tests that fail loudly when you leave that envelope — before your customers find out.

Need help designing an agent validation harness for your stack? I work with teams to map golden tasks, SLOs, and CI gates that match real business risk, not checkbox testing.