Architecture Overview
Agent evaluations provide a systematic framework for measuring AI agent performance across non-deterministic outputs. The core pipeline flows from task definition through an agent harness that produces transcripts, which are then scored by one or more graders to produce outcomes. Suites of hundreds to thousands of tasks run in parallel, with pass@k and pass^k metrics capturing both capability ceilings and reliability floors. Anthropic advocates a Swiss Cheese model in which automated evals, production monitoring, A/B testing, user feedback, manual review, and human studies form complementary layers — no single method catches everything.
Based on *Demystifying evals for AI agents* (Jan 2026)
```mermaid
flowchart TB
    subgraph Suite["Evaluation Suite"]
        direction TB
        TASK["Task<br/>Input + expected behavior"]
        TRIAL1["Trial 1"]
        TRIAL2["Trial 2"]
        TRIALN["Trial N"]
    end
    subgraph Harness["Agent Harness"]
        direction TB
        ENV["Environment Setup<br/>Isolated sandbox per trial"]
        AGENT["Agent Under Test<br/>Model + tools + prompts"]
        EXEC["Execution<br/>Multi-step tool use"]
    end
    subgraph Transcript["Transcript"]
        direction TB
        STEPS["Step-by-step log<br/>Actions, tool calls, outputs"]
    end
    subgraph Graders["Grading Layer"]
        direction LR
        CODE["Code-Based Grader<br/>Deterministic checks,<br/>unit tests, regex"]
        MODEL["Model-Based Grader<br/>LLM judge with rubric,<br/>semantic matching"]
        HUMAN["Human Grader<br/>Gold-standard review,<br/>calibration baseline"]
    end
    subgraph Outcome["Outcome and Metrics"]
        direction LR
        SCORE["Score per Trial<br/>Pass/fail or 0-1"]
        PASSK["pass@k<br/>At least 1 success in k trials"]
        PASSK2["pass^k<br/>All k trials succeed"]
    end
    TASK --> TRIAL1
    TASK --> TRIAL2
    TASK --> TRIALN
    TRIAL1 --> ENV
    TRIAL2 --> ENV
    TRIALN --> ENV
    ENV --> AGENT --> EXEC
    EXEC --> STEPS
    STEPS --> CODE
    STEPS --> MODEL
    STEPS --> HUMAN
    CODE --> SCORE
    MODEL --> SCORE
    HUMAN --> SCORE
    SCORE --> PASSK
    SCORE --> PASSK2
    style Suite fill:#1c2333,stroke:#58a6ff,stroke-width:2px,color:#e6edf3
    style Harness fill:#1c2333,stroke:#3fb950,stroke-width:2px,color:#e6edf3
    style Transcript fill:#1c2333,stroke:#d29922,stroke-width:2px,color:#e6edf3
    style Graders fill:#1c2333,stroke:#bc8cff,stroke-width:2px,color:#e6edf3
    style Outcome fill:#1c2333,stroke:#39d2c0,stroke-width:2px,color:#e6edf3
```
Eval Terminology
| Term | Definition |
|---|---|
| Task | A single problem instance with defined inputs and success criteria. The atomic unit of evaluation. |
| Trial | One execution of a task by the agent. Multiple trials per task capture non-deterministic variation. |
| Grader | A function that scores agent output against expected behavior. Can be code-based, model-based, or human. |
| Transcript | The full step-by-step record of an agent's actions, tool calls, and outputs during a trial. |
| Outcome | The grader's verdict on a trial — typically pass/fail or a score between 0 and 1. |
| Evaluation harness | The infrastructure that orchestrates task selection, trial execution, grading, and metric aggregation. |
| Agent harness | The wrapper that provides the agent with its environment (tools, sandbox, credentials) for each trial. |
| Evaluation suite | A curated collection of hundreds to thousands of tasks designed to measure a specific capability or track regressions. |
Types of Graders
Choosing the right grader type is a core design decision. Most production eval suites use a combination, with code-based graders for objective checks and model-based graders for subjective quality assessment.
| Dimension | Code-Based | Model-Based | Human |
|---|---|---|---|
| Methods | Unit tests, regex matching, string comparison, assertion checks, execution-based verification | LLM-as-judge with rubric, pairwise comparison, semantic similarity scoring | Expert review with scoring rubric, A/B preference testing, calibration sessions |
| Strengths | Fast, deterministic, zero cost per run, perfectly reproducible, no false positives from grader errors | Handles open-ended outputs, captures nuance and partial credit, scales to 1000s of tasks | Gold-standard accuracy, catches subtle quality issues, provides calibration baseline for other graders |
| Weaknesses | Cannot assess subjective quality, brittle to valid alternative solutions, high upfront authoring cost | Non-deterministic (grader itself varies), can hallucinate scores, requires careful prompt engineering | Expensive ($5-50 per task), slow (hours to days), does not scale, inter-rater disagreement |
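Two of the code-based methods from the table, regex matching and execution-based verification, can be sketched in a few lines. The function names and file layout here are assumptions for illustration, not from the article or any benchmark's actual harness.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def grade_by_regex(output: str, pattern: str) -> bool:
    """Deterministic check: does the agent's output match an expected pattern?"""
    return re.search(pattern, output) is not None


def grade_by_tests(generated_code: str, test_code: str) -> bool:
    """Execution-based check: run a test script against generated code
    in a throwaway directory. Pass/fail comes from the exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(generated_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            [sys.executable, "test_solution.py"],
            cwd=tmp, capture_output=True,
        )
        return result.returncode == 0
```

Both graders are deterministic and free to rerun, which is what makes them the default choice for verifiable outputs.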
Evaluating Agent Types
Different agent categories require distinct eval strategies because their outputs, environments, and failure modes vary significantly.
| Agent Type | Focus | Key Benchmarks | Grading Strategy |
|---|---|---|---|
| Coding Agents | End-to-end code generation, bug fixing, test writing | SWE-bench (resolve real GitHub issues), HumanEval | Code-based: run test suites against generated code. Pass/fail based on test outcomes. |
| Conversational Agents | Multi-turn dialogue, tool use in conversation, user satisfaction | tau-Bench (simulated retail and airline customer service) | Model-based: LLM judges conversation quality. Hybrid with code checks for tool call correctness. |
| Research Agents | Information gathering, synthesis, multi-step web browsing | BrowseComp (find hard-to-locate facts on the web) | Code-based for factual accuracy (exact match). Model-based for synthesis quality and completeness. |
| Computer Use Agents | GUI interaction, form filling, multi-app workflows | WebArena (web tasks), OSWorld (desktop OS tasks) | Code-based: check final application state (DOM, file system). Screenshot comparison for visual tasks. |
Non-Determinism: pass@k vs pass^k
AI agents are inherently non-deterministic — the same task can produce different results across runs. Two complementary metrics capture different aspects of this variability:
- pass@k (capability ceiling): The probability that at least 1 of k independent trials succeeds. Answers "can the agent do this at all?" A task with pass@5 = 80% means at least one of five attempts succeeds 80% of the time. Useful for capability evals, where you want to know the frontier of what the agent can achieve.
- pass^k (reliability floor): The probability that all k independent trials succeed. Answers "can the agent do this reliably?" A task with pass^5 = 30% means all five attempts succeed only 30% of the time. Useful for regression evals and production readiness, where every run must succeed.
- The gap between pass@k and pass^k reveals flakiness: If pass@5 is 90% but pass^5 is 20%, the agent has the capability but lacks reliability. This gap is the signal to investigate — read transcripts of failing trials to find the failure modes causing inconsistency.
- Practical guidance: Run at least 3-5 trials per task. For high-stakes production decisions, run more. Track both metrics over time — a rising pass@k with flat pass^k means the agent is getting more capable but not more reliable.
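Given k recorded trial outcomes per task, both metrics reduce to simple aggregates over the suite. A minimal sketch (empirical rates, not the unbiased combinatorial estimator some benchmarks use):

```python
def suite_metrics(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Given k trial outcomes per task, return (pass@k, pass^k) as the
    fraction of tasks where at least one / all of the k trials passed."""
    n = len(results)
    at_k = sum(any(trials) for trials in results.values()) / n      # capability
    caret_k = sum(all(trials) for trials in results.values()) / n   # reliability
    return at_k, caret_k
```

The gap `at_k - caret_k` is the flakiness signal described above: large when the agent sometimes succeeds but rarely succeeds every time.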
The 8-Step Eval Roadmap
Anthropic recommends this progression for building an eval suite, from first prototype to production-grade evaluation infrastructure:
- Start early: Build evals from day one, not after the agent is "ready." Even 5-10 hand-written tasks provide signal that manual testing cannot.
- Harvest manual tests: Every time you manually test the agent, convert that interaction into a reproducible eval task. Your manual testing backlog is your best task source.
- Write unambiguous tasks: Each task must have a clear, objectively verifiable success criterion. Ambiguous tasks produce noisy signals that waste engineering time.
- Build balanced problem sets: Include easy (sanity checks), medium (core capability), and hard (frontier) tasks. An all-hard suite cannot distinguish "completely broken" from "slightly degraded."
- Stabilize the harness: Invest in reproducible environments — deterministic setup, isolated sandboxes, pinned dependencies. Flaky infrastructure produces flaky results indistinguishable from agent flakiness.
- Design thoughtful graders: Match grader type to output type. Use code-based graders for verifiable outputs, model-based for subjective quality. Calibrate model graders against human judgments.
- Read transcripts: Aggregate metrics tell you what failed; transcripts tell you why. Regularly read 10-20 failing transcripts to identify systematic failure patterns.
- Monitor saturation: When pass rates approach 90-95%, the suite is saturating. Add harder tasks or move to the next capability frontier. A saturated eval gives false confidence.
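Several of the steps above (reproducible environments, isolation per trial, multiple trials per task) come together in the harness loop. This is a sketch under stated assumptions: `run_agent` and `grade` stand in for your agent and grader, and a temporary directory stands in for a real sandbox.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_suite(
    tasks: dict[str, str],                   # task_id -> prompt (illustrative shape)
    run_agent: Callable[[str, Path], str],   # (prompt, sandbox) -> transcript
    grade: Callable[[str, Path], bool],      # (transcript, sandbox) -> pass?
    k: int = 5,
) -> dict[str, list[bool]]:
    """Run k independent trials per task, each in a fresh sandbox,
    and collect per-trial pass/fail outcomes."""
    results: dict[str, list[bool]] = {}
    for task_id, prompt in tasks.items():
        outcomes = []
        for _ in range(k):
            # Fresh directory per trial: no state leaks between trials.
            with tempfile.TemporaryDirectory() as tmp:
                sandbox = Path(tmp)
                transcript = run_agent(prompt, sandbox)
                outcomes.append(grade(transcript, sandbox))
        results[task_id] = outcomes
    return results
```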
Key Design Decisions
| Decision | Chosen Approach | Alternative | Rationale |
|---|---|---|---|
| Grading approach | Code-based graders for verifiable outputs | Model-based (LLM-as-judge) | Code graders are deterministic and free to run. Use model graders only when output is subjective or open-ended — they add non-determinism to the eval itself. |
| Eval type | Capability evals for development, regression evals for CI | Single eval suite for both | Capability evals explore frontiers (pass@k matters); regression evals guard against breakage (pass^k matters). Mixing them conflates "can it?" with "does it still?" |
| Success metric | pass@k for capability, pass^k for reliability | Single pass rate (1 trial per task) | Single-trial pass rate conflates capability with luck. Running k=3-5 trials and tracking both metrics separates "the agent can do it" from "the agent reliably does it." |
| Task design | Narrow, specific tasks with unambiguous success criteria | Broad, realistic end-to-end scenarios | Narrow tasks produce cleaner signal for development. Broad tasks are better for final validation but harder to grade and debug. Start narrow, add broad tasks later. |
| Environment | Isolated sandbox per trial (fresh state) | Shared state across trials | Shared state creates ordering dependencies — trial 3 might pass only because trial 2 left artifacts. Isolation ensures each trial is an independent measurement. |
| Grader calibration | Human-calibrated model graders (validate against human scores) | Fixed rubric without human baseline | Model graders drift and hallucinate scores. Periodic human calibration (score 50-100 tasks manually) catches grader bugs before they corrupt metrics. |
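The grader-calibration row can be made concrete: score a sample of tasks with both the model grader and a human, then compute agreement and surface the disagreements for review. A minimal sketch, assuming binary pass/fail verdicts keyed by task id; the 90% threshold is an illustrative default, not a figure from the article.

```python
def calibrate(
    model_scores: dict[str, bool],
    human_scores: dict[str, bool],
) -> tuple[float, list[str]]:
    """Agreement rate between a model grader and a human baseline on the
    tasks both scored, plus the task ids where they disagree."""
    shared = model_scores.keys() & human_scores.keys()
    disagreements = [t for t in shared if model_scores[t] != human_scores[t]]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, sorted(disagreements)
```

If agreement drops below your threshold (say 90%), the disagreement list tells you exactly which transcripts to read to find the grader bug or rubric gap.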
Interview Talking Points
- Agent evals require fundamentally different infrastructure than model evals: Traditional model evals test single input-output pairs. Agent evals must orchestrate multi-step execution with tool use, manage sandboxed environments, capture full transcripts, and handle non-deterministic outputs across 3-5+ trials per task. The evaluation harness is itself a complex system.
- The pass@k vs pass^k gap is the most actionable metric for agent reliability: If pass@5 = 90% but pass^5 = 20%, the agent has the capability, yet only 20% of five-trial runs succeed on every trial. This 70-point gap directly quantifies flakiness and tells you exactly where to invest: read the failing transcripts to find the systematic failure mode.
- Code-based graders should be the default, not model-based: Model-based graders add a second source of non-determinism on top of the agent's own variability. For coding agents, SWE-bench uses test suite execution — deterministic, zero-cost, and perfectly reproducible. Reserve LLM judges for genuinely subjective outputs like conversation quality.
- The Swiss Cheese model of evaluation layers catches what no single method can: Automated evals, production monitoring, A/B testing, user feedback, manual review, and human studies each have blind spots. Like Swiss cheese slices, each layer has holes — but stacking 6 layers makes it unlikely a serious regression passes through all of them undetected.
- Eval suite saturation is a hidden failure mode: When pass rates hit 90-95%, teams celebrate — but the eval has stopped providing signal. Saturated benchmarks give false confidence. The fix is continuously adding harder tasks and retiring solved ones, treating the eval suite as a living system that evolves with the agent.
- Harvesting manual tests is the highest-ROI eval strategy: Every manual test session generates 5-10 potential eval tasks for free. Teams that systematically convert manual testing into automated evals build 100+ task suites in weeks. Teams that try to design eval suites from scratch often stall at 20-30 tasks.
- Environment isolation is non-negotiable for meaningful metrics: Shared state across trials creates hidden dependencies — trial 3 passes because trial 2 left a file behind. Running each trial in a fresh sandbox with pinned dependencies is the only way to ensure each measurement is independent. The infrastructure cost pays for itself in debuggability.
- Transcript reading is the most underrated eval practice: Aggregate metrics show what failed but not why. Reading 10-20 failing transcripts per week reveals systematic patterns — the agent always fails at the same step, misuses a specific tool, or gives up too early. These patterns are invisible in dashboards but obvious in transcripts.