Anthropic

Multi-Agent System — Orchestrating parallel agents with context engineering

Architecture Overview

Anthropic's multi-agent architecture composes simple, well-defined patterns — orchestrator-workers, parallel exploration, tool use, and context engineering — into a system where multiple Claude agents cooperate on complex research queries. An outer agent harness provides checkpoint/resume durability for long-running tasks that outlive a single context window.

```mermaid
graph TD
  subgraph Harness["Agent Harness (Long-Running Durability)"]
    direction TB
    CP["Checkpoint / State Serialization"]
    RS["Resume & Retry on Failure"]
    MW["Multi-Context-Window Persistence"]
  end
  UQ["User Query"] --> Router["Router Agent<br/>(Classify & Route)"]
  Router --> Orchestrator["Orchestrator Agent<br/>(Decompose into subtasks)"]
  subgraph ParallelExploration["Parallel Exploration (Fan-Out)"]
    direction LR
    W1["Worker Agent A<br/>(Angle 1)"]
    W2["Worker Agent B<br/>(Angle 2)"]
    W3["Worker Agent C<br/>(Angle 3)"]
  end
  Orchestrator --> W1
  Orchestrator --> W2
  Orchestrator --> W3
  subgraph ToolUse["Tool Use Layer"]
    direction TB
    Think["Think Tool<br/>(Stop & Reason)"]
    MCP["MCP Code Execution<br/>(Write code to invoke tools)"]
    DirectAPI["Direct Tool Calls<br/>(Search, Retrieve, Execute)"]
  end
  W1 --> Think
  W2 --> MCP
  W3 --> DirectAPI
  subgraph ContextEng["Context Engineering Layer"]
    direction TB
    Summarize["Summarization<br/>(Compress prior turns)"]
    Retrieve["Retrieval<br/>(Pull relevant context)"]
    Curate["Curate & Write Context<br/>(Finite window management)"]
  end
  Think --> Summarize
  MCP --> Retrieve
  DirectAPI --> Curate
  Summarize --> Fusion["Fusion Agent<br/>(Aggregate & Synthesize)"]
  Retrieve --> Fusion
  Curate --> Fusion
  subgraph QualityLoop["Evaluator-Optimizer Loop"]
    direction LR
    Eval["Evaluator Agent<br/>(Score quality)"]
    Refine["Optimizer Agent<br/>(Refine output)"]
    Eval -->|"Below threshold"| Refine
    Refine -->|"Re-evaluate"| Eval
  end
  Fusion --> Eval
  Eval -->|"Meets threshold"| FinalAnswer["Final Comprehensive Answer"]
  Harness -.->|"Wraps entire pipeline"| Router
  Harness -.->|"Checkpoints at each stage"| Fusion
  style Harness fill:#1a1a2e,stroke:#bc8cff,stroke-width:2px,color:#e6edf3
  style ParallelExploration fill:#1c2333,stroke:#58a6ff,stroke-width:2px,color:#e6edf3
  style ToolUse fill:#1c2333,stroke:#3fb950,stroke-width:2px,color:#e6edf3
  style ContextEng fill:#1c2333,stroke:#d29922,stroke-width:2px,color:#e6edf3
  style QualityLoop fill:#1c2333,stroke:#f85149,stroke-width:2px,color:#e6edf3
  style UQ fill:#222d3f,stroke:#58a6ff,color:#e6edf3
  style FinalAnswer fill:#222d3f,stroke:#3fb950,color:#e6edf3
  style Router fill:#222d3f,stroke:#bc8cff,color:#e6edf3
  style Orchestrator fill:#222d3f,stroke:#bc8cff,color:#e6edf3
  style Fusion fill:#222d3f,stroke:#d29922,color:#e6edf3
```

Composable Agent Patterns

Anthropic advocates starting with the simplest pattern that works and adding complexity only when demonstrably needed. Each pattern is a building block, not a framework.

| Pattern | Structure | When to Use | Example |
| --- | --- | --- | --- |
| Prompt Chaining | LLM call → gate → LLM call (sequential) | Tasks decomposable into fixed sequential steps | Generate outline → validate structure → write full draft |
| Routing | Classify input → route to specialized handler | Distinct input categories needing different processing | Customer query → billing / technical / account handler |
| Parallelization | Fan-out multiple LLM calls → aggregate results | Independent subtasks that can run concurrently | Research query explored from 3 angles simultaneously |
| Orchestrator-Workers | Central LLM delegates subtasks to worker LLMs | Complex tasks requiring dynamic subtask decomposition | Code refactor: orchestrator plans, workers edit files |
| Evaluator-Optimizer | Generator LLM + evaluator LLM iterate in a loop | Tasks with clear quality criteria and iterative improvement | Generate code → evaluate correctness → refine until passing |
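The parallelization pattern in the table above can be sketched with a thread pool: fan out one call per angle, then aggregate with a fusion call. This is a minimal illustration; `call_llm` is a placeholder standing in for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (e.g. an Anthropic Messages API request).
    return f"findings for: {prompt}"

def parallel_explore(query: str, angles: list[str]) -> str:
    # Fan-out: each angle is explored by an independent worker call.
    prompts = [f"Research '{query}' from the angle of {a}" for a in angles]
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        results = list(pool.map(call_llm, prompts))
    # Aggregate: a final fusion call synthesizes the independent results.
    fusion_prompt = "Synthesize these findings:\n" + "\n".join(results)
    return call_llm(fusion_prompt)

answer = parallel_explore("agent durability", ["cost", "latency", "reliability"])
```

Because the workers never see each other's prompts or outputs, each angle is explored without anchoring on the others' conclusions.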

Tool Use Architecture

Agents are defined as LLM + tools in a loop with a stopping condition. Anthropic's tool use design scales from simple direct calls to sophisticated code-generation-based invocation.

  • "Think" Tool: A special tool that lets the agent pause and reason explicitly during complex multi-step tool use. The model outputs its chain-of-thought as a tool call, then continues execution. This improves accuracy on tasks requiring careful planning before acting.
  • Direct Tool Calls: Traditional function-calling interface — the agent selects a tool name and provides structured arguments. Simple and reliable for well-defined APIs with a small tool set.
  • MCP Code Execution: Instead of calling tools directly, the agent writes code that invokes tools programmatically. This scales far better when the tool surface is large (hundreds of endpoints) because the agent can compose, loop, and branch over tool calls in code rather than making one-at-a-time function calls.
  • Dynamic Tool Discovery: The agent discovers available tools at runtime rather than having all tool schemas loaded into the prompt. This keeps the context window lean and enables open-ended tool ecosystems.
  • Self-Optimizing Tooling: Claude is used to evaluate and improve its own tool definitions — rewriting descriptions, parameter schemas, and examples to maximize the success rate of tool invocations.
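The "LLM + tools in a loop with a stopping condition" definition can be made concrete in a few lines. This is a sketch, not a real client: `call_llm` is a hard-coded stand-in for the model's policy, and the tool registry holds a single toy tool.

```python
# Toy tool registry; a real system would expose search, retrieval, execution, etc.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def call_llm(messages: list[dict]) -> dict:
    # Placeholder policy standing in for the model: request one tool call,
    # then emit a final answer once a tool result appears in the transcript.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "final", "content": f"The sum is {messages[-1]['content']}"}
    return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):          # stopping condition: step budget
        action = call_llm(messages)
        if action["type"] == "final":   # stopping condition: task complete
            return action["content"]
        result = TOOLS[action["name"]](action["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "step budget exhausted"
```

Everything else in this section — the Think tool, MCP code execution, dynamic discovery — slots into this same loop as a different tool registry or a different invocation style.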

Context Engineering

The context window is a finite, valuable resource. Context engineering is the discipline of curating what goes into it — writing the context rather than blindly passing everything in.

  • Summarization: Compress prior conversation turns and intermediate results into concise summaries. Long multi-turn exchanges are periodically condensed so the agent retains relevant history without exhausting the window.
  • Retrieval-Augmented Context: Pull in only the most relevant documents, code snippets, or knowledge — scored by relevance — rather than including entire corpora. The retrieval system acts as a gatekeeper for the context window.
  • Tool-Augmented Context: Rather than pre-loading all potentially useful information, the agent uses tools to fetch exactly what it needs on demand, keeping the base context minimal.
  • Write, Don't Pass: Explicitly author context instructions, system prompts, and structured data. Hand-crafted context consistently outperforms naive "dump everything in" approaches.
  • Context Budgeting: Allocate portions of the context window to different purposes — system instructions, retrieved knowledge, conversation history, scratch space — and enforce limits on each.
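Context budgeting, the last bullet above, can be sketched as a per-section limit enforced when the prompt is assembled. The section names, budgets, and word-based "tokens" here are illustrative assumptions; production systems would use a real tokenizer and summarize rather than truncate.

```python
def fit_budget(sections: dict[str, str], budgets: dict[str, int]) -> str:
    # Each section gets its own budget; tokens are approximated here
    # by whitespace-split words for illustration.
    parts = []
    for name, text in sections.items():
        kept = text.split()[:budgets[name]]  # truncate; real systems would summarize
        parts.append(f"## {name}\n" + " ".join(kept))
    return "\n\n".join(parts)

prompt = fit_budget(
    {"system": "You are a research agent. " * 10,
     "history": "User asked about X. Agent replied Y. " * 50,
     "retrieved": "Relevant doc snippet. " * 30},
    {"system": 20, "history": 40, "retrieved": 25},
)
```

The key property is that no single section (a long history, a large retrieved document) can crowd out the others.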

Agent Harness for Long-Running Tasks

Real-world agent tasks often exceed a single context window in duration or complexity. The agent harness applies patterns from traditional software engineering — checkpointing, state serialization, and resumable workflows — to make agents durable.

  • Multi-Context-Window Persistence: An agent can span multiple context windows by serializing its state (progress, intermediate results, plan) at checkpoints. A new context window is initialized with the serialized state, allowing the agent to continue seamlessly.
  • Checkpointing: At defined milestones (after each subtask, after each worker returns, after each evaluation loop), the harness saves the agent's full state. If the process crashes or the context window fills, work is not lost.
  • Failure Handling & Retries: When a tool call fails, an API times out, or an agent produces low-quality output, the harness retries the specific failed step rather than restarting the entire pipeline. Partial progress is preserved.
  • State Serialization: Agent state is represented as a structured object (task plan, completed steps, accumulated results, pending subtasks) that can be serialized to storage and deserialized into a fresh context window.
  • Graceful Degradation: If an agent cannot complete a subtask after retries, the harness returns partial results with clear annotations about what succeeded and what did not, rather than failing silently.

Key Design Decisions

| Decision | Chosen Approach | Alternative | Rationale |
| --- | --- | --- | --- |
| Composition model | Simple composable patterns (chaining, routing, parallelization) | Heavyweight agent framework (AutoGen, CrewAI, LangGraph) | Composable primitives are easier to debug, test, and reason about. Frameworks add abstraction layers that obscure failure modes. Start simple; add complexity only when measured improvement justifies it. |
| Exploration strategy | Parallel exploration + fusion (fan-out workers, merge results) | Sequential chain-of-thought (single agent explores all angles one by one) | Parallel exploration cuts wall-clock latency roughly in proportion to the number of workers. It also produces more diverse perspectives, since agents explore independently without anchoring on each other's early conclusions. |
| Tool invocation method | Code execution via MCP (agent writes code to call tools) | Direct API tool calls (structured function calling) | Code execution scales to large tool surfaces: the agent can loop, branch, and compose calls in code. Direct calls work for small tool sets but become unwieldy with hundreds of endpoints. Code also lets the agent handle data transformations inline. |
| Long-running durability | Multi-context-window persistence with checkpoint/resume | Single-shot: run everything in one context window | Complex research tasks exceed a single context window. Checkpoint/resume lets the agent survive crashes, handle context-window limits, and maintain progress across arbitrarily long tasks without information loss. |
| Context management | Curated context engineering (summarize, retrieve, budget) | Pass full history and all documents into context | Context windows are finite, and performance degrades with irrelevant content. Curated context keeps the signal-to-noise ratio high, improves accuracy, and leaves room for the agent's own reasoning. |
| Quality assurance | Evaluator-optimizer loop with explicit scoring | Single-pass generation with no self-evaluation | LLM outputs are non-deterministic. An evaluation loop catches errors and hallucinations, iteratively improving output until it meets a quality threshold. The cost of extra LLM calls is offset by reduced need for human review. |
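The quality-assurance row above — generate, score, refine until a threshold is met — can be sketched as a short loop. Both `generate` and `evaluate` are toy placeholders (the evaluator scores by length); real systems would back each with an LLM call and a quality rubric.

```python
def generate(task: str, feedback: str = "") -> str:
    # Placeholder generator: a real system calls an LLM; accumulated
    # feedback steers (here, simply extends) the next draft.
    return task + feedback

def evaluate(output: str) -> float:
    # Placeholder evaluator: scores by length, capped at 1.0.
    return min(len(output) / 20, 1.0)

def evaluator_optimizer(task: str, threshold: float = 0.9, max_rounds: int = 5) -> str:
    output, feedback = generate(task), ""
    for _ in range(max_rounds):             # bounded: avoids infinite refinement
        if evaluate(output) >= threshold:   # meets threshold: accept
            return output
        feedback += " [refined]"
        output = generate(task, feedback)
    return output                           # best effort after budget exhausted
```

Note the explicit round budget: without it, an evaluator that never scores above threshold would loop forever.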

Interview Talking Points

  • Start simple, add complexity only when needed: Anthropic's core philosophy is that agents should begin as single LLM calls with tools. Layer in orchestration patterns (routing, parallelization, evaluation loops) only when you can measure the improvement. This directly contrasts with the "reach for a framework first" instinct many candidates have.
  • An agent is LLM + tools + loop + stopping condition: This is the clearest mental model. The LLM decides what to do, tools execute actions, the loop continues until a stopping condition is met (task complete, quality threshold reached, or budget exhausted). Everything else is implementation detail.
  • Parallel exploration produces better results than sequential: When multiple agents explore different angles of a complex query simultaneously, you get diverse perspectives without anchoring bias. The fusion step synthesizes these into a more comprehensive answer than any single sequential chain would produce.
  • Context engineering is as important as model quality: The same model produces dramatically different results depending on what is in its context window. Strategies like summarization, retrieval-augmented context, and explicit context budgeting directly improve agent accuracy. In an interview, showing you think about context curation signals senior-level understanding.
  • Code execution for tool use is a scaling breakthrough: Direct function calling works for 5-10 tools, but breaks down at 100+. Having the agent write code that invokes tools lets it compose, loop, and branch — handling complex tool orchestration that structured function calls cannot express. This is a key architectural insight from Anthropic's MCP work.
  • Long-running agents need software engineering patterns: Checkpointing, state serialization, and resumable workflows are borrowed from distributed systems. The agent harness treats an LLM agent like a long-running process that can crash — saving progress, handling retries, and resuming from the last checkpoint. This is critical for production deployments.
  • The "Think" tool improves multi-step reasoning: Giving the agent an explicit mechanism to pause and reason (rather than rushing to the next tool call) measurably improves accuracy on complex tasks. It is a simple intervention — essentially a no-op tool that outputs chain-of-thought — but it prevents the agent from acting before thinking.
  • Evaluator-optimizer loops catch errors that single-pass generation misses: LLM outputs are stochastic. Adding a separate evaluator agent that scores output quality and triggers refinement loops brings reliability closer to production standards. The cost of extra LLM calls is typically far less than the cost of shipping hallucinated or incorrect results.
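The "Think" tool talking point above is concrete enough to sketch. Following the pattern Anthropic described, it is a tool whose handler does nothing; the value is that calling it forces the model to write out its reasoning before the next real action. The exact schema fields here are illustrative, not the official definition.

```python
# A no-op "think" tool in the Anthropic tool-schema style (illustrative).
THINK_TOOL = {
    "name": "think",
    "description": "Use this tool to think through complex steps before acting. "
                   "It does not retrieve new information or change any state.",
    "input_schema": {
        "type": "object",
        "properties": {"thought": {"type": "string"}},
        "required": ["thought"],
    },
}

def handle_tool_call(name: str, args: dict) -> str:
    if name == "think":
        return "ok"  # no-op: the reasoning already lives in args["thought"]
    raise ValueError(f"unknown tool: {name}")
```

Registering this alongside real tools is the entire intervention — no prompt changes, no extra orchestration.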