Problem Statement
Traditional RAG systems split documents into chunks and embed them for semantic retrieval. However, individual chunks often lose critical context once separated from their source document. For example, a chunk stating "Revenue increased 3% over the previous quarter" becomes nearly useless without knowing which company, which quarter, or which revenue stream is being discussed.
This context loss creates a specific failure mode: the relevant chunk exists in the knowledge base but cannot be retrieved, because neither its embedding nor its keywords carry enough information to match the user's query. Anthropic found this "failed retrieval" problem to be the dominant source of errors in production RAG systems, a larger error source than the generation step itself.
Architecture Overview
Contextual Retrieval addresses failed retrievals through three layered techniques: the first two are applied at ingestion time, the third at query time:
- Contextual Embeddings: Before embedding each chunk, an LLM is called with both the full document and the individual chunk. The LLM generates a short explanatory context snippet (e.g., "This chunk is from Acme Corp's Q2 2024 SEC filing, specifically the revenue breakdown section discussing North American operations"). This context is prepended to the chunk, and the enriched chunk is then embedded. The richer embedding captures the chunk's meaning within its document.
- Contextual BM25: The same enriched chunks are also indexed in a BM25 (keyword/lexical) index. This ensures that exact entity names, acronyms, and domain-specific terms that appear in the prepended context are searchable via keyword matching — catching cases where semantic similarity alone would fail.
- Hybrid Search + Reranking: At query time, the user's query is run against both the vector index (semantic) and the BM25 index (keyword). Results from both are merged, then passed through a reranker model (e.g., Cohere Rerank) that re-scores every candidate chunk against the original query. The top-K results after reranking are passed to the LLM for answer generation.
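The ingestion-time enrichment step can be sketched as follows. This is a minimal illustration, not Anthropic's exact implementation: the prompt wording, function names, and the injected `generate` callable (standing in for any LLM API call) are assumptions.

```python
# Enrichment sketch: an LLM sees the full document plus one chunk and
# returns a short situating snippet, which is prepended to the chunk
# before embedding and BM25 indexing. Prompt wording is illustrative.

CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Give a short context snippet to situate this chunk for search "
    "retrieval. Answer only with the context."
)

def contextualize_chunk(document: str, chunk: str, generate) -> str:
    """Return the chunk with an LLM-generated context snippet prepended."""
    prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
    context = generate(prompt).strip()
    return f"{context}\n\n{chunk}"

# Stubbed LLM call so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return "From Acme Corp's Q2 2024 SEC filing, revenue breakdown section."

enriched = contextualize_chunk(
    document="...full filing text...",
    chunk="Revenue increased 3% over the previous quarter.",
    generate=fake_llm,
)
# `enriched` is what gets embedded AND indexed in BM25, so the context
# snippet travels through both retrieval paths.
```

Because `generate` is injected, the same function works with any LLM provider; in production it would wrap a real API call with the document portion of the prompt cached.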
| Retrieval Strategy | Failed Retrieval Reduction | What It Adds |
|---|---|---|
| Contextual Embeddings only | 35% | Richer semantic representations via prepended context |
| Contextual Embeddings + BM25 (Hybrid) | 49% | Keyword matching catches exact terms embeddings miss |
| Hybrid + Reranking | 67% | Cross-attention reranker re-scores merged results with high precision |
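The query-time path (hybrid retrieval followed by reranking) can be sketched like this. Reciprocal rank fusion is one common way to merge the two result lists; the source does not specify Anthropic's exact fusion method, and the chunk IDs, stub scores, and `rerank_score` callable here are hypothetical.

```python
# Query-time sketch: merge vector and BM25 result lists with reciprocal
# rank fusion (RRF), then re-score the merged candidates with a
# cross-encoder-style `rerank_score` callable before taking top-K.

def rrf_merge(result_lists, k=60):
    """RRF: each doc scores sum(1 / (k + rank)) across all result lists."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query, vector_ids, bm25_ids, rerank_score, top_k=3):
    candidates = rrf_merge([vector_ids, bm25_ids])
    # Cross-encoder reranker: one forward pass per (query, chunk) pair,
    # so it only runs on the merged first-stage candidates.
    ranked = sorted(candidates, key=lambda d: rerank_score(query, d),
                    reverse=True)
    return ranked[:top_k]

# Toy demonstration with hypothetical chunk IDs and a stub reranker.
vector_hits = ["c1", "c2", "c3", "c4"]   # semantic index results
bm25_hits = ["c3", "c5", "c1"]           # keyword index results
stub_scores = {"c1": 0.2, "c2": 0.9, "c3": 0.8, "c4": 0.1, "c5": 0.5}
top = hybrid_retrieve("q", vector_hits, bm25_hits,
                      lambda q, d: stub_scores[d])
# top == ["c2", "c3", "c5"]
```

Note how `c5` surfaces only through the BM25 list and `c2` only ranks first after reranking: the merge widens recall, the reranker restores precision.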
Key Design Decisions
- Why hybrid search over embeddings-only? Embedding models compress meaning into dense vectors, which excel at capturing semantic similarity (synonyms, paraphrases) but can lose exact terms — especially rare entity names, product codes, or acronyms. BM25 is the inverse: it matches on exact token overlap and handles rare, specific terms well but misses paraphrases entirely. Combining both covers the full spectrum. Anthropic's data showed a 14 percentage-point improvement (35% to 49%) from adding BM25, confirming that a significant share of retrieval failures were keyword-match problems, not semantic ones.
- Why prepend context instead of other approaches? Alternative strategies include fine-tuning the embedding model on domain data, using larger chunk sizes, or adding metadata fields. Prepending context is appealing because it is model-agnostic (works with any off-the-shelf embedding model), requires no retraining, and directly addresses the root cause — the chunk text itself lacks context. The LLM-generated snippet acts as a "situational summary" that travels with the chunk through every downstream stage (embedding, BM25 indexing, reranking, generation).
- Why reranking matters: The initial retrieval step (both vector and BM25) uses lightweight, bi-encoder or term-matching models that score each chunk independently. A reranker uses a cross-encoder architecture: it jointly attends to the query and each candidate chunk together, producing much more accurate relevance scores. This is computationally expensive (O(n) forward passes for n candidates), so it is only applied to the top candidates after the fast first-stage retrieval. The jump from 49% to 67% reduction demonstrates that many relevant chunks are retrieved but ranked too low without reranking.
- Cost-quality tradeoff with prompt caching: Generating context for each chunk requires an LLM call with the full document as input — potentially expensive at scale. Anthropic's prompt caching optimization keeps the full document in the LLM's KV cache and only varies the per-chunk suffix. This reduces the marginal cost per chunk by approximately 90%, making it economically viable to enrich millions of chunks. The tradeoff is that caching works best when chunks from the same document are processed sequentially in a batch, requiring careful orchestration of the ingestion pipeline.
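The orchestration concern in the last point above can be sketched as a simple batching step. This is a hypothetical pipeline fragment, assuming only that same-document calls reuse the provider's prompt cache; the `enrich` callable and function names are illustrative.

```python
# Ingestion orchestration sketch: group chunks by source document and
# process them back-to-back, so the large document prefix stays in the
# LLM's KV cache and only the small per-chunk suffix varies per call.
from collections import defaultdict

def batch_by_document(chunks):
    """Group (doc_id, chunk_text) pairs so same-document chunks are adjacent."""
    by_doc = defaultdict(list)
    for doc_id, text in chunks:
        by_doc[doc_id].append(text)
    return by_doc

def ingest(chunks, enrich):
    enriched = []
    for doc_id, texts in batch_by_document(chunks).items():
        # Every call in this inner loop shares the cached document prefix;
        # only the chunk suffix changes between calls.
        for text in texts:
            enriched.append(enrich(doc_id, text))
    return enriched

# Toy run with a stub `enrich` that just tags each chunk with its doc id.
out = ingest([("a", "x"), ("b", "y"), ("a", "z")], lambda d, t: f"{d}:{t}")
```

Without this grouping, interleaving chunks from different documents would evict each document's prefix from the cache and forfeit most of the ~90% cost saving.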
Interview Talking Points
- Root cause framing: "The core insight is that failed retrievals — where the right chunk exists but isn't found — are more damaging than generation errors, and they stem from chunks losing context when separated from their source document."
- The enrichment step: "Each chunk is sent to an LLM alongside the full document, and the LLM generates a concise context snippet that is prepended to the chunk before embedding. This is a one-time ingestion cost, not a query-time cost."
- Hybrid search rationale: "Embeddings handle semantic similarity well but miss exact keyword matches. BM25 catches those. Combining them reduced failed retrievals by 49%, compared to 35% with embeddings alone, meaning roughly a fifth of the failures left after embeddings-only (14 of the remaining 65 percentage points) were keyword-match problems."
- Reranking as a precision layer: "A cross-encoder reranker jointly attends to query-chunk pairs, producing much more accurate relevance scores than bi-encoder retrieval. It's too expensive for the full corpus but highly effective on the top candidates from first-stage retrieval."
- Cost engineering: "Prompt caching keeps the document in the LLM's KV cache while varying only the chunk text per call, cutting enrichment cost by ~90%. This makes the approach practical at scale — you batch chunks from the same document sequentially to maximize cache hits."
- Layered improvement story: "Each technique compounds: contextual embeddings (35%), plus BM25 (49%), plus reranking (67%). In an interview, I'd frame this as a lesson in systematic ablation — each layer addresses a distinct failure mode."
- Model-agnostic design: "Prepending context works with any embedding model and any BM25 implementation. There's no fine-tuning required, which makes it easy to swap in better models later without re-engineering the pipeline."
- Tradeoff awareness: "The main tradeoff is ingestion latency and cost vs. retrieval quality. The context generation step adds an LLM call per chunk during ingestion, but prompt caching and batching mitigate this. Query-time latency is only marginally affected since the reranker runs on a small candidate set."