Cursor

Real-Time Code Completion System — 1M+ QPS LLM serving with speculative edits

```mermaid
graph TD
  subgraph Client["Client Layer (VS Code Fork — TypeScript/Electron)"]
    U["User types code"] --> CC["Context Collector"]
    CC -->|"Gathers surrounding code, imports, file tree"| ENC["Local Encryption"]
  end
  subgraph Infra["Infrastructure Layer"]
    CF["Cloudflare Reverse Proxy / TLS / DDoS"]
  end
  ENC --> CF
  CF --> BE["Monolithic Backend: TypeScript + Rust"]
  subgraph Server["Server Layer (AWS CPU + Azure H100 GPUs)"]
    BE --> DEC["Decrypt Request"]
    DEC --> ROUTER["Model Router"]
    ROUTER -->|"Autocomplete, low-latency"| FW["Fireworks Fine-tuned Models"]
    ROUTER -->|"Chat / Summarize"| OAI["OpenAI GPT Models"]
    ROUTER -->|"Reasoning"| ANT["Anthropic Claude Models"]
    ROUTER -->|"Multimodal"| GCP["Google Vertex AI Gemini"]
    FW --> SPEC["Speculative Edits Engine"]
    SPEC -->|"Fine-tuned Llama-3-70b, 13x speedup"| RESP["Response Builder"]
    OAI --> RESP
    ANT --> RESP
    GCP --> RESP
  end
  RESP --> CF
  CF -->|"Sub-second latency"| RENDER["Render Suggestions in Editor"]
  subgraph Indexing["Codebase Indexing Pipeline"]
    FILES["Local Codebase Files"] --> MERKLE["Merkle Tree Hash, sync every 3 min"]
    MERKLE -->|"Detect changed files"| CHUNK["Code Chunker"]
    CHUNK --> EMB["Embedding Model"]
    EMB --> TP["Turbopuffer Vector DB"]
  end
  TP -->|"Semantic code search for context retrieval"| CC
  subgraph Shadow["Shadow Workspace"]
    EDIT_REQ["Edit Request"] --> SW["Hidden VS Code Window"]
    SW -->|"AI applies edits"| LINT["Lint Check"]
    LINT -->|"Pass/Fail + diagnostics"| REPORT["Report Back to User"]
  end
  subgraph Training["Custom Training Pipeline"]
    UD["Real User Data"] --> CB["Cursor Bench Custom Benchmark"]
    CB --> RL["RL Training: PyTorch + Ray"]
    RL -->|"Thousands of GPUs"| CM["Composer Model"]
    CM -->|"Evaluates correctness + codebase abstraction adherence"| FW
  end
  subgraph Monitoring["Observability"]
    DD["Datadog Monitoring"]
    PG["PostgreSQL Primary Datastore"]
  end
  BE --> DD
  BE --> PG
  style Client fill:#1a2744,stroke:#58a6ff,stroke-width:2px,color:#e6edf3
  style Server fill:#1a2233,stroke:#bc8cff,stroke-width:2px,color:#e6edf3
  style Indexing fill:#1a2a22,stroke:#3fb950,stroke-width:2px,color:#e6edf3
  style Shadow fill:#2a2219,stroke:#d29922,stroke-width:2px,color:#e6edf3
  style Training fill:#2a1a22,stroke:#f85149,stroke-width:2px,color:#e6edf3
  style Monitoring fill:#1a2233,stroke:#8b949e,stroke-width:2px,color:#e6edf3
  style Infra fill:#1a2233,stroke:#39d2c0,stroke-width:2px,color:#e6edf3
```

Problem Statement

Build a real-time code completion system that serves 1M+ queries per second at peak, providing context-aware suggestions with sub-second latency. The system must deeply understand an entire codebase (not just the open file), support multi-model routing for different task types (autocomplete, chat, reasoning, code edits), and maintain strict user privacy through client-side encryption. Beyond simple completions, the system must also handle "speculative edits" — predicting and applying multi-line code changes — with enough speed that users experience them as instantaneous.

Core ML tasks: (1) Next-token code prediction at extreme throughput, (2) Semantic code retrieval via embeddings for context augmentation, (3) Reinforcement learning from real user interactions to continuously improve suggestion quality.
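
Task (2), semantic code retrieval, reduces at query time to nearest-neighbor search over embedding vectors. A minimal sketch, assuming cosine similarity and an in-memory index (the real system uses Turbopuffer and a dedicated embedding model; all names here are illustrative):

```typescript
// Toy semantic retrieval: rank pre-embedded code chunks by cosine
// similarity to a query embedding and return the top-k matches.
type Chunk = { path: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks most similar to the query embedding; these become
// extra context for the completion request.
function retrieveContext(query: number[], index: Chunk[], k: number): Chunk[] {
  return [...index]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k);
}
```

At production scale the exhaustive sort is replaced by an approximate nearest-neighbor index, but the interface is the same: query embedding in, relevant chunks out.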

Architecture Overview

  • Client (VS Code fork): TypeScript/Electron app collects code context — surrounding lines, imports, file tree structure — and encrypts it locally before transmission. The client manages the user experience and rendering of suggestions.
  • Codebase Indexing: Files are chunked, embedded, and stored in Turbopuffer (vector DB). A Merkle tree hash mechanism syncs every 3 minutes to detect file changes; only modified files are re-indexed. Raw code is never stored server-side — only embeddings — preserving privacy.
  • Infrastructure: Cloudflare provides reverse proxy, TLS termination, and DDoS protection. AWS handles CPU workloads; Azure provides tens of thousands of H100 GPUs dedicated to inference. The backend is a monolithic TypeScript + Rust service (Rust for performance-critical paths) backed by PostgreSQL.
  • Model Router: Requests are dispatched to the optimal model provider based on task type — Fireworks for low-latency autocomplete (custom fine-tuned models), OpenAI GPT for chat and summarization, Anthropic Claude for deep reasoning, and Google Vertex AI for Gemini access.
  • Speculative Edits: A novel variant of speculative decoding optimized for code. Because existing code provides a strong guess of the output, the system uses longer speculation windows. A fine-tuned Llama-3-70b served via Fireworks achieves a 13x speedup over vanilla decoding and 9x over GPT-4 on the "Fast Apply" task (planning + applying code changes).
  • Shadow Workspace: A hidden VS Code window is spawned where the AI performs edits, runs lint checks, and reports diagnostics — all without affecting the user's active coding session. This enables iterative AI refinement before suggestions are surfaced.
  • Custom Training: The Composer model is trained via reinforcement learning on real user data using PyTorch + Ray distributed across thousands of GPUs. "Cursor Bench," a custom benchmark built from real agent requests by real users, evaluates both correctness and adherence to codebase abstractions.
  • Monitoring: Datadog for full-stack observability. PostgreSQL (migrated from Yugabyte for simplicity) as the primary datastore.
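The Merkle-style sync described in the indexing bullet can be sketched as follows: hash every file, combine the leaf hashes into a root, and compare roots to decide whether any re-indexing is needed at all. This is a hedged, in-memory illustration (a flat variant with a single combine step rather than a full tree); all names are assumptions, not Cursor's actual code:

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

type Tree = { root: string; files: Map<string, string> }; // path -> file hash

// Hash each file, then combine the sorted leaf hashes into one root hash.
function buildTree(files: Map<string, string>): Tree {
  const hashes = new Map<string, string>();
  for (const [path, content] of files) hashes.set(path, sha256(path + content));
  const root = sha256([...hashes.entries()].sort().map(([, h]) => h).join(""));
  return { root, files: hashes };
}

// Return only the paths whose hashes differ; these are re-chunked and
// re-embedded, while unchanged files are skipped entirely.
function changedFiles(prev: Tree, next: Tree): string[] {
  if (prev.root === next.root) return []; // fast path: one comparison, no diff
  const changed: string[] = [];
  for (const [path, h] of next.files)
    if (prev.files.get(path) !== h) changed.push(path);
  return changed;
}
```

The root comparison is what makes the 3-minute sync cheap: in the common case where nothing changed, the client and server agree after exchanging a single hash.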

Key Design Decisions

| Decision | Why | Tradeoff |
| --- | --- | --- |
| Monolithic backend (single TypeScript + Rust service) | Maximizes developer velocity — a small team can iterate rapidly without cross-service coordination overhead | Sacrifices independent scaling of components; risk of coupling. Acceptable at their stage because development speed is the bottleneck, not operational complexity. |
| Multi-provider model routing (Fireworks, OpenAI, Anthropic, Google) | Each provider excels at a different task profile — latency-sensitive autocomplete vs. deep reasoning vs. multimodal understanding | Increased vendor dependency and integration complexity. Mitigated by custom fine-tuned models (via Fireworks) on the highest-QPS path (autocomplete), keeping the critical path under their control. |
| Speculative decoding with long speculation windows | Code edits have high predictability (existing code is a strong prior), so longer speculations pay off more than in general text generation | Wasted compute when speculation is wrong. The 13x speedup demonstrates that the hit rate for code is high enough to make this overwhelmingly worthwhile. |
| Merkle tree + 3-minute sync for indexing | Efficiently detects which files changed without diffing entire codebases; only re-indexes modified files | Up to 3 minutes of index staleness. Acceptable because context changes are usually in the active file (handled by real-time context), not the broader codebase. |
| Turbopuffer for vector storage (replacing prior solution) | Purpose-built vector DB optimized for embedding search at scale, with the right cost/performance profile for their access patterns | Newer, less battle-tested infrastructure. Offset by the privacy benefit of storing only embeddings (not raw code) and better query performance. |
| PostgreSQL over Yugabyte | Yugabyte's distributed complexity was unnecessary; PostgreSQL is simpler, better understood, and sufficient at their scale for relational data | Gives up automatic horizontal sharding. Justified because the relational data (user metadata, settings) doesn't require distributed scale — the heavy lifting is on the GPU inference side. |
| Client-side encryption + embeddings-only storage | Enterprise customers require strong privacy guarantees — code never persists on servers in raw form | Limits server-side debugging and data analysis capabilities. Necessary for enterprise adoption and user trust. |
| RL training on real user data with a custom benchmark | Standard code benchmarks (HumanEval, etc.) don't capture real-world coding patterns — editing existing code, following project conventions, multi-file changes | Expensive to collect and curate; potential bias toward power-user patterns. "Cursor Bench", built from real agent requests, provides far more realistic evaluation than synthetic benchmarks. |
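The speculative-edits intuition from the table can be made concrete with a toy model. In real speculative decoding, a cheap draft is verified by one batched forward pass of the large model; here the target edit is known, so verification is a plain prefix comparison. The existing file serves as the draft, and each "pass" stands in for one model call. All names are illustrative assumptions:

```typescript
// Accept the longest matching prefix of the draft, then emit one corrected
// token — the core step of speculative decoding, with the unedited source
// acting as the draft.
function speculativeStep(draft: string[], target: string[]): string[] {
  let accepted = 0;
  while (accepted < draft.length && accepted < target.length &&
         draft[accepted] === target[accepted]) accepted++;
  return target.slice(0, Math.min(accepted + 1, target.length));
}

// Count how many verification passes are needed to produce the full edit.
// Token-by-token decoding would need target.length passes; when source and
// target mostly agree, this needs roughly one pass per divergence point.
function fastApply(source: string[], target: string[]): number {
  let produced: string[] = [];
  let passes = 0;
  while (produced.length < target.length) {
    passes++;
    const draft = source.slice(produced.length); // remainder of original file
    produced = produced.concat(speculativeStep(draft, target.slice(produced.length)));
  }
  return passes;
}
```

For a 4-token file with a single changed token, this completes in 2 passes instead of 4 — and the gap widens with file length, which is why long speculation windows pay off so well for code.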

Interview Talking Points

  • Scale separation: Autocomplete (1M+ QPS, ~100ms budget) is a fundamentally different serving problem than chat/reasoning (~1 QPS per user, seconds acceptable). Cursor solves this by routing to different model providers — Fireworks for the latency-critical path, heavier models for deeper tasks.
  • Speculative decoding for code: Code edits are uniquely suited for speculative decoding because existing code acts as a strong prior. The system uses longer speculation windows than typical text generation, achieving 13x speedup with a fine-tuned Llama-3-70b. This is a great example of adapting a general ML technique to a domain-specific advantage.
  • Merkle tree indexing: Rather than re-indexing the entire codebase on every change, Cursor uses Merkle tree hashing (synced every 3 minutes) to detect only modified files. This is the same data structure Bitcoin uses for transaction verification — an elegant application of a well-known CS concept to reduce redundant embedding computation.
  • Privacy-preserving retrieval: Only embeddings are stored server-side, never raw code. This is a critical design constraint that shapes the entire architecture — you cannot do server-side code analysis, but you gain enterprise trust. Discuss how this constraint limits certain features (e.g., server-side refactoring analysis) while enabling others (enterprise adoption).
  • Shadow Workspace pattern: Spawning a hidden VS Code instance for the AI to iterate in is a clever separation of concerns — the AI gets a full IDE environment (linting, type checking) without risking the user's active session. This is an example of using existing tools as infrastructure rather than rebuilding them.
  • Monolith over microservices: Cursor deliberately chose a monolithic backend despite 1M+ QPS scale. This is a great discussion point about when microservices are premature — at their team size, developer velocity from a single deployable unit outweighs the operational benefits of independent scaling.
  • Custom benchmarks over standard ones: "Cursor Bench" uses real agent requests from real users, evaluating not just correctness but adherence to codebase abstractions and conventions. This is a strong stance on evaluation — standard benchmarks like HumanEval measure isolated function generation, not the messy reality of editing existing code across multiple files.
  • Infrastructure pragmatism: The migration from Yugabyte to PostgreSQL illustrates an underappreciated principle: choose the simplest technology that solves your problem. Distributed databases add operational complexity that wasn't justified by their relational data access patterns. The heavy scaling challenge is on the GPU inference side, not the metadata side.
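The scale-separation point above can be summarized as a routing table: each task class maps to a provider and a latency budget, and the latency-critical path stays on models under Cursor's control. The specific budgets and provider names below are illustrative assumptions, not Cursor's actual configuration:

```typescript
type Task = "autocomplete" | "chat" | "reasoning" | "multimodal";

interface Route {
  provider: string; // which backend serves this task class
  budgetMs: number; // illustrative latency budget, not a published figure
}

// Fireworks fine-tuned models own the 1M+ QPS autocomplete path; heavier
// frontier models handle lower-QPS, latency-tolerant tasks.
const ROUTES: Record<Task, Route> = {
  autocomplete: { provider: "fireworks", budgetMs: 100 },
  chat:         { provider: "openai",    budgetMs: 5_000 },
  reasoning:    { provider: "anthropic", budgetMs: 10_000 },
  multimodal:   { provider: "vertex-ai", budgetMs: 5_000 },
};

const route = (task: Task): Route => ROUTES[task];
```

The design point is that the two serving problems never share a budget: a slow reasoning request can't starve the autocomplete path, because they are routed before any model is invoked.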