Waymo

Autonomous Driving System — Perception, prediction, and planning at 200M+ autonomous miles

Architecture Overview

The Waymo Driver is a modular-hybrid autonomous driving system built on a Foundation Model with a "Think Fast, Think Slow" (System 1 / System 2) architecture. A sensor fusion encoder handles real-time perception, while a driving VLM (fine-tuned from Gemini) reasons about rare and complex scenarios. Large teacher models are distilled into efficient student models for onboard deployment at under 10ms latency.

Based on "Demonstrably Safe AI for Autonomous Driving" and "The Waymo World Model"

```mermaid
graph TD
    subgraph Sensors["6th-Gen Sensor Suite"]
        direction LR
        CAM["13 Cameras<br/>17MP, HDR"]
        LID["4 LiDARs<br/>360° FOV, 300m+"]
        RAD["6 Radars<br/>Imaging, all-weather"]
        EAR["Audio Receivers<br/>Siren detection"]
    end
    subgraph Foundation["Waymo Foundation Model"]
        direction TB
        subgraph Sys1["System 1: Think Fast"]
            SFE["Sensor Fusion Encoder<br/>Camera + LiDAR + Radar over time"]
            SFE --> OBJ["3D Objects + Semantics<br/>+ Rich Embeddings"]
        end
        subgraph Sys2["System 2: Think Slow"]
            VLM["Driving VLM<br/>Fine-tuned from Gemini"]
            VLM --> SEM["Complex Semantic Reasoning<br/>Novel/rare situations"]
        end
    end
    subgraph Pipeline["Perception → Prediction → Planning"]
        direction LR
        PERC["Perception<br/>3D Detection, Tracking,<br/>Lane/Sign/Light"]
        PRED["Prediction<br/>Multimodal Trajectory<br/>Forecasting (6-64 modes)"]
        PLAN["Planning<br/>IL + RL Hybrid<br/>Trajectory Generation"]
    end
    subgraph Safety["Safety Architecture"]
        direction LR
        VAL["Onboard Validation Layer<br/>Independent trajectory verification"]
        FALL["Deterministic Fallbacks<br/>Safety-critical path"]
    end
    subgraph Virtuous["Virtuous Cycle"]
        direction TB
        DRIVER["Waymo Driver"]
        SIM["Waymo Simulator<br/>World Model / Genie 3"]
        CRITIC["Waymo Critic<br/>Driving quality evaluation"]
        DRIVER --> SIM --> CRITIC --> DRIVER
    end
    CAM --> SFE
    LID --> SFE
    RAD --> SFE
    EAR --> SFE
    CAM --> VLM
    OBJ --> PERC
    SEM --> PERC
    PERC --> PRED --> PLAN
    PLAN --> VAL
    VAL --> FALL
    subgraph Distill["Teacher-Student Distillation"]
        direction LR
        TEACH["Large Teacher Models<br/>Max quality, cloud-trained"]
        STUD["Small Student Models<br/>Real-time onboard, <10ms"]
        TEACH -->|"Distill"| STUD
    end
    Foundation -.-> TEACH
    style Sensors fill:#1c2333,stroke:#58a6ff,stroke-width:2px,color:#e6edf3
    style Foundation fill:#1a1a2e,stroke:#bc8cff,stroke-width:2px,color:#e6edf3
    style Sys1 fill:#1c2333,stroke:#3fb950,stroke-width:2px,color:#e6edf3
    style Sys2 fill:#1c2333,stroke:#d29922,stroke-width:2px,color:#e6edf3
    style Pipeline fill:#1c2333,stroke:#58a6ff,stroke-width:2px,color:#e6edf3
    style Safety fill:#1c2333,stroke:#f85149,stroke-width:2px,color:#e6edf3
    style Virtuous fill:#1c2333,stroke:#39d2c0,stroke-width:2px,color:#e6edf3
    style Distill fill:#1c2333,stroke:#d29922,stroke-width:2px,color:#e6edf3
```

Sensor Suite & Perception

The 6th-generation Waymo Driver achieves a 42% reduction in total sensors vs. 5th-gen while improving performance, at under $20K per unit. Overlapping fields of view cover up to 500 meters in all conditions.

| Sensor | Count | Key Specs | Role in Perception |
|---|---|---|---|
| Cameras | 13 (down from 29) | 17MP imager, HDR, superior low-light | Texture, color, traffic lights, signs, lane markings |
| LiDAR | 4 (down from 5) | 360° FOV, 300m+ range, cm-scale accuracy | 3D geometry, precise depth, object detection regardless of appearance |
| Radar | 6 | Imaging radar, unprecedented resolution | Instantaneous velocity, all-weather robustness (rain/fog/snow) |
| Audio (EARs) | Array | External Audio Receivers | Emergency vehicle siren detection |
  • SWFormer (Sparse Window Transformer): Converts 3D LiDAR points into sparse voxels, processes with self-attention within spatial windows plus cross-window correlation. 73.4 L2 mAPH on Waymo Open Dataset.
  • PVTransformer: Improved point-to-voxel aggregation achieving 76.1 L2 mAPH (+1.7 over SWFormer).
  • Sensor Fusion Encoder (System 1): Rapidly fuses camera, LiDAR, and radar inputs over time into objects, semantics, and rich embeddings. Runs in real-time onboard.
  • AutoML / NAS: Neural architecture search finds optimal quality-latency tradeoffs for onboard deployment.
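The sparse-voxel front end that SWFormer-style models build on can be sketched in a few lines: only occupied voxels are materialized, which is what makes window attention over an outdoor scene tractable. The voxel size, per-voxel point cap, and function name below are illustrative choices, not values from the paper.

```python
import numpy as np

def sparse_voxelize(points, voxel_size=0.32, max_points=32):
    """Group a LiDAR point cloud of shape (N, 3) into occupied voxels only.

    Returns integer voxel coordinates and per-voxel point lists; empty
    voxels are never created, so memory and compute scale with scene
    occupancy rather than with the full 3D grid.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)        # (N, 3)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)  # occupied voxels
    voxels = [points[inverse == v][:max_points] for v in range(len(uniq))]
    return uniq, voxels

# Three points: two fall in the same voxel, one is far away.
pts = np.array([[0.1, 0.1, 0.0], [0.2, 0.15, 0.05], [10.0, 0.0, 0.0]])
coords, voxels = sparse_voxelize(pts)  # 2 occupied voxels
```

A transformer backbone would then embed each voxel's points and run self-attention within spatial windows of these sparse coordinates.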

Motion Forecasting

Given the current scene (road geometry, traffic lights, agent histories), predict the future trajectories of all agents as multimodal distributions. Models produce 6-64 trajectory hypotheses per agent with associated probabilities.

| Model | Architecture | Key Innovation | Finding |
|---|---|---|---|
| MultiPath++ | Multi-Context Gating (MCG) | Efficient fusion between agents and road elements; latent-space trajectory anchors | State-of-the-art on multiple benchmarks |
| Scene Transformer | Attention-based joint prediction | Predicts all agent trajectories jointly to capture interactions; supports conditioned prediction | Unified architecture for multi-agent forecasting |
| Wayformer | Multimodal attention family | Factorized attention + latent query attention (16x compression, no quality loss) | Early fusion, despite being simplest, achieves state-of-the-art |
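The multimodal output format these models share, K trajectory hypotheses with mode probabilities, can be manipulated with a small helper. The array shapes and the `top_modes` function are hypothetical, chosen only to illustrate the representation.

```python
import numpy as np

def top_modes(trajectories, mode_probs, k=3):
    """Keep the k most likely of K predicted trajectory hypotheses.

    trajectories: (K, T, 2) array of K candidate futures over T steps (x, y).
    mode_probs:   (K,) probabilities summing to 1.
    Returns the selected trajectories and renormalized probabilities.
    """
    order = np.argsort(mode_probs)[::-1][:k]   # indices of the k largest probs
    probs = mode_probs[order]
    return trajectories[order], probs / probs.sum()

# Toy scene: 6 hypotheses for one agent over 10 future timesteps.
rng = np.random.default_rng(0)
trajs = rng.normal(size=(6, 10, 2))
probs = np.array([0.4, 0.25, 0.15, 0.1, 0.06, 0.04])
best, best_p = top_modes(trajs, probs, k=3)
```

Downstream planning consumes the full distribution rather than a single most-likely path, which is why the mode probabilities are kept alongside the trajectories.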

Planning & Decision Making

The planning system determines the ego vehicle's trajectory given perception outputs and predicted agent trajectories. Waymo combines imitation learning with reinforcement learning, validated by an independent onboard safety layer.

  • ChauffeurNet: Deep RNN trained via imitation learning on a bird's-eye-view representation. Trained on 30M+ expert driving examples. Key innovation: synthesizing perturbations (collisions, going off-road) to make IL robust.
  • IL + RL Hybrid: RL fine-tuning on top of imitation learning achieves a 38% reduction in safety events on the most difficult scenarios. Trained on 100K+ miles of real-world urban driving data.
  • Scaling Laws: Planning performance improves as a power-law function of compute. Unlike LLMs, optimal AV models are relatively smaller in size but require significantly more data — dataset size scales roughly 1.5x faster than model size.
  • Onboard Validation: An independent validation layer verifies ML-generated trajectories before execution, with deterministic fallbacks for safety-critical situations.
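ChauffeurNet's perturbation idea can be sketched under simplifying assumptions: displace an expert trajectory laterally with a smooth bump that vanishes at both ends, so the model sees off-nominal states whose label is the recovery back onto the expert path. The sinusoidal bump below is a toy stand-in for the paper's kinematically feasible perturbations.

```python
import numpy as np

def perturb_trajectory(traj, max_offset=1.5):
    """Synthesize an off-nominal variant of an expert trajectory (n, 2).

    Applies a smooth lateral bump, zero at the first and last waypoint and
    maximal at the midpoint, so the perturbed states still rejoin the
    expert path and can serve as recovery training examples.
    """
    n = len(traj)
    t = np.linspace(0.0, 1.0, n)
    bump = np.sin(np.pi * t)              # 0 at start/end, 1 at the midpoint
    perturbed = traj.copy()
    perturbed[:, 1] += max_offset * bump  # lateral (y) displacement in meters
    return perturbed

# Straight expert path along x; the perturbed copy swerves out and back.
traj = np.stack([np.linspace(0.0, 10.0, 11), np.zeros(11)], axis=1)
shifted = perturb_trajectory(traj)
```

This addresses from the data side the same brittleness that RL fine-tuning addresses from the objective side: imitation alone never shows the model how to recover from states the expert never entered.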

EMMA: End-to-End Multimodal Model

Built on Gemini, EMMA takes raw camera images and text inputs and outputs trajectories, 3D detections, and road graph elements — all as natural language text tokens. A research model exploring the end-to-end frontier.

  • Chain-of-thought reasoning: 6.7% improvement in planning performance by letting the model reason before acting.
  • Joint training: Training on planning + perception + road graph understanding simultaneously improves all tasks vs. training individual models.
  • Current limitations: Camera-only (no LiDAR/radar), processes limited frames, computationally expensive — not yet suitable for real-time onboard inference.
  • Strategic role: Research vehicle for pushing performance boundaries; insights feed back into the modular production system.
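Since EMMA emits trajectories as natural-language text tokens, a consumer has to parse them back into coordinates. EMMA's exact output grammar is not public; the "(x, y)" comma-separated format below is a hypothetical stand-in used only to illustrate the decode step.

```python
import re

def parse_waypoints(text):
    """Extract (x, y) waypoints from a model's text output.

    Assumes a hypothetical format where waypoints appear as
    parenthesized decimal pairs, e.g. "(1.2, 0.1)".
    """
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", text)
    return [(float(x), float(y)) for x, y in pairs]

out = "Future trajectory: (0.0, 0.0), (1.2, 0.1), (2.5, 0.3)"
wps = parse_waypoints(out)  # three (x, y) waypoints
```

Representing every output as text is what lets EMMA share one decoder across planning, 3D detection, and road-graph tasks, at the cost of this extra serialization layer.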

World Model & Simulation

Built on Google DeepMind's Genie 3, the Waymo World Model generates photorealistic, controllable, multi-sensor driving scenes at scale — including scenarios impossible to capture in reality. The fleet drives ~20 million simulated miles per day.

  • Multi-sensor generation: Jointly produces temporally consistent camera imagery and LiDAR point clouds aligned with Waymo's real sensor stack — no domain gap.
  • Driving action control: Run "what if" counterfactual scenarios to test whether the Waymo Driver could have driven more confidently on alternative routes.
  • Scene layout control: Customize road geometry, traffic signal states, and other road user behavior through selective placement and layout mutations.
  • Language control: Adjust time-of-day, weather, or generate entirely synthetic scenes via natural language prompts (dawn/morning/noon/afternoon/evening/night; cloudy/foggy/rainy/snowy/sunny).
  • Long-tail generation: Synthesizes never-before-seen conditions — tornadoes, floods, elephants, T-rex costumes, car-sized tumbleweeds — by transferring world knowledge from 2D video pre-training into 3D LiDAR outputs.
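The language controls listed above (discrete time-of-day and weather vocabularies plus a free-text description) suggest a small request object. The `ScenePrompt` class and its fields are hypothetical, sketched only to show how such controls might be validated and composed; this is not Waymo's API.

```python
from dataclasses import dataclass

# Control vocabularies taken from the options listed above.
TIMES = {"dawn", "morning", "noon", "afternoon", "evening", "night"}
WEATHER = {"cloudy", "foggy", "rainy", "snowy", "sunny"}

@dataclass
class ScenePrompt:
    """Hypothetical request object for a language-controlled scene."""
    description: str
    time_of_day: str = "noon"
    weather: str = "sunny"

    def __post_init__(self):
        # Reject values outside the supported control vocabularies.
        if self.time_of_day not in TIMES:
            raise ValueError(f"unknown time_of_day: {self.time_of_day}")
        if self.weather not in WEATHER:
            raise ValueError(f"unknown weather: {self.weather}")

    def to_prompt(self):
        """Flatten the structured controls into a natural-language prompt."""
        return f"{self.description}, at {self.time_of_day}, {self.weather} weather"

p = ScenePrompt("tumbleweeds crossing a two-lane highway", "dawn", "foggy")
```

Structured controls like these make long-tail generation repeatable: the same prompt can be re-run against new Driver versions as a regression scenario.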

Scale & Infrastructure

| Metric | Value |
|---|---|
| Real-world autonomous miles | 200M+ fully autonomous on public roads |
| Simulation miles | 15B+ total; ~20M miles/day |
| Driving data corpus | 500,000+ hours of driving data |
| Data rate per vehicle | Up to 1.8 TB/hour |
| Onboard inference latency | <10ms for most neural nets |
| Sensor suite cost (6th-gen) | <$20K (50%+ reduction from 5th-gen) |
| Safety record (56.7M miles) | 84% fewer airbag-deployment crashes, 73% fewer injury crashes vs. human drivers |
| Paid rides per week | ~400K (targeting 1M by end of 2026) |

Key Design Decisions

| Decision | Chosen Approach | Alternative | Rationale |
|---|---|---|---|
| Architecture | Modular-hybrid: Foundation Model trained end-to-end, deployed in a modular structure with safety layers | Monolithic end-to-end (Tesla FSD) | "A monolithic architecture makes it easy to get started, but is wildly inadequate for safe, at-scale full autonomy" — Dmitri Dolgov. Modular allows independent testing, debugging, and safety certification per component. |
| Sensor suite | LiDAR + cameras + radar + audio (multi-sensor fusion) | Camera-only (Tesla) | LiDAR provides cm-scale 3D accuracy; radar sees through rain/fog/snow; redundant sensing is critical for safety. Cost reduced to <$20K (6th-gen). Disengagement rate of 0.0004/mile, well below camera-only systems. |
| Planning training | Imitation learning + RL fine-tuning | Pure imitation learning | IL alone is brittle in edge cases. RL fine-tuning on safety-critical scenarios yields a 38% reduction in safety events on the hardest scenarios. |
| Onboard deployment | Teacher-student distillation with NAS | Deploy full-size models on more powerful hardware | Distillation achieves large-model quality at small-model latency (<10ms). NAS finds Pareto-optimal quality-latency architectures. Distilled students also show better scaling behavior. |
| Long-tail handling | Generative world model (Genie 3) + VLM reasoning (System 2) | Collect more real-world data | Rare events (<0.03% frequency) are impractical to capture at scale. The World Model synthesizes arbitrary scenarios; the VLM transfers internet-scale world knowledge to understand novel objects/situations. |
| Simulation fidelity | Generative world model producing real sensor outputs (camera + LiDAR) | Physics-based simulation (Carcraft) | The generative model eliminates the sim-to-real domain gap — downstream autonomy systems consume simulation like real sensor logs. Language-controllable for rapid scenario iteration. |

Interview Talking Points

  • Modular-hybrid beats pure end-to-end for safety-critical systems: Waymo trains a Foundation Model end-to-end for representation quality, but deploys it in a modular structure with an independent validation layer. This enables per-component testing, failure attribution, and deterministic fallbacks — essential for a system where errors cost lives, not just user experience.
  • Multi-sensor fusion provides irreplaceable redundancy: LiDAR gives cm-scale depth that monocular cameras cannot match; radar provides instantaneous velocity and sees through weather; cameras capture texture and semantics. The 6th-gen system proves cost is solvable (<$20K, 42% fewer sensors) — the safety margin is not. Disengagement rate of 0.0004/mile speaks for itself.
  • Teacher-student distillation is the deployment strategy: Train the largest, most accurate teacher models in the cloud, then distill to small students that run onboard in <10ms. This decouples model quality from inference latency — you can keep scaling teachers without redesigning the onboard stack. NAS further optimizes the quality-latency Pareto front.
  • Scaling laws for AV differ from LLMs: Waymo showed planning performance follows power-law scaling, but optimal AV models are relatively smaller and need much more data than LLMs. Dataset size scales roughly 1.5x faster than model size. This insight shapes compute allocation: invest more in data diversity and coverage than in raw model parameters.
  • World models solve the long-tail simulation problem: Rare events (<0.03% frequency) are the hardest to handle but most dangerous. The Waymo World Model (built on Genie 3) generates photorealistic multi-sensor scenarios — including physically impossible conditions (snow on Golden Gate Bridge, elephants on highways) — with zero sim-to-real domain gap because it outputs real sensor formats.
  • The virtuous cycle is the competitive moat: Driver, Simulator, and Critic share the same Foundation Model. Real-world driving feeds data mining, which feeds simulation, which feeds training, which improves the driver. 200M+ real miles and 15B+ sim miles create a flywheel competitors cannot easily replicate.
  • RL on top of IL is the planning breakthrough: Pure imitation learning is brittle at the edges — the model has never seen the expert recover from near-collisions it was never in. RL fine-tuning on safety-critical scenarios yields a 38% reduction in safety events. ChauffeurNet's innovation of synthesizing perturbations during IL training addresses the same insight from a data perspective.
  • EMMA shows where end-to-end is heading: Building on Gemini, EMMA achieves 6.7% planning improvement through chain-of-thought reasoning and proves joint training helps all tasks. But it's camera-only, processes few frames, and is too expensive for real-time — showing why the modular-hybrid approach remains necessary for production while E2E research advances.