Survey: Motion Planning & End-to-End VLM-Based Driving

Quick Survey: Motion Planning, Control, and End-to-End VLM-Based Reasoning for Autonomous Driving

Prepared for Waymo Visual Reasoning team interview with Wei-Chih Hung. Last updated: March 2026


Overview

Autonomous driving is undergoing a paradigm shift from modular pipelines (perception -> prediction -> planning -> control) toward end-to-end learned systems that map sensor inputs directly to driving actions. This shift is accelerated by the emergence of Vision-Language Models (VLMs) and Vision-Language-Action (VLA) architectures that unify visual perception, natural language reasoning, and trajectory generation within a single framework. The promise is twofold: better generalization to long-tail scenarios through pre-trained world knowledge, and improved interpretability through chain-of-thought reasoning expressed in natural language.

The field has evolved through several phases: (1) classical trajectory optimization and rule-based planners (pre-2018); (2) imitation learning from human demonstrations (ChauffeurNet, 2018); (3) modular end-to-end models with differentiable intermediate representations (UniAD, VAD, 2023); (4) LLM/VLM-augmented driving systems (GPT-Driver, DriveVLM, 2023-2024); and (5) fully end-to-end multimodal models that represent all outputs as language tokens (EMMA, 2024). Each phase did not replace the previous one – rather, the field maintains active research across all paradigms, with the frontier now focused on scaling VLA models, closing the sim-to-real gap, and establishing reliable closed-loop evaluation.

Waymo has been a consistent contributor across this entire trajectory, from ChauffeurNet and MultiPath to MotionLM and EMMA. The Visual Reasoning team, led by researchers including Wei-Chih Hung, sits at the intersection of perception, scene understanding, and end-to-end planning – making EMMA a natural convergence point of their research directions in open-vocabulary panoptic segmentation (ECCV 2024) and VLM-based driving.


Timeline & Evolution

Year Paper/System Key Innovation Venue
2018 ChauffeurNet (Waymo) Imitation learning with synthesized perturbations for robust driving RSS 2019
2019 MultiPath (Waymo) Anchor-based multi-modal trajectory prediction with GMMs CoRL 2019
2021 MultiPath++ (Waymo) Efficient polyline encoding + trajectory aggregation ICRA 2022
2021 nuPlan (Motional) First closed-loop ML planning benchmark arXiv
2023 UniAD (SenseTime/OpenDriveLab) Unified perception-prediction-planning with query-based transformers CVPR 2023 Best Paper
2023 VAD Vectorized scene representation for efficient end-to-end planning ICCV 2023
2023 GameFormer (NTU) Game-theoretic interactive prediction + planning ICCV 2023
2023 MotionLM (Waymo) Multi-agent motion forecasting as language modeling ICCV 2023
2023 GPT-Driver Motion planning reformulated as LLM language generation arXiv
2023 GAIA-1 (Wayve) Generative world model for driving video synthesis arXiv
2024 DriveVLM (Tsinghua) VLM with CoT for scene understanding + hierarchical planning arXiv
2024 LMDrive Closed-loop LLM driving with language instructions CVPR 2024
2024 DriveLM (OpenDriveLab) Graph VQA for structured driving reasoning ECCV 2024 Oral
2024 VADv2 Probabilistic planning, closed-loop CARLA SOTA arXiv
2024 DTPP (NTU/NVIDIA) Differentiable joint conditional prediction + cost evaluation ICRA 2024
2024 EMMA (Waymo) End-to-end multimodal model: all outputs as text via Gemini TMLR
2024 Tesla FSD v12 Full end-to-end neural net replacing 300K lines of C++ Production
2025 S4-Driver (Waymo/UC Berkeley) Self-supervised E2E driving MLLM with no human annotations; sparse volume 3D lifting CVPR 2025
2025 VLA Survey papers Systematization of VLA4AD into end-to-end vs dual-system paradigms ICCV 2025 Workshop
2025 Scaling Laws for Driving (Waymo) First empirical scaling laws for joint motion forecasting and planning arXiv
2025 WOD-E2E (Waymo) Long-tail E2E driving benchmark with Rater Feedback Score metric arXiv
2025 Waymo Foundation Model “Think Fast / Think Slow” dual-system production architecture with Driving VLM Waymo Blog
2026 FROST-Drive (Dong et al.) Frozen VLM encoder + adapter for E2E driving; optimizes for RFS on WOD-E2E WACV 2026 Workshop
2026 Waymo World Model Genie 3-based photorealistic 3D simulation Waymo Blog

1. EMMA Deep-Dive

EMMA: End-to-End Multimodal Model for Autonomous Driving. Hwang, Xu, Lin, Hung, Ji, Choi, Huang, He, Covington, Sapp, Zhou, Guo, Anguelov, Tan (Waymo, 2024) | arXiv:2410.23262 | TMLR

Architecture

Component Detail
Backbone Gemini 1.0 Nano-1 (smallest Gemini variant)
Input Raw multi-camera images (up to 4 frames) + text prompts
Output Natural language text encoding trajectories, 3D detections, road graphs
Training End-to-end fine-tuning of the pre-trained MLLM on driving tasks

Key Design Decisions

Text-based output representation. All waypoint coordinates are represented as plain text floating-point numbers (not specialized tokens). Future trajectories are expressed as waypoint sets in BEV space: O_trajectory = {(x_t, y_t)} for t=1..T_f. This allows the model to leverage the pre-trained language model’s numerical reasoning capabilities without custom tokenizers.
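As a rough illustration (not Waymo's code), serializing BEV waypoints as ordinary floating-point text and parsing them back might look like the sketch below; the exact formatting EMMA uses is not specified here, so the separators and precision are assumptions:

```python
# Sketch: representing BEV waypoints as plain text (as EMMA does conceptually)
# and parsing them back. Format details here are illustrative assumptions.

def waypoints_to_text(waypoints, precision=2):
    """Serialize [(x, y), ...] BEV waypoints into a plain-text string."""
    return "; ".join(f"({x:.{precision}f}, {y:.{precision}f})" for x, y in waypoints)

def text_to_waypoints(text):
    """Parse the textual waypoint list back into float tuples."""
    pairs = [p.strip(" ()") for p in text.split(";")]
    return [tuple(float(v) for v in p.split(",")) for p in pairs]

traj = [(1.0, 0.05), (2.1, 0.12), (3.3, 0.21)]
encoded = waypoints_to_text(traj)   # "(1.00, 0.05); (2.10, 0.12); (3.30, 0.21)"
decoded = text_to_waypoints(encoded)
```

Note the design tension this makes visible: the round trip is exact only while the chosen decimal precision covers the data, which is the precision-loss concern raised in the interview questions below.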

Task-specific prompts. The same model handles multiple tasks by switching prompts: end-to-end planning, 3D object detection, and road graph estimation all share a single set of weights.

Chain-of-thought reasoning. EMMA generates a structured reasoning chain before outputting trajectories: a scene description, a list of critical objects, behavior descriptions for those objects, and a meta driving decision.

This CoT approach improves planning performance by 6.7% over the baseline without reasoning.

Multi-task co-training. Joint training across planning, detection, and road graph tasks yields improvements in all three domains – a key finding supporting the unified architecture thesis.

Quantitative Results

Benchmark Metric EMMA EMMA+ Previous SOTA
nuScenes Planning Avg L2 (m) 0.32 0.29 0.39 (DriveVLM-Dual)
WOMD ADE@1s (m) 0.030
WOMD ADE@5s (m) 0.610
WOD 3D Detection Vehicle Precision +16.3% relative

EMMA+ uses additional internal pre-training data
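For reference, the L2 and ADE numbers above are averages of Euclidean waypoint errors between predicted and ground-truth trajectories (exact horizon and averaging conventions vary by benchmark). A minimal sketch of ADE over a fixed step horizon:

```python
# Minimal ADE (Average Displacement Error) sketch: mean Euclidean distance
# between predicted and ground-truth waypoints over the first `horizon` steps.
# Benchmark-specific details (sampling rate, horizon averaging) are omitted.
import math

def ade(pred, gt, horizon=None):
    """pred, gt: lists of (x, y) waypoints at matching timestamps."""
    n = horizon if horizon is not None else len(pred)
    dists = [math.dist(p, g) for p, g in zip(pred[:n], gt[:n])]
    return sum(dists) / len(dists)

pred = [(1.0, 0.0), (2.0, 0.1), (3.0, 0.3)]
gt   = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
avg_err = ade(pred, gt)  # mean of per-step errors 0.0, 0.1, 0.3
```

This is why "ADE@1s" and "ADE@5s" differ so sharply in the table: errors grow with horizon as small heading mistakes integrate into large lateral offsets.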

Key Limitations (from the paper)

  1. Limited temporal context: Only processes up to 4 frames; cannot capture long-term dependencies
  2. No 3D sensor fusion: Cannot integrate LiDAR/radar due to MLLM architecture constraints
  3. Consistency gaps: No guarantee that planning and perception outputs are mutually consistent
  4. Expensive closed-loop eval: Sensor simulation costs several times more than behavior simulation
  5. Deployment latency: Large model requires distillation or optimization for real-time inference

Why EMMA Matters

EMMA represents a bet that foundation model pre-training (via Gemini) provides enough world knowledge to compensate for limited driving-specific training data, and that natural language as a universal interface can unify the fragmented autonomous driving stack. If the approach scales with larger models and more data, it could fundamentally change how AV systems are built.


2. End-to-End Autonomous Driving Models

ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

Bansal et al. (Waymo, 2018) | arXiv:1812.03079 | RSS 2019

MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction

Chai et al. (Waymo, 2019) | arXiv:1910.05449 | CoRL 2019

UniAD: Planning-Oriented Autonomous Driving

Hu et al. (SenseTime / Shanghai AI Lab, 2023) | arXiv:2212.10156 | CVPR 2023 Best Paper

VAD / VADv2: Vectorized Autonomous Driving

Jiang et al. (HUST, 2023-2024) | arXiv:2303.12077 (ICCV 2023) | arXiv:2402.13243 (VADv2)

Tesla FSD v12-v13 (Production, 2024-2025)

S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

Xie, Xu, He, Hwang, Luo, Ji, Lin, Chen, Lu, Leng, Anguelov, Tan (UC Berkeley, Waymo, Cornell, Georgia Tech, 2025) | arXiv:2505.24139 | CVPR 2025

FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder

Dong, Zhu, Wu, Sun (2026) | arXiv:2601.03460 | WACV 2026 LLVM-AD Workshop


3. VLMs/LLMs for Driving Reasoning

GPT-Driver: Learning to Drive with GPT

Mao et al. (2023) | arXiv:2310.01415

DriveVLM: Convergence of Autonomous Driving and Large VLMs

Tian et al. (Tsinghua / BYD, 2024) | arXiv:2402.12289

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Shao et al. (Shanghai AI Lab, 2024) | arXiv:2312.07488 | CVPR 2024

DriveLM: Driving with Graph Visual Question Answering

Sima et al. (OpenDriveLab, 2024) | arXiv:2312.14150 | ECCV 2024 Oral


4. Classical vs. Learned Motion Planning

Approach Method Strengths Weaknesses
Trajectory Optimization Minimize cost function over trajectory space (comfort, safety, progress) subject to vehicle dynamics constraints Interpretable, safety guarantees, handles constraints explicitly Requires hand-designed cost functions; struggles with complex multi-agent interactions
Sampling-based Generate candidate trajectories, evaluate and select (e.g., lattice planners, RRT variants) Handles non-convex constraints; Waymo’s production system uses elements of this Combinatorial explosion; quality depends on sampling strategy
Imitation Learning Learn policy from expert demonstrations (behavioral cloning, DAgger, ChauffeurNet) Learns complex behaviors from data; no reward engineering Distribution shift; causal confusion; struggles with rare events
Reinforcement Learning Learn policy by maximizing reward in simulation (PPO, SAC applied to driving) Can discover novel strategies; handles multi-agent interaction Reward shaping is difficult; sim-to-real gap; safety during training
End-to-End Learned Map sensors directly to trajectories/controls (UniAD, VAD, EMMA) Jointly optimized; no information bottleneck between modules Black-box; harder to verify safety; requires massive data
VLM/VLA-based Use pre-trained language models for reasoning + planning (EMMA, DriveVLM) World knowledge transfer; interpretable reasoning; instruction following High latency; limited spatial precision; hallucination risk
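The "Sampling-based" row can be made concrete with a toy sketch: generate a handful of constant-curvature rollouts, score each with a hand-designed cost (progress to goal plus obstacle clearance), and keep the cheapest. Everything below (the unicycle dynamics, weights, and obstacle position) is illustrative, not any production planner's logic:

```python
# Toy sampling-based planner: sample candidate trajectories, evaluate a
# hand-designed cost, select the minimum. All parameters are illustrative.
import math

def rollout(curvature, speed=5.0, dt=0.5, steps=6):
    """Unicycle rollout at fixed speed and curvature -> list of (x, y)."""
    x = y = heading = 0.0
    pts = []
    for _ in range(steps):
        x += speed * dt * math.cos(heading)
        y += speed * dt * math.sin(heading)
        heading += curvature * speed * dt
        pts.append((x, y))
    return pts

def cost(traj, goal=(15.0, 0.0), obstacle=(7.5, 0.2)):
    """Progress-to-goal term plus a soft obstacle-clearance penalty."""
    progress = math.dist(traj[-1], goal)                   # closer endpoint is better
    clearance = min(math.dist(p, obstacle) for p in traj)  # nearest approach to obstacle
    return progress + 5.0 * max(0.0, 1.0 - clearance)      # penalize clearance < 1 m

candidates = [rollout(c) for c in (-0.08, -0.04, 0.0, 0.04, 0.08)]
best = min(candidates, key=cost)  # gentle swerve away from the obstacle wins
```

The table's "combinatorial explosion" weakness shows up immediately: five curvature samples only explore five behaviors, and richer maneuvers require sampling over speed profiles and curvature sequences as well.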

Key Insight for the Interview

Waymo’s trajectory shows a deliberate evolution: ChauffeurNet (pure IL) -> MultiPath (learned prediction, classical planning) -> MotionLM (language modeling for prediction) -> EMMA (language modeling for everything). Each step incorporates more learning while the production system likely maintains safety-critical classical components as fallbacks.


5. Joint Prediction + Planning Models

MotionLM: Multi-Agent Motion Forecasting as Language Modeling

Seff et al. (Waymo, 2023) | arXiv:2309.16534 | ICCV 2023

GameFormer: Game-Theoretic Interactive Prediction and Planning

Huang et al. (NTU, 2023) | arXiv:2303.05760 | ICCV 2023

DTPP: Differentiable Joint Conditional Prediction and Cost Evaluation

Huang et al. (NTU / NVIDIA, 2024) | arXiv:2310.05885 | ICRA 2024

Scaling Laws of Motion Forecasting and Planning

Baniodeh, Goel, Ettinger, Fuertes, Seff, Shen, Gulino, et al. (Waymo, 2025) | arXiv:2506.08228


6. Safety and Interpretability

How VLM-Based Approaches Improve Explainability

Aspect Traditional E2E (UniAD, VAD) VLM-Based (EMMA, DriveVLM)
Decision transparency Intermediate representations (BEV, heatmaps) provide some insight but require expert interpretation Natural language reasoning chains explain why a decision was made in human-readable form
Failure analysis Requires probing internal activations Can inspect the textual CoT to identify reasoning errors
Human communication Cannot naturally explain behavior to passengers or operators Can generate explanations: “Slowing down because a pedestrian is stepping into the crosswalk”
Instruction following Fixed behavior policy Can accept and act on natural language instructions
Regulatory compliance Difficult to audit internal decision process Text-based reasoning provides audit trail

Key Challenges

EMMA’s Specific Approach to Interpretability

EMMA’s four-stage CoT (R1-R4) provides structured interpretability:

  1. R1 (scene description) shows what the model perceives
  2. R2 (critical objects) shows what the model attends to
  3. R3 (behavior descriptions) shows the model’s predictions of others
  4. R4 (meta driving decision) shows the chosen action category

This improves planning by 6.7% while providing an inspection point at each reasoning stage.
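A hypothetical sketch of what an R1-R4 response might look like as structured text follows. The actual prompt and response wording EMMA uses is not reproduced here; only the four-stage structure comes from the paper, and every string in this example is invented:

```python
# Hypothetical template mirroring EMMA's four CoT stages (R1-R4).
# Wording and field contents are invented for illustration only.
COT_TEMPLATE = """\
R1 (scene description): {scene}
R2 (critical objects): {objects}
R3 (behavior descriptions): {behaviors}
R4 (meta driving decision): {decision}
Trajectory: {waypoints}"""

example = COT_TEMPLATE.format(
    scene="Two-lane urban road, light rain, moderate traffic.",
    objects="Pedestrian at crosswalk ahead; lead vehicle braking.",
    behaviors="Pedestrian likely to enter crosswalk; lead vehicle decelerating.",
    decision="Decelerate and prepare to stop.",
    waypoints="(1.80, 0.00); (3.40, 0.01); (4.70, 0.01)",
)
```

The interpretability claim is that each labeled stage is a separately inspectable string: a wrong trajectory can be traced to a missed object in R2 or a bad prediction in R3 rather than to an opaque activation.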

WOD-E2E: Waymo Open Dataset for End-to-End Driving

Xu, Lin, Jeon, Feng, Zou, Sun, Gorman, et al. (Waymo, 2025) | arXiv:2510.26125


7. Key Waymo Research Contributions

Paper Year Contribution arXiv
ChauffeurNet 2018 IL for urban driving with synthesized perturbations; first major learned planner at Waymo 1812.03079
MultiPath 2019 Anchor-based multi-modal trajectory prediction using GMMs 1910.05449
MultiPath++ 2021 Efficient polyline scene encoding + trajectory aggregation 2111.14973
Waymo Open Dataset 2019 One of the largest AV datasets; used by 36K+ researchers worldwide 1912.04838
WOMD 2021 Waymo Open Motion Dataset for behavior prediction benchmarking 2104.10133
LET-3D-AP 2022 Longitudinal error tolerant 3D detection metric 2206.07705
MotionLM 2023 Motion forecasting as language modeling (discrete tokens, autoregressive) 2309.16534
SceneDiffuser 2024 Diffusion-based scene initialization + rollout for traffic simulation Waymo Research
3D OV Panoptic Seg (Hung et al.) 2024 Open-vocabulary 3D panoptic segmentation for driving 2401.02402
EMMA 2024 End-to-end multimodal model: Gemini backbone, text-based output 2410.23262
WOMD-Reasoning 2024 3M Q&A pairs for map recognition, motion narratives, interaction reasoning Waymo Open Dataset
S4-Driver 2025 Self-supervised E2E driving MLLM; no human annotations; sparse volume 3D lifting 2505.24139
Scaling Laws for Driving 2025 First empirical scaling laws for joint motion forecasting and planning 2506.08228
WOD-E2E 2025 Long-tail E2E driving benchmark with Rater Feedback Score metric 2510.26125
Waymo Foundation Model 2025 “Think Fast / Think Slow” dual-system production architecture with Driving VLM Waymo Blog
Waymo World Model 2026 Genie 3-based photorealistic 3D simulation for rare event testing Waymo Blog

Wei-Chih Hung’s Research Trajectory at Waymo

Wei-Chih Hung’s work traces a clear path toward EMMA:

  1. Semi-supervised segmentation (BMVC 2018) – learning from limited labels
  2. SCOPS: Self-supervised co-part segmentation (CVPR 2019) – unsupervised part discovery
  3. LET-3D-AP metrics (2022) – improving 3D detection evaluation for AV
  4. 3D Open-Vocabulary Panoptic Segmentation (ECCV 2024) – open-vocab understanding using VLMs
  5. EMMA (2024) – unifying perception + planning via VLMs

The through-line is: using large pre-trained models (CLIP, Gemini) to improve generalization in autonomous driving perception and planning, especially for open-world / long-tail scenarios.

Waymo Foundation Model: Demonstrably Safe AI (Blog, December 2025)


Active Research Frontiers

Problem Current State Key Challenge
Closed-loop evaluation nuPlan provides the first closed-loop ML planning benchmark; CARLA widely used but unrealistic Real-world closed-loop testing is expensive; sim-to-real gap remains large
Scalability EMMA uses Gemini Nano (smallest); Tesla uses 35K+ H100s How to scale VLM-based planners to real-time on vehicle hardware?
Sim-to-real transfer World models (GAIA-2, Waymo World Model) generate photorealistic scenarios Generated scenarios may not cover the true distribution of rare events
Multi-sensor fusion in VLMs EMMA is camera-only; cannot integrate LiDAR VLM architectures not designed for 3D point cloud inputs
Consistency guarantees No current method guarantees consistent perception + planning outputs Formal verification of neural networks remains intractable at scale
Regulatory frameworks EU AI Act, NHTSA guidelines emerging How to certify a system whose reasoning is a neural network?
Long-tail scenarios WOD-E2E dataset targets <0.03% frequency events Requires either massive data or effective simulation of rare events
Model distillation Active research area Compress large VLMs to deploy on vehicle hardware without losing capability
Emerging Trends

  1. VLA (Vision-Language-Action) unification: Two paradigms are crystallizing – (a) End-to-End VLA integrating everything in one model, and (b) Dual-System VLA with slow VLM reasoning + a fast reactive controller. EMMA exemplifies (a); DriveVLM-Dual exemplifies (b).

  2. World models for simulation: Waymo’s Genie 3-based world model and Wayve’s GAIA-2 can generate photorealistic, interactive driving scenarios including rare events (tornadoes, animals). These could transform closed-loop evaluation.

  3. Instruction-following driving: LMDrive, DriveLM show that natural language can serve as the interface between human intent and vehicle behavior. This has implications for ride-hailing UX.

  4. Tokenization of everything: MotionLM showed trajectories can be discretized into tokens; EMMA showed all outputs can be text. The trend is toward universal tokenization of driving primitives.

  5. Scaling laws for driving: Does more pre-training data + larger models reliably improve driving performance? EMMA’s use of Gemini Nano suggests Waymo is exploring this axis; EMMA+ (with more data) shows consistent gains.
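Trend 4 (tokenization) can be illustrated with a minimal uniform-binning quantizer for per-step motion deltas. MotionLM's actual scheme (its delta parameterization and vocabulary size) differs; the bin range and count below are made up for illustration:

```python
# Toy motion tokenizer: quantize continuous per-axis displacements into a
# small discrete vocabulary, MotionLM-style. Bin range/count are invented.
DELTA_MIN, DELTA_MAX, NUM_BINS = -2.0, 2.0, 128
BIN_WIDTH = (DELTA_MAX - DELTA_MIN) / NUM_BINS  # 0.03125 m per bin

def delta_to_token(delta):
    """Map a continuous displacement (m) to a discrete token id."""
    clipped = max(DELTA_MIN, min(DELTA_MAX, delta))
    return min(NUM_BINS - 1, int((clipped - DELTA_MIN) / BIN_WIDTH))

def token_to_delta(token):
    """Decode a token id back to its bin-center displacement (m)."""
    return DELTA_MIN + (token + 0.5) * BIN_WIDTH

token = delta_to_token(0.73)
recon = token_to_delta(token)  # round-trip error bounded by half a bin width
```

The tradeoff this exposes is the same one EMMA's text representation sidesteps: a fixed vocabulary caps precision at half the bin width, while plain-text floats keep precision but force the model to compose digits.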


Key Concepts & Terminology

Term Definition
BEV (Bird’s Eye View) Top-down 2D representation of the 3D scene, commonly used as the intermediate representation in E2E driving models
End-to-End (E2E) Systems that learn the full pipeline from raw sensors to control outputs, without hand-designed intermediate modules
VLA (Vision-Language-Action) Models that unify visual perception, language reasoning, and action generation
Chain-of-Thought (CoT) Technique where the model generates intermediate reasoning steps before producing a final answer
Open-loop evaluation Testing model outputs against recorded ground truth without simulating the effect of the model’s actions on the environment
Closed-loop evaluation Testing where the model’s actions affect the simulated environment, enabling interaction with other agents
Imitation Learning (IL) Learning a policy by mimicking expert demonstrations (e.g., human driving)
Behavioral Cloning (BC) Simplest form of IL: supervised learning on (state, action) pairs from expert data
Causal confusion When the model learns spurious correlations (e.g., brake lights -> slow down) instead of true causal relationships
Distribution shift Gap between training data distribution and deployment distribution, particularly problematic for IL
Motion tokens Discrete representation of continuous trajectory segments, used in MotionLM and related work
Anchor trajectories Pre-defined trajectory templates used to initialize multi-modal prediction (MultiPath)
nuScenes Large-scale AV dataset from Motional with 1000 scenes, widely used for perception and planning benchmarks
WOMD Waymo Open Motion Dataset, focused on behavior prediction with 100K+ scenes
nuPlan Closed-loop planning benchmark with 1500 hours of driving data from 4 cities
Dual-system architecture Inspired by dual-process theory: fast reactive system + slow deliberative system operating in parallel

Recommended Reading Order

For maximum understanding, read in this sequence:

Phase 1: Foundations (start here)

  1. ChauffeurNet (2018) – understand IL for driving and its limitations arXiv:1812.03079

  2. MultiPath (2019) – anchor-based multi-modal prediction arXiv:1910.05449

Phase 2: End-to-End Revolution

  1. UniAD (2023) – the CVPR Best Paper that defined E2E driving arXiv:2212.10156

  2. VAD (2023) – vectorized alternative, more efficient arXiv:2303.12077

Phase 3: Language Meets Driving

  1. MotionLM (2023) – Waymo’s bridge from prediction to language modeling arXiv:2309.16534

  2. GPT-Driver (2023) – first LLM-as-planner proof-of-concept arXiv:2310.01415

Phase 4: VLM-Based Driving Systems

  1. DriveVLM (2024) – CoT reasoning for driving + practical dual-system design arXiv:2402.12289

  2. DriveLM (2024) – structured Graph VQA for driving reasoning arXiv:2312.14150

  3. LMDrive (2024) – instruction-following closed-loop driving arXiv:2312.07488

Phase 5: EMMA and Beyond (read most carefully)

  1. EMMA (2024) – the paper your interviewer co-authored; know this cold arXiv:2410.23262

  2. S4-Driver (2025) – EMMA’s successor direction; self-supervised E2E driving without annotations arXiv:2505.24139

  3. Scaling Laws for Driving (2025) – validates scaling for motion forecasting/planning; compute allocation guidance arXiv:2506.08228

  4. WOD-E2E (2025) – long-tail benchmark that addresses EMMA’s evaluation limitations arXiv:2510.26125

  5. Waymo Foundation Model Blog (2025) – how EMMA’s ideas evolved into Waymo’s production architecture Waymo Blog, December 2025

  6. VLA4AD Survey (2025) – systematic overview of the field EMMA sits in arXiv:2512.16760

Bonus: Wei-Chih Hung’s Other Work

  1. 3D Open-Vocabulary Panoptic Segmentation (ECCV 2024) arXiv:2401.02402

Interview Preparation Notes

Questions You Should Be Ready to Discuss

  1. “How does EMMA compare to UniAD?” UniAD uses specialized transformer modules with intermediate BEV representations; EMMA unifies everything through a pre-trained MLLM with text outputs. EMMA trades architectural inductive bias for pre-trained world knowledge. UniAD may be more data-efficient for driving-specific tasks; EMMA may generalize better to novel scenarios.

  2. “What are the limitations of representing trajectories as text?” Precision loss from tokenization; no explicit geometric constraints; no guarantee of physically feasible trajectories; higher inference latency than direct regression. EMMA addresses precision with floating-point text representation but cannot enforce kinematic constraints.

  3. “How would you improve EMMA?” Potential directions: longer temporal context (more than 4 frames); multi-sensor fusion (LiDAR integration); consistency losses between perception and planning outputs; model distillation for deployment; reinforcement learning fine-tuning for closed-loop improvement.

  4. “Why use Gemini Nano instead of a larger model?” Likely latency constraints for real-time driving. An interesting research question is whether scaling to larger Gemini variants yields proportional gains, or whether driving-specific fine-tuning matters more than model size.

  5. “How do you evaluate E2E driving models fairly?” Open-loop (L2 on nuScenes) is insufficient – it cannot capture compounding errors or interaction effects. Closed-loop (CARLA, nuPlan) is better but sim-to-real gap is significant. Real-world closed-loop testing is gold standard but expensive. WOMD Sim Agents challenge is Waymo’s attempt at scalable closed-loop eval.
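The compounding-error point in Q5 is easy to demonstrate numerically: a policy with a tiny constant bias looks fine under open-loop evaluation (each step scored from the ground-truth state) but drifts without bound in closed loop (each step acts from its own previous output). A minimal sketch, with all numbers invented:

```python
# Open-loop vs closed-loop evaluation of a policy with a small constant
# lateral bias. Values are illustrative; ground truth is "drive straight".
BIAS = 0.05          # constant lateral error per step (illustrative)
STEPS = 20
gt = [0.0] * STEPS   # ground-truth lateral offset at every step

# Open-loop: each prediction is made from the ground-truth state,
# so the per-step error never accumulates.
open_loop_errors = [abs((g + BIAS) - g) for g in gt]

# Closed-loop: each prediction becomes the state for the next step,
# so the same bias integrates into a large drift.
state, closed_loop_errors = 0.0, []
for g in gt:
    state += BIAS
    closed_loop_errors.append(abs(state - g))

max_open = max(open_loop_errors)       # stays at 0.05 m: looks harmless
max_closed = max(closed_loop_errors)   # grows to ~1.0 m: out of lane
```

This is the core argument for nuPlan/CARLA-style closed-loop benchmarks over open-loop L2: the two metrics can rank the same policy very differently.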


Video Resources

Topic Video Channel Link
E2E AD Tutorial End-to-end Autonomous Driving: Past, Current and Onwards OpenDriveLab https://youtu.be/Z4n1vlAYqRw
E2E AD Misconceptions Common Misconceptions in Autonomous Driving (Andreas Geiger) WAD at CVPR https://www.youtube.com/watch?v=x_42Fji1Z2M
DriveVLM DriveVLM Demo Video MARS Lab https://www.youtube.com/watch?v=mt-SdHTTZzA
Motion Planning Autonomous Driving: The Way Forward (Vladlen Koltun) WAD at CVPR https://youtu.be/rj7A2OP7KO4
Motion Forecasting Boris Ivanovic — CVPR 2025 OpenDriveLab Tutorial OpenDriveLab https://youtu.be/EWfdgvSd5b0
Tesla FSD AI for Full Self-Driving (Andrej Karpathy, CVPR 2021) WAD at CVPR https://www.youtube.com/watch?v=g6bOwQdCJrc
Tesla FSD Foundation Models for Autonomy (Ashok Elluswamy, CVPR 2023) WAD at CVPR https://www.youtube.com/watch?v=6x-Xb_uT7ts
Imitation Learning Feedback in Imitation Learning (ICML 2020 Workshop) ICML https://www.youtube.com/watch?v=4VAwdCIBTG8
Raquel Urtasun Keynote Self-Driving Keynote (CVPR 2021 WAD) WAD at CVPR https://youtu.be/PSZ2Px9PrHg

Survey compiled from web research. All paper details verified through arXiv and published sources. Where exact details could not be confirmed, this is noted explicitly.