Survey: Segmentation & Scene Understanding for Autonomous Driving

Prepared for: Interview with Wei-Chih Hung, Waymo Visual Reasoning Team
Seed paper: EMMA: End-to-End Multimodal Model for Autonomous Driving (arXiv:2410.23262)
Date: March 2026


Overview

Segmentation and scene understanding for autonomous driving have evolved from standalone per-frame semantic labeling to unified, multi-task, multi-modal systems that jointly reason about perception, prediction, and planning. The core problem: given raw sensor inputs (cameras, LiDAR, radar), produce a dense, structured understanding of the 3D driving scene – classifying every pixel/point into semantic categories, distinguishing individual object instances, tracking them over time, and ideally doing so for open-vocabulary (previously unseen) categories.
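As a concrete picture of what "dense, structured understanding" looks like on disk, here is a minimal sketch of the common panoptic label packing (category_id × label_divisor + instance_id) used by Cityscapes/COCO-style panoptic tooling; the divisor value and class ids below are illustrative, not tied to any specific dataset.

```python
import numpy as np

# Minimal sketch of how dense panoptic outputs are commonly encoded,
# assuming the "category_id * label_divisor + instance_id" packing used by
# Cityscapes/COCO-style panoptic tooling. Divisor and class ids are
# illustrative.
LABEL_DIVISOR = 1000

def encode_panoptic(semantic: np.ndarray, instance: np.ndarray) -> np.ndarray:
    """Pack per-pixel class and instance ids into one panoptic id map.

    semantic: (H, W) int class ids, covering both "things" and "stuff".
    instance: (H, W) int ids; 0 for "stuff" pixels, >0 per "thing" instance.
    """
    return semantic * LABEL_DIVISOR + instance

def decode_panoptic(panoptic: np.ndarray):
    """Recover the (semantic, instance) maps from the packed encoding."""
    return panoptic // LABEL_DIVISOR, panoptic % LABEL_DIVISOR

# Example: a 2x2 crop containing road ("stuff") and two distinct cars.
semantic = np.array([[7, 13], [13, 7]])   # 7 = road, 13 = car (illustrative)
instance = np.array([[0, 1], [2, 0]])     # two car instances; road has none
sem, inst = decode_panoptic(encode_panoptic(semantic, instance))
assert (sem == semantic).all() and (inst == instance).all()
```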

The field has progressed through several paradigm shifts: (1) from separate semantic and instance segmentation to panoptic segmentation (Kirillov et al., 2019), which jointly handles “things” (countable objects) and “stuff” (amorphous regions like road, sky); (2) from 2D image-plane reasoning to 3D volumetric/BEV representations, enabling direct fusion of camera and LiDAR data; (3) from closed-vocabulary fixed taxonomies to open-vocabulary segmentation using vision-language models like CLIP and SAM; and (4) from modular perception-then-planning pipelines to end-to-end multimodal models like EMMA that directly map sensor data to driving outputs via large language models.

Wei-Chih Hung’s research sits at the intersection of these trends. His recent work spans 3D open-vocabulary panoptic segmentation (ECCV 2024), camera-only 3D detection metrics (LET-3D-AP, ICRA 2024), multi-object tracking (STT, ICRA 2024), and the EMMA end-to-end driving model (2024). His earlier academic work focused on semi-supervised and self-supervised segmentation methods. Understanding this trajectory is key for the interview.


Wei-Chih Hung: Publication Profile

| Year | Paper | Venue | Role / Notes |
|---|---|---|---|
| 2018 | Adversarial Learning for Semi-Supervised Semantic Segmentation | BMVC 2018 | First author. Highly cited (~5k+ total profile citations). Used a discriminator to enable semi-supervised segmentation. |
| 2019 | SCOPS: Self-Supervised Co-Part Segmentation | CVPR 2019 | First author. Discovers object parts without supervision using geometric priors. |
| 2020 | Mixup-CAM: Weakly-Supervised Semantic Segmentation via Uncertainty Regularization | BMVC 2020 | Weakly-supervised segmentation with mixup augmentation. |
| 2020 | Weakly-Supervised Semantic Segmentation via Sub-Category Exploration | CVPR 2020 | Sub-category discovery for weakly-supervised learning. |
| 2022 | LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection | ECCV 2022 / ICRA 2024 | First author (Waymo). New metric addressing depth uncertainty in camera-only detectors. |
| 2022 | Waymo Open Dataset: Panoramic Video Panoptic Segmentation | ECCV 2022 | Waymo team. Introduced the panoramic video panoptic segmentation benchmark. |
| 2024 | 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation | ECCV 2024 | Co-author (Waymo). First method for 3D open-vocabulary panoptic segmentation via CLIP distillation. |
| 2024 | STT: Stateful Tracking with Transformers for Autonomous Driving | ICRA 2024 | Co-author (Waymo). Joint data association + state estimation. |
| 2024 | EMMA: End-to-End Multimodal Model for Autonomous Driving | arXiv (v3: Sep 2025) | Co-author (Waymo). End-to-end driving via a Gemini-based MLLM. |

Key themes in Hung’s work: Semi/weakly/self-supervised learning, segmentation (semantic, panoptic, open-vocabulary), evaluation metrics, end-to-end driving systems.

Note: The Visual Reasoning team’s work on open-vocabulary perception, tracking, and evaluation metrics feeds directly into Waymo’s newer systems – S4-Driver builds on the self-supervised perception paradigm, WOD-E2E benchmarks the E2E driving stack that EMMA pioneered, and the Waymo Foundation Model’s Sensor Fusion Encoder is the production realization of the multi-modal perception research.


Timeline & Evolution of the Field

| Year | Paper / Milestone | Key Innovation |
|---|---|---|
| 2019 | Panoptic Segmentation (Kirillov et al., CVPR 2019) | Defined the panoptic segmentation task, unifying semantic + instance segmentation with the PQ metric |
| 2019 | PointPillars (Lang et al., CVPR 2019) | Fast LiDAR 3D detection using pillar-based point encoding; runs at 62 Hz |
| 2020 | Panoptic-DeepLab (Cheng et al., CVPR 2020) | Real-time bottom-up panoptic segmentation; dual-decoder architecture |
| 2021 | Cylinder3D (Zhu et al., CVPR 2021) | Cylindrical partition for LiDAR semantic segmentation; SOTA on SemanticKITTI |
| 2021 | ViP-DeepLab (Qiao et al., CVPR 2021) | Video panoptic segmentation + monocular depth; temporal consistency |
| 2021 | CenterPoint (Yin et al., CVPR 2021) | Center-based 3D detection from LiDAR; anchor-free design |
| 2022 | BEVFusion (Liu et al., ICRA 2023) | Unified camera-LiDAR fusion in BEV space; task-agnostic framework |
| 2022 | Mask2Former (Cheng et al., CVPR 2022) | Universal architecture for all segmentation tasks via masked-attention transformer |
| 2022 | Waymo Panoramic Video Panoptic Seg (Mei et al., ECCV 2022) | Largest video panoptic segmentation benchmark: 100k images, 5 cameras, 28 classes |
| 2023 | OneFormer (Jain et al., CVPR 2023) | Single model trained once for all three segmentation tasks; task-conditioned tokens |
| 2023 | TPVFormer (Zheng et al., CVPR 2023) | Tri-perspective view for camera-based 3D occupancy prediction |
| 2023 | SurroundOcc (Wei et al., ICCV 2023) | Dense multi-camera 3D occupancy prediction with auto-generated labels |
| 2023 | UniAD (Hu et al., CVPR 2023 Best Paper) | Unified perception-prediction-planning; end-to-end with transformer queries |
| 2023 | SAM (Kirillov et al., ICCV 2023) | Foundation model for promptable segmentation; zero-shot generalization |
| 2024 | 3D Open-Vocab Panoptic Seg (Xiao, Hung, et al., ECCV 2024) | First 3D open-vocabulary panoptic segmentation via CLIP-LiDAR distillation |
| 2024 | SAM 2 (Meta, ICLR 2025) | Extends SAM to video; real-time promptable segmentation in videos |
| 2024 | EMMA (Hwang, Hung, et al., Waymo) | End-to-end MLLM for driving; maps camera data to trajectories/objects via Gemini |
| 2025 | OccMamba (CVPR 2025) | State space models for efficient 3D semantic occupancy prediction |
| 2025 | S4-Driver (Xie, Xu, et al., Waymo, CVPR 2025) | Self-supervised E2E driving with a sparse 3D volume representation built from 2D MLLM features |
| 2025 | Scaling Laws of Motion Forecasting (Baniodeh, Goel, et al., Waymo) | Power-law scaling for driving; model size grows 1.5x faster than data |
| 2025 | WOD-E2E (Xu, Lin, et al., Waymo) | Long-tail E2E driving benchmark; Rater Feedback Score metric |
| 2025 | Waymo Foundation Model (blog) | "Think Fast / Think Slow": Sensor Fusion Encoder + Driving VLM + World Decoder |

Detailed Paper Summaries

Category 1: The Seed Paper and End-to-End Driving

EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, et al. (Waymo, 2024). arXiv:2410.23262

UniAD: Planning-Oriented Autonomous Driving
Yihan Hu et al. (Shanghai AI Lab / SenseTime, CVPR 2023 Best Paper). arXiv:2212.10156


Category 2: Panoptic Segmentation Foundations

Panoptic Segmentation
Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollar (CVPR 2019). arXiv:1801.00868

Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen (Google, CVPR 2020). arXiv:1911.10194

Mask2Former: Masked-Attention Mask Transformer for Universal Image Segmentation
Bowen Cheng, Ishan Misra, Alexander Schwing, Alexander Kirillov, Rohit Girdhar (Meta/UIUC, CVPR 2022). arXiv:2112.01527

OneFormer: One Transformer to Rule Universal Image Segmentation
Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi (SHI-Labs, CVPR 2023). Code on GitHub.


Category 3: Video Panoptic Segmentation & Tracking

ViP-DeepLab: Learning Visual Perception with Depth-Aware Video Panoptic Segmentation
Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen (Google/JHU, CVPR 2021)

Waymo Open Dataset: Panoramic Video Panoptic Segmentation
Jieru Mei, Alex Zihao Zhu, Xinchen Yan, Hang Yan, et al. (Waymo, ECCV 2022). arXiv:2206.07704

STT: Stateful Tracking with Transformers for Autonomous Driving
Waymo authors including Wei-Chih Hung (ICRA 2024). arXiv:2405.00236


Category 4: 3D LiDAR Segmentation

Cylinder3D: An Effective 3D Framework for Driving-Scene LiDAR Semantic Segmentation
Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, Dahua Lin (CUHK, CVPR 2021). arXiv:2008.01550

Instance Segmentation with Cross-Modal Consistency
Alex Zihao Zhu et al. (Waymo, 2022). arXiv:2210.08113


Category 5: Camera-LiDAR Fusion

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation
Zhijian Liu et al. (MIT HAN Lab, ICRA 2023). arXiv:2205.13542. (Note: a different paper with the same name, by Liang et al., appeared at NeurIPS 2022.)
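Conceptually, once camera features have been splatted into the same BEV grid as voxelized LiDAR features, the fusion step reduces to a channel concat plus convolution, and the shared BEV map feeds multiple task heads. A minimal PyTorch sketch in that spirit (module names and channel sizes are illustrative, not the released BEVFusion API):

```python
import torch
import torch.nn as nn

# Conceptual sketch of BEV-space camera-LiDAR fusion in the spirit of
# BEVFusion; names and shapes are illustrative assumptions.
class BEVFuser(nn.Module):
    def __init__(self, cam_ch=80, lidar_ch=256, out_ch=256):
        super().__init__()
        # Simple channel concat + conv fuser over the shared BEV grid.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev:   (B, cam_ch, H, W) camera features lifted into BEV
        # lidar_bev: (B, lidar_ch, H, W) voxelized LiDAR features in BEV
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

# The fused, task-agnostic BEV features are then consumed by separate heads
# (3D detection, map segmentation, ...), which is what makes the framework
# multi-task.
fused = BEVFuser()(torch.zeros(1, 80, 180, 180), torch.zeros(1, 256, 180, 180))
```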


Category 6: Open-Vocabulary & Foundation Model Segmentation

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation
Zihao Xiao, Longlong Jing, …, Wei-Chih Hung, Thomas Funkhouser, et al. (JHU/Waymo/Google, ECCV 2024). arXiv:2401.02402
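To make the distillation idea concrete, here is a minimal sketch of the general 2D-to-3D vision-language distillation recipe: align per-point 3D student features with CLIP image features lifted onto the points, then classify against text embeddings at inference. Function names, the `clip_text_encoder` helper, and the prompt template are illustrative assumptions; this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Generic sketch of 2D-to-3D vision-language distillation for open-vocabulary
# segmentation; NOT the exact ECCV 2024 method.

def distillation_loss(student_feats: torch.Tensor,
                      teacher_feats: torch.Tensor) -> torch.Tensor:
    """Align per-point 3D student embeddings (N, D) with frozen CLIP image
    features (N, D) obtained by projecting LiDAR points into camera views."""
    return 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()

def open_vocab_logits(student_feats: torch.Tensor,
                      class_names: list,
                      clip_text_encoder) -> torch.Tensor:
    """At inference, classify each point against arbitrary text prompts
    (clip_text_encoder is an assumed callable returning (C, D) embeddings)."""
    text = clip_text_encoder([f"a photo of a {c}" for c in class_names])
    text = F.normalize(text, dim=-1)             # (C, D)
    feats = F.normalize(student_feats, dim=-1)   # (N, D)
    return feats @ text.T                        # cosine logits, (N, C)
```

Because the classifier is just a dot product with text embeddings, new categories can be added at test time by editing the prompt list, with no retraining.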

SAM (Segment Anything Model)
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick (Meta AI, ICCV 2023 Best Paper Honorable Mention). arXiv:2304.02643

SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Radle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, Christoph Feichtenhofer (Meta AI, 2024). Code on GitHub.


Category 7: 3D Occupancy Prediction

TPVFormer: Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction
Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu (THU, CVPR 2023). Code on GitHub.

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving
Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu (THU, ICCV 2023). Code on GitHub.


Category 8: Evaluation Metrics (Waymo-Specific)

LET-3D-AP: Longitudinal Error Tolerant 3D Average Precision for Camera-Only 3D Detection
Wei-Chih Hung, Vincent Casser, Henrik Kretzschmar, Jyh-Jing Hwang, Dragomir Anguelov (Waymo, ECCV 2022 / ICRA 2024). arXiv:2206.07705
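As an intuition aid, here is a toy sketch of the metric's core idea: decompose a detection's center error into a longitudinal component (along the sensor-to-object line of sight) and a lateral component, and tolerate only the longitudinal part up to a range-proportional bound. The threshold values and the exact matching/scoring rules below are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

# Toy illustration of longitudinal-error-tolerant matching in the spirit of
# LET-3D-AP; thresholds and rules are illustrative assumptions.
def let_style_match(pred_center: np.ndarray, gt_center: np.ndarray,
                    max_long_frac: float = 0.1, max_lat_m: float = 0.5) -> bool:
    los = gt_center / np.linalg.norm(gt_center)    # unit line-of-sight vector
    err = pred_center - gt_center
    long_err = float(np.dot(err, los))             # depth (longitudinal) error
    lat_err = float(np.linalg.norm(err - long_err * los))  # cross-range error
    gt_range = float(np.linalg.norm(gt_center))
    # Longitudinal error is tolerated up to a fraction of range; lateral
    # error is held to a tight absolute bound.
    return abs(long_err) <= max_long_frac * gt_range and lat_err <= max_lat_m

# A detection 4 m too deep on an object 50 m away can still match...
print(let_style_match(np.array([0.0, 54.0, 0.0]), np.array([0.0, 50.0, 0.0])))
# ...while the same 4 m error applied laterally cannot.
print(let_style_match(np.array([4.0, 50.0, 0.0]), np.array([0.0, 50.0, 0.0])))
```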


Key Concepts & Terminology

| Term | Definition |
|---|---|
| Panoptic Segmentation | Joint semantic + instance segmentation. Every pixel gets a class label; “thing” pixels also get instance IDs. |
| PQ (Panoptic Quality) | Standard metric: PQ = SQ (Segmentation Quality) × RQ (Recognition Quality), evaluating segmentation accuracy and recognition jointly (see the formula after this table). |
| VPQ (Video Panoptic Quality) | Extends PQ to video by evaluating temporal consistency of predictions across frames. |
| STQ (Segmentation and Tracking Quality) | Metric for video panoptic segmentation that separately evaluates segmentation and tracking quality. Used by Waymo. |
| BEV (Bird’s-Eye View) | Top-down 2D representation of the 3D scene. Common intermediate representation for multi-camera fusion. |
| Occupancy Prediction | Predicting whether each 3D voxel in the scene is occupied, and its semantic class. More general than bounding boxes. |
| Open-Vocabulary Segmentation | Segmenting classes not seen during training, using vision-language alignment (e.g., CLIP). |
| Things vs. Stuff | “Things” = countable objects (cars, people); “stuff” = amorphous regions (road, sky, vegetation). |
| MLLM | Multimodal Large Language Model (e.g., Gemini, GPT-4V). EMMA is built on this foundation. |
| Distillation | Transferring knowledge from a large/pretrained model (teacher) to a task-specific model (student). |
| LET-3D-AP | Waymo’s camera-only detection metric that tolerates longitudinal (depth, along the line of sight) localization error within a range-dependent bound, reflecting the depth uncertainty inherent to monocular detectors. |
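For reference, the PQ decomposition from Kirillov et al. (2019), where TP is the set of predicted/ground-truth segment pairs matched with IoU > 0.5:

```latex
\mathrm{PQ}
  = \frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}
         {|\mathit{TP}| + \frac{1}{2}|\mathit{FP}| + \frac{1}{2}|\mathit{FN}|}
  = \underbrace{\frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{SQ: mean IoU of matched segments}}
  \times
  \underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \frac{1}{2}|\mathit{FP}| + \frac{1}{2}|\mathit{FN}|}}_{\text{RQ: an F1-style recognition term}}
```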

Research Frontier: Open Problems & Active Directions

1. Unified Perception-Planning Models

2. Open-Vocabulary & Long-Tail Perception

3. Temporal Consistency & Video Understanding

4. Camera-Only vs. Multi-Modal

5. 3D Occupancy as Alternative to Bounding Boxes

6. Foundation Models for Driving

7. World Models & Simulation


Recommended Reading Order

For maximum understanding before the interview, read in this order:

  1. Panoptic Segmentation (Kirillov et al., 2019) – understand the foundational task definition and PQ metric
  2. Mask2Former (Cheng et al., 2022) – the dominant universal segmentation architecture
  3. BEVFusion (Liu et al., 2022) – how camera + LiDAR fusion works in practice
  4. Waymo Panoramic Video Panoptic Segmentation (Mei et al., ECCV 2022) – Waymo’s own benchmark; know the dataset
  5. LET-3D-AP (Hung et al., 2022) – Wei-Chih’s work on evaluation metrics; understand the philosophy
  6. UniAD (Hu et al., CVPR 2023) – the pre-EMMA paradigm for end-to-end driving
  7. 3D Open-Vocabulary Panoptic Segmentation (Xiao, Hung et al., ECCV 2024) – Wei-Chih’s most relevant recent paper; read carefully
  8. STT: Stateful Tracking with Transformers (ICRA 2024) – tracking side of Hung’s work
  9. EMMA (Hwang, Hung et al., 2024) – the seed paper; read the full paper
  10. SAM 2 (Meta, 2024) – understand foundation model capabilities for driving video
  11. S4-Driver (Xie, Xu et al., Waymo, CVPR 2025) – self-supervised E2E driving; shows annotation-free approaches closing the gap with supervised methods

Interview-Specific Preparation Notes

Questions to be ready for:

Questions you might ask:

Key Waymo Research themes to demonstrate awareness of:

  1. Multi-camera consistency – Waymo cares about consistency across 5 cameras and over time
  2. Metric design – LET-3D-AP shows they think carefully about what to measure
  3. Open-vocabulary generalization – the ECCV 2024 paper signals this is a priority
  4. End-to-end unification – EMMA represents a bold move toward MLLMs for driving
  5. 2025 Open Dataset Challenges – Vision-based E2E driving, long-tail scenarios, interaction prediction

Video Resources

| Topic | Video | Channel | Link |
|---|---|---|---|
| BEVFusion | BEVFusion: Multi-Task Multi-Sensor Fusion | MIT HAN Lab | https://www.youtube.com/watch?v=uCAka90si9E |
| UniAD | UniAD: Planning-oriented Autonomous Driving | OpenDriveLab | https://www.youtube.com/watch?v=cyrxJJ_nnaQ |
| SAM | Segment Anything Paper Explained | No Hype AI | https://www.youtube.com/watch?v=JUMmqX-EHMY |
| E2E Autonomous Driving | Common Misconceptions in Autonomous Driving (Andreas Geiger) | WAD at CVPR | https://www.youtube.com/watch?v=x_42Fji1Z2M |
| Tesla Occupancy Networks | AI for Full Self-Driving (Andrej Karpathy, CVPR 2021) | WAD at CVPR | https://www.youtube.com/watch?v=g6bOwQdCJrc |
| E2E AD Tutorial | End-to-end Autonomous Driving: Past, Current and Onwards | OpenDriveLab | https://youtu.be/Z4n1vlAYqRw |
