Panoptic Segmentation: A First-Principles Guide
March 29, 2026
Target audience: ML practitioners familiar with CNNs and transformers who want to understand how scene segmentation evolved from separate tasks into a unified framework – and where it’s heading next.
Table of Contents
- Overview
- Timeline & Evolution
- The Two Predecessors: Semantic and Instance Segmentation
- Panoptic Segmentation: Unifying the Picture
- The Panoptic Quality Metric
- First-Generation Architectures: Two Branches, One Merge
- The Transformer Revolution: From DETR to Mask2Former
- Open-Vocabulary Panoptic Segmentation
- Video Panoptic Segmentation
- Summary
- Key References
Key terms introduced in this post: semantic segmentation, instance segmentation, panoptic segmentation, stuff classes, thing classes, Panoptic Quality (PQ), Segmentation Quality (SQ), Recognition Quality (RQ), Feature Pyramid Network (FPN), heuristic merge, fully convolutional network (FCN), atrous convolutions, DETR, Hungarian matching, learnable queries, mask classification, MaskFormer, Mask2Former, masked attention, open-vocabulary segmentation, CLIP, vision-language alignment, Video Panoptic Quality (VPQ), segment tubes, query propagation
Overview
Every pixel in an image belongs to something. It might be part of a car, a person, or the road surface. For a machine to truly “understand” a scene, it needs to label every single pixel – identifying both what it is and, for countable objects, which one it is.
For years, the computer vision community attacked this problem with two separate tasks. Semantic segmentation assigned a class label to every pixel but could not distinguish between two people standing side by side. Instance segmentation detected and masked individual objects but ignored amorphous regions like sky, road, and grass. Neither task alone provided a complete scene description.
In 2019, Kirillov et al. introduced panoptic segmentation as a unified task that requires a model to label every pixel with both a class and an instance identity. This simple reframing triggered a wave of architectural innovation – from bolted-together two-branch systems, through transformer-based unified architectures, to today’s open-vocabulary models that can segment categories they have never seen during training and video models that track segments across time.
The field has undergone three major paradigm shifts:
- Separate tasks to unified task (2019): Panoptic segmentation defined a single benchmark that required both stuff and thing understanding.
- Two-branch heuristic merge to unified queries (2020-2022): Transformer decoders replaced the separate semantic/instance heads with a single set of learned queries.
- Closed-vocabulary to open-vocabulary (2023-present): Vision-language models like CLIP replaced fixed classification heads, enabling segmentation of arbitrary categories described in natural language.
This post walks through each of these shifts from first principles.
Timeline & Evolution
| Year | Paper / Method | Key Innovation |
|---|---|---|
| 2015 | FCN (Long et al.) | First end-to-end semantic segmentation with fully convolutional networks |
| 2017 | Mask R-CNN (He et al.) | Added a mask branch to Faster R-CNN for instance segmentation |
| 2017 | DeepLab v3+ (Chen et al.) | Atrous convolutions + encoder-decoder for strong semantic segmentation |
| 2019 | Panoptic Segmentation (Kirillov et al.) | Defined the panoptic task and the PQ metric |
| 2019 | Panoptic FPN (Kirillov et al.) | First strong baseline: FPN backbone with semantic + instance heads |
| 2019 | UPSNet (Xiong et al.) | Learnable panoptic head replacing heuristic merge |
| 2020 | DETR (Carion et al.) | Set prediction with transformers – eliminated hand-crafted components like NMS |
| 2020 | Panoptic-DeepLab (Cheng et al.) | Bottom-up panoptic segmentation without box detection |
| 2021 | MaskFormer (Cheng et al.) | Reframed semantic segmentation as mask classification |
| 2021 | K-Net (Zhang et al.) | Dynamic kernels as unified segment representations |
| 2022 | Mask2Former (Cheng et al.) | Masked attention + multi-scale features; unified architecture for all segmentation tasks |
| 2023 | ODISE (Xu et al.) | Open-vocabulary panoptic seg using diffusion model features + CLIP |
| 2023 | FC-CLIP (Yu et al.) | Simplified open-vocab panoptic seg with frozen CLIP backbone |
| 2023 | TarViS (Athar et al.) | Unified video segmentation across multiple tasks |
| 2024 | Video-kMaX (Shin et al.) | Clip-level video panoptic segmentation with k-means attention |
| 2024 | DVIS++ (Zhang et al.) | Decoupled video instance segmentation extended to video panoptic |
1. The Two Predecessors: Semantic and Instance Segmentation
To understand why panoptic segmentation exists, you first need to understand what it replaced – and why neither predecessor was sufficient on its own.
Semantic segmentation: label every pixel
Semantic segmentation assigns a class label to every pixel in an image. The output is a dense map with the same spatial dimensions as the input: H x W, where each entry is an integer class ID.
The breakthrough came in 2015 when Long et al. showed that a classification network (VGG, later ResNet) could be converted into a fully convolutional network (FCN) by replacing the final fully connected layers with convolutional ones. The key insight: if you remove the global average pooling, a CNN naturally produces a spatial map of class predictions – one per spatial location.
The problem is resolution. Successive pooling layers shrink the feature map, so the raw output is coarse (e.g., 1/32 of input resolution). The field spent years recovering fine spatial detail:
- Atrous (dilated) convolutions (DeepLab family): increase the receptive field without reducing resolution by inserting gaps in the convolution kernel
- Encoder-decoder architectures (U-Net, DeepLab v3+): a contracting path extracts context, an expanding path recovers spatial precision via skip connections
- Multi-scale feature fusion (FPN, PSPNet): combine features from multiple resolutions to capture both local detail and global context
The loss function is straightforward: per-pixel cross-entropy between the predicted class map and the ground truth.
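As a concrete sketch (numpy only; the shapes and toy values are illustrative, not from any particular model), the per-pixel loss is ordinary cross-entropy averaged over all H x W locations:

```python
import numpy as np

def per_pixel_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: float array of shape (H, W, C) -- raw class scores per pixel
    labels: int array of shape (H, W)      -- ground-truth class IDs
    """
    # Numerically stable log-softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # Pick out the log-probability of the true class at each pixel.
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()

# Toy 2x2 image, 3 classes; near-perfect logits give a loss near 0.
logits = np.full((2, 2, 3), -10.0)
labels = np.array([[0, 1], [2, 1]])
for i in range(2):
    for j in range(2):
        logits[i, j, labels[i, j]] = 10.0
loss = per_pixel_cross_entropy(logits, labels)
```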
Semantic segmentation handles stuff classes beautifully – sky, road, grass, water – amorphous regions that do not have countable instances. But it fundamentally cannot distinguish individual objects. If three cars are parked next to each other, every pixel in all three gets the label “car.” A self-driving system that cannot tell where one car ends and the next begins has a serious problem.
Instance segmentation: detect and mask individual objects
Instance segmentation takes the opposite approach. Rather than labeling every pixel, it detects individual object instances and generates a binary mask for each one.
The dominant paradigm is top-down (detect-then-segment): a detector like Faster R-CNN first proposes bounding boxes, and then a lightweight mask head predicts a binary mask within each box. Mask R-CNN (2017) is the canonical example – it adds a small FCN branch in parallel with the existing classification and box regression heads. For each detected box, it predicts a 28 x 28 binary mask, which is then resized to the box dimensions and pasted onto the image.
The key architectural insight of Mask R-CNN is decoupling classification from mask prediction. The mask branch predicts a class-agnostic binary mask per RoI (or one mask per class), while the classification head decides what the object is. This separation works because the mask task is mostly about spatial extent, not semantics.
But instance segmentation only covers things – countable objects with well-defined boundaries (cars, people, dogs). It says nothing about the 60-70% of pixels that belong to stuff classes. An instance segmentation model looking at a street scene will detect the cars and pedestrians but leave the road, sidewalk, buildings, and sky completely unlabeled.
This is the gap that panoptic segmentation fills: a complete scene description requires labeling every pixel (like semantic segmentation) while also distinguishing individual instances of countable objects (like instance segmentation).
2. Panoptic Segmentation: Unifying the Picture
In 2019, Kirillov et al. formalized the panoptic segmentation task. The definition is deceptively simple: assign every pixel in an image a pair (class_id, instance_id).
The task draws a clean line between two kinds of categories:
- Things (countable objects): cars, people, animals. Each instance gets a unique ID. Two cars in the same image are both class “car” but have different instance IDs.
- Stuff (amorphous regions): sky, road, grass, wall. These classes do not have countable instances – all “road” pixels share a single segment.
The panoptic map is encoded as a single integer per pixel:
panoptic_id = class_id * OFFSET + instance_id
where OFFSET is a constant larger than the maximum number of instances (typically 1000). This encoding packs class and instance into one value, making the output a single H x W integer map where every pixel is accounted for. There are no gaps (every pixel must belong to exactly one segment) and no overlaps (unlike instance segmentation, which allows overlapping masks).
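In code, the encoding and its inverse are one line each (assuming OFFSET = 1000 as in the text):

```python
# Pack (class_id, instance_id) into a single integer and unpack it again.
OFFSET = 1000  # must exceed the maximum instance count per class

def encode(class_id, instance_id):
    return class_id * OFFSET + instance_id

def decode(panoptic_id):
    return panoptic_id // OFFSET, panoptic_id % OFFSET

pid = encode(17, 3)  # e.g. the 4th instance (ID 3) of class 17
```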
Why “panoptic”?
The word comes from the Greek panoptikos – “all-seeing.” The name was chosen deliberately: the task demands a complete understanding of the scene, not just objects or just background, but everything.
The stuff-things distinction matters architecturally
This is not just a taxonomic detail – it has deep architectural implications. Stuff and things have fundamentally different statistical properties:
| Property | Stuff | Things |
|---|---|---|
| Shape | Amorphous, large | Well-defined, compact |
| Count | One region per class | Multiple instances per class |
| Best features | High-level context (large receptive field) | Fine boundaries + detection |
| Traditional approach | Per-pixel classification | Detect-then-segment |
This asymmetry is why the first generation of panoptic models used two separate branches – and why unifying them into a single architecture was a significant research challenge.
3. The Panoptic Quality Metric
You cannot optimize what you cannot measure. Kirillov et al. introduced Panoptic Quality (PQ), a metric that evaluates both the segmentation quality and the recognition quality in a single number.
PQ works by first matching predicted segments to ground truth segments. A prediction and a ground truth segment are considered a match (true positive) if their intersection-over-union (IoU) exceeds 0.5. This threshold is chosen carefully – at IoU > 0.5, each predicted segment can match at most one ground truth segment and vice versa, making the matching unique.
PQ decomposes cleanly into two factors:
PQ = SQ * RQ
where:
- Segmentation Quality (SQ) = average IoU of matched segments. This measures how well the model delineates each segment.
- Recognition Quality (RQ) = TP / (TP + 0.5 * FP + 0.5 * FN). This is essentially an F1 score that measures how many segments are correctly found.
The variables:
- TP: true positives (matched pairs with IoU > 0.5)
- FP: false positives (predicted segments with no match)
- FN: false negatives (ground truth segments with no match)
PQ is computed per-class and then averaged. It is reported as PQ, PQ-Things, and PQ-Stuff to separate performance on the two category types.
The beauty of PQ is its interpretability. If PQ is low, you can look at SQ and RQ separately to diagnose whether the problem is poor mask quality (low SQ) or missing/hallucinated segments (low RQ). This is more informative than a single mIoU number that conflates both failure modes.
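The per-class computation can be written directly from the definitions above (the IoU values and error counts below are made up for illustration):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = SQ * RQ for a single class.

    matched_ious: IoUs of matched (TP) segment pairs, each > 0.5
    num_fp: predicted segments with no match
    num_fn: ground-truth segments with no match
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                     # mean IoU of matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)    # F1-style recognition
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
```

With three matches averaging 0.8 IoU, one false positive, and one false negative, this gives SQ = 0.8, RQ = 0.75, PQ = 0.6 – exactly the kind of decomposition that makes the failure mode readable.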
4. First-Generation Architectures: Two Branches, One Merge
The first panoptic segmentation models took the obvious approach: run an existing semantic segmentation model and an existing instance segmentation model, then merge their outputs.
Panoptic FPN (2019)
Panoptic FPN (Kirillov et al., 2019) extended Mask R-CNN with a semantic segmentation branch. The architecture shares a Feature Pyramid Network (FPN) backbone and adds two task-specific heads:
- Instance head: The standard Mask R-CNN head – region proposals, RoI pooling, box regression, classification, and mask prediction per instance.
- Semantic head: A lightweight FCN that takes FPN features from multiple levels, upsamples them to a common resolution, sums them, and predicts per-pixel class logits.
The two outputs must be merged into a single panoptic map. This heuristic merge procedure works as follows:
- Start with the instance predictions, sorted by confidence score.
- For each instance, paste its mask onto the panoptic canvas if it does not overlap too much with already-placed instances (IoU threshold, typically 0.5).
- For remaining unassigned pixels, fill in the semantic prediction, but only for stuff classes.
- Remove stuff segments smaller than a threshold area.
This merge is brittle. It relies on hand-tuned thresholds for overlap resolution, minimum area, and confidence filtering. Conflicts between the semantic and instance heads – where one says “road” and the other says “car” for the same pixel – are resolved by heuristic priority rules rather than learned reasoning.
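The merge steps above can be sketched as follows. This is a simplified version of the Panoptic FPN procedure; the threshold values and the toy example are illustrative, not the paper's exact settings:

```python
import numpy as np

def heuristic_merge(instance_masks, instance_scores, semantic_map,
                    stuff_classes, overlap_thresh=0.5, min_stuff_area=64):
    """Merge instance and semantic predictions into one panoptic map."""
    h, w = semantic_map.shape
    panoptic = np.zeros((h, w), dtype=np.int64)  # 0 = unassigned
    next_id = 1
    # 1. Paste instances in descending confidence order.
    order = np.argsort(instance_scores)[::-1]
    for i in order:
        mask = instance_masks[i].astype(bool)
        if mask.sum() == 0:
            continue
        overlap = (panoptic[mask] != 0).sum()
        if overlap / mask.sum() > overlap_thresh:
            continue  # too much of this mask is already claimed
        free = mask & (panoptic == 0)
        panoptic[free] = next_id
        next_id += 1
    # 2. Fill remaining pixels with stuff predictions only.
    for cls in stuff_classes:
        region = (semantic_map == cls) & (panoptic == 0)
        if region.sum() >= min_stuff_area:  # 3. drop tiny stuff segments
            panoptic[region] = next_id
            next_id += 1
    return panoptic

# Toy example: one confident instance plus a stuff region.
sem = np.zeros((4, 4), dtype=int)
sem[:, 2:] = 5                     # class 5 ("road", say) on the right half
inst = np.zeros((1, 4, 4))
inst[0, :2, :2] = 1                # one instance in the top-left corner
pan = heuristic_merge(inst, np.array([0.9]), sem,
                      stuff_classes=[5], min_stuff_area=4)
```

Every constant here (overlap threshold, minimum area) is a hand-tuned knob – which is precisely the brittleness the text describes.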
UPSNet and Panoptic-DeepLab
UPSNet (Xiong et al., 2019) tried to improve on the heuristic merge by introducing a learnable panoptic head that takes both semantic and instance features and produces the final panoptic map via learned attention. This was an early attempt at making the merge differentiable.
Panoptic-DeepLab (Cheng et al., 2020) took a different path entirely – a bottom-up approach. Instead of detecting objects with bounding boxes, it predicted:
- Semantic segmentation (per-pixel class labels)
- Instance center heatmaps (where are the object centers?)
- Center offsets (for each pixel, a 2D vector pointing to its instance center)
Grouping pixels into instances is done by following the offset vectors to cluster pixels around predicted centers. This avoids the detector entirely, removing the need for anchor boxes, NMS, and RoI operations. The downside is that the grouping step itself can be fragile, especially for large or heavily occluded objects.
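A minimal sketch of that grouping step, assuming the center heatmap has already been reduced to a list of peak coordinates (function name and shapes are my own, not Panoptic-DeepLab's API):

```python
import numpy as np

def group_by_offsets(centers, offsets):
    """Assign each pixel to the predicted center nearest to (pixel + offset).

    centers: (K, 2) array of predicted instance-center (y, x) coordinates
    offsets: (H, W, 2) per-pixel 2D vectors pointing to the instance center
    Returns an (H, W) map of instance indices into `centers`.
    """
    h, w, _ = offsets.shape
    ys, xs = np.mgrid[0:h, 0:w]
    voted = np.stack([ys, xs], axis=-1) + offsets  # where each pixel points
    # Distance from every pixel's vote to every candidate center.
    d = np.linalg.norm(voted[:, :, None, :] - centers[None, None, :, :],
                       axis=-1)
    return d.argmin(axis=-1)  # nearest center wins

# Two centers on a 4x4 grid; zero offsets mean each pixel votes for itself.
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
offsets = np.zeros((4, 4, 2))
assign = group_by_offsets(centers, offsets)
```

The fragility mentioned above is visible even here: a pixel whose offset vector is slightly wrong simply snaps to the wrong center, with no learned mechanism to correct it.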
Despite their differences, all first-generation models shared a fundamental limitation: they treated stuff and things as separate problems requiring separate processing pipelines, with some reconciliation step at the end.
5. The Transformer Revolution: From DETR to Mask2Former
The biggest paradigm shift in panoptic segmentation came from an unexpected direction: treating segmentation as a set prediction problem.
DETR: the foundation (2020)
DETR (DEtection TRansformer, Carion et al., 2020) reimagined object detection as direct set prediction. Instead of generating thousands of candidate boxes and filtering them with NMS, DETR uses a transformer decoder with a fixed set of N learnable queries (typically N=100). Each query attends to the image features and outputs a prediction: either an object (class + bounding box) or “no object.”
The training uses Hungarian matching – a bipartite matching algorithm that finds the optimal one-to-one assignment between predictions and ground truth, then computes the loss only on matched pairs. This eliminates the need for NMS entirely.
DETR itself was an object detector, but the query-based paradigm opened a door. If each query can predict a bounding box, why not a mask?
MaskFormer: segmentation as mask classification (2021)
MaskFormer (Cheng et al., 2021) made a conceptual leap. Traditional semantic segmentation is framed as per-pixel classification – each pixel independently chooses a class. MaskFormer reframed it as mask classification – predict a set of masks, classify each mask, and compose them to get the final segmentation.
The architecture:
- A backbone (e.g., ResNet, Swin Transformer) extracts image features.
- A pixel decoder (e.g., FPN) produces high-resolution per-pixel embeddings.
- A transformer decoder takes N learnable queries and, through cross-attention with the image features, outputs N mask embeddings and N class predictions.
- Each mask is computed as a dot product between the mask embedding and the per-pixel embeddings, producing a binary mask at full resolution.
The final panoptic map is assembled by assigning each pixel to the highest-confidence mask that covers it.
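This assembly step can be sketched in a few lines of numpy (shapes follow the description above; the weighting scheme is a simplification of MaskFormer's actual inference, which also filters low-confidence and low-overlap masks):

```python
import numpy as np

def assemble_panoptic(mask_embeds, pixel_embeds, class_probs):
    """Compose per-query masks into a single segmentation map.

    mask_embeds: (N, D)    -- one embedding per query
    pixel_embeds: (H, W, D) -- per-pixel embeddings from the pixel decoder
    class_probs: (N,)       -- confidence of each query's predicted class
    """
    # Dot product -> one mask logit map per query, shape (N, H, W).
    mask_logits = np.einsum('nd,hwd->nhw', mask_embeds, pixel_embeds)
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoid
    # Weight each mask by its class confidence; each pixel goes to the winner.
    scores = class_probs[:, None, None] * mask_probs
    return scores.argmax(axis=0)  # (H, W) map of query indices

# Two queries, a 1x2 "image": each query claims the pixel aligned with it.
mask_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
pixel_embeds = np.array([[[5.0, 0.0], [0.0, 5.0]]])  # shape (1, 2, 2)
seg_map = assemble_panoptic(mask_embeds, pixel_embeds, np.array([1.0, 1.0]))
```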
The key insight: this architecture is task-agnostic. The same N queries can represent stuff segments or thing instances. There is no separate “semantic head” or “instance head” – every query works the same way, and the distinction between stuff and things only appears in the class labels. A query that predicts class “road” produces a stuff segment; a query that predicts class “car” produces a thing instance. The model itself does not need to know the difference.
Mask2Former: the current workhorse (2022)
Mask2Former (Cheng et al., 2022) refined MaskFormer with two crucial technical improvements:
- Masked attention: Instead of each query attending to all spatial locations (global cross-attention), each query only attends to the spatial region where its predicted mask from the previous decoder layer has high probability. This is more efficient and provides a strong inductive bias – a “car” query focuses on car-shaped regions, not the entire image.
- Multi-scale features: The transformer decoder processes features at multiple resolutions in sequence (1/32 -> 1/16 -> 1/8), progressively refining the masks from coarse to fine.
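Masked attention is a small change to standard cross-attention: attention logits outside the query's current mask estimate are set to negative infinity before the softmax. A self-contained numpy sketch (simplified single-head attention; the variable shapes are my own, and Mask2Former additionally falls back to full attention when a mask is empty):

```python
import numpy as np

def masked_attention(queries, keys, values, mask_probs, thresh=0.5):
    """Cross-attention where each query only attends inside its own mask.

    queries: (N, D); keys/values: (P, D) flattened image features
    mask_probs: (N, P) -- each query's predicted mask from the previous layer
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    # Block attention to locations outside the query's current mask.
    scores = np.where(mask_probs >= thresh, scores, -1e9)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ values

# One query, two locations. Location 1 has a much stronger key, but the
# query's mask covers only location 0 -- so the output stays at value 1.0.
q = np.array([[1.0]])
k = np.array([[1.0], [10.0]])
v = np.array([[1.0], [100.0]])
mp = np.array([[1.0, 0.0]])
out = masked_attention(q, k, v, mp)
```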
Mask2Former achieved state-of-the-art results on panoptic, instance, and semantic segmentation simultaneously – the first single architecture to top all three leaderboards. It established that the artificial separation between segmentation tasks was an artifact of our formulations, not a fundamental property of the problem.
Why did transformers win here?
The success of transformer-based panoptic models comes down to three properties:
- Set prediction naturally handles variable numbers of objects. The Hungarian matching loss allows the model to predict any number of segments (up to N) without requiring NMS or anchor boxes.
- Self-attention between queries enables competition. Queries can learn to specialize – one query takes the left car, another takes the right car – because they can “see” each other through self-attention.
- The query abstraction unifies stuff and things. A query is just a learned vector that gets refined through attention. It does not care whether it will end up representing a sky region or a person.
6. Open-Vocabulary Panoptic Segmentation
All models discussed so far operate in a closed-vocabulary setting: the set of recognizable classes is fixed at training time. A model trained on COCO’s 133 panoptic categories cannot segment a “skateboard ramp” or “solar panel” – it has no output neuron for those classes.
Open-vocabulary panoptic segmentation breaks this constraint by replacing the fixed classification head with a vision-language alignment mechanism.
The core idea
Instead of a learned weight matrix W of shape [num_classes, embed_dim] that maps mask embeddings to class logits, open-vocabulary models use a text encoder (typically from CLIP) to generate class embeddings on the fly:
Closed-vocab: logits = mask_embedding @ W.T (W is fixed, learned during training)
Open-vocab: logits = mask_embedding @ text_embeds.T (text_embeds come from any text)
At inference time, you can provide any list of class names as text prompts. The text encoder converts each name into an embedding vector, and classification becomes a dot product between the mask embedding and each text embedding. If the mask embedding for a segment is closest to “solar panel” in CLIP’s shared embedding space, that is its predicted class – even though the model never saw a “solar panel” during segmentation training.
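The classification step reduces to cosine similarity in the shared embedding space. A toy sketch with hand-made 3-dimensional embeddings (real CLIP embeddings are 512+ dimensional, and the temperature value here is illustrative):

```python
import numpy as np

def open_vocab_logits(mask_embedding, text_embeds, temperature=0.01):
    """Score one segment against arbitrary text prompts, CLIP-style.

    mask_embedding: (D,)   -- embedding of one predicted segment
    text_embeds: (K, D)    -- embeddings of K class-name prompts
    Both are L2-normalized so the dot product is cosine similarity.
    """
    m = mask_embedding / np.linalg.norm(mask_embedding)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (m @ t.T) / temperature  # higher = closer in the shared space

# Hypothetical embeddings: the segment lands closest to the second prompt.
seg = np.array([0.1, 0.9, 0.0])
prompts = np.array([[1.0, 0.0, 0.0],   # "car"
                    [0.0, 1.0, 0.0],   # "solar panel"
                    [0.0, 0.0, 1.0]])  # "sky"
pred = int(open_vocab_logits(seg, prompts).argmax())
```

Swapping the prompt list swaps the vocabulary – no retraining, no new output neurons.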
Key methods
ODISE (Xu et al., 2023) was one of the first strong open-vocabulary panoptic models. It leveraged an interesting insight: the internal features of a pre-trained text-to-image diffusion model (Stable Diffusion) contain rich semantic information about visual concepts. ODISE extracts features from the diffusion model’s UNet, combines them with CLIP embeddings, and feeds both into a Mask2Former-style decoder. The diffusion features capture fine-grained visual patterns, while CLIP provides the vision-language alignment needed for open-vocabulary classification.
FC-CLIP (Yu et al., 2023) showed that a much simpler approach also works well. It uses a single frozen CLIP backbone (ConvNeXt variant) for both feature extraction and classification, with a lightweight Mask2Former decoder on top. The frozen CLIP features already contain enough information for mask prediction – no diffusion model needed. FC-CLIP demonstrated that the key ingredient is a strong pre-trained vision-language backbone, not architectural complexity.
SAN (Xu et al., 2023) introduced side adapter networks that attach lightweight adapters to a frozen CLIP model, enabling it to produce dense predictions while preserving its open-vocabulary capabilities.
The training challenge
Open-vocabulary models face a bootstrapping problem: how do you train a segmentation model on a fixed dataset (e.g., COCO with 133 classes) and have it generalize to thousands of unseen classes?
The key strategies are:
- Freeze the CLIP backbone (or use very low learning rates) to preserve its open-vocabulary knowledge. If you fine-tune CLIP aggressively on 133 classes, it “forgets” the other concepts it learned from 400 million image-text pairs.
- Train the mask prediction and classification separately. The mask decoder learns to produce good spatial masks using standard segmentation data. The classification leverages CLIP’s zero-shot transfer. The two capabilities are combined at inference.
- Use large-vocabulary detection data (e.g., LVIS with 1203 classes, or Objects365) during training to expose the model to more categories, even if the panoptic annotations are limited.
Current open-vocabulary panoptic models achieve competitive performance on the training categories (within a few PQ points of closed-vocabulary specialists) while also performing well on novel categories – a significant advancement toward general-purpose scene understanding.
7. Video Panoptic Segmentation
Image-level panoptic segmentation assigns every pixel a (class, instance) label in a single frame. Video panoptic segmentation (VPS) extends this to video: the model must produce a panoptic map for every frame and maintain consistent instance IDs across time.
If person #3 appears in frame 10, the model must recognize that the same person is still person #3 in frame 50, even if they were temporarily occluded in frames 30-35. This temporal consistency requirement transforms the problem from pure segmentation into segmentation + tracking.
The VPQ metric
Kim et al. (2020) introduced Video Panoptic Quality (VPQ), which extends image-level PQ to the temporal domain. Instead of matching individual segment masks, VPQ matches segment tubes – sequences of masks belonging to the same instance across k consecutive frames.
For a predicted tube and a ground truth tube to match, their tube IoU must exceed 0.5:
tube_IoU = (sum of per-frame intersection) / (sum of per-frame union)
VPQ penalizes both poor spatial segmentation (low IoU within each frame) and ID switches (breaking a tube into multiple fragments or merging different objects). It is typically reported at multiple temporal window sizes (k = 5, 10, 15) to evaluate both short-term and long-term consistency.
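The tube IoU formula above translates directly to code; the trick is that intersections and unions are summed across frames before dividing, so dropping an instance in even one frame dilutes the whole tube's score:

```python
import numpy as np

def tube_iou(pred_masks, gt_masks):
    """IoU between two segment tubes over k frames.

    pred_masks, gt_masks: boolean arrays of shape (k, H, W); a frame where
    the instance is absent is simply an all-False mask.
    """
    inter = np.logical_and(pred_masks, gt_masks).sum()
    union = np.logical_or(pred_masks, gt_masks).sum()
    return inter / union if union > 0 else 0.0

# Two frames of a 2x2 image: perfect match in frame 0, but the prediction
# keeps one pixel in frame 1 after the ground-truth instance has left.
pred = np.zeros((2, 2, 2), dtype=bool)
gt = np.zeros((2, 2, 2), dtype=bool)
pred[0, 0, :] = True
gt[0, 0, :] = True
pred[1, 0, 0] = True   # spurious persistence in frame 1
iou = tube_iou(pred, gt)   # 2 / 3
```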
A supplementary metric, Segmentation and Tracking Quality (STQ) (Weber et al., 2021), provides an alternative decomposition that separates spatial segmentation accuracy from temporal association quality more cleanly.
Architectural approaches
Video panoptic architectures fall into three broad categories:
1. Online (frame-by-frame + association): Process each frame independently with an image-level panoptic model, then associate instances across frames using tracking. This is the simplest approach and the most memory-efficient, but temporal context is limited to the association mechanism.
- ViP-DeepLab (Qiao et al., 2021) extends Panoptic-DeepLab with a next-frame instance prediction head that directly estimates how instances move between consecutive frames.
- Tracking-by-attention methods use the transformer queries themselves as instance memory – the same query vector that represents “person #3” in frame t is carried forward and refined in frame t+1.
2. Near-online (clip-based): Process short clips (e.g., 4-8 frames) jointly to leverage temporal context within the clip, then link instances across clips.
- Video K-Net (Li et al., 2022) and Video-kMaX (Shin et al., 2024) extend the kernel/query-based framework to short clips. Instance queries attend to multi-frame features, producing temporally coherent masks within each clip.
- TarViS (Athar et al., 2023) proposes a unified architecture for multiple video segmentation tasks (VPS, VIS, VOS) using task-specific prompts with a shared transformer decoder.
3. Offline (full video): Process the entire video at once (or in long overlapping windows). This gives the most temporal context but is computationally expensive and impractical for real-time applications.
The tracking-segmentation tension
Video panoptic segmentation reveals a fundamental tension: segmentation wants per-frame accuracy (sharp boundaries, correct classification), while tracking wants temporal smoothness (consistent IDs, no flickering). These objectives can conflict – a model might produce better per-frame masks by re-segmenting from scratch each frame, but this risks ID inconsistency. Conversely, propagating masks temporally maintains IDs but can accumulate drift and boundary errors.
The best current methods address this by using query propagation with periodic re-detection – instance queries are propagated across frames for consistency, but the model also attends to fresh image features each frame to correct drift. This balance between memory (temporal propagation) and perception (per-frame extraction) is a key design axis.
Current state and open challenges
Video panoptic segmentation remains significantly harder than its image counterpart. On the Cityscapes-VPS and VIPSeg benchmarks, even the best methods leave substantial room for improvement, particularly for:
- Long-range re-identification: Recognizing an instance that was occluded for dozens or hundreds of frames
- Fast motion and deformation: Maintaining consistent IDs when objects move rapidly or change appearance (e.g., a person turning around)
- Stuff consistency: Ensuring that stuff regions like “road” maintain coherent boundaries across frames without flickering
- Efficiency: Real-time video panoptic segmentation at high resolution remains an open challenge for autonomous driving and robotics applications
Summary
The journey from separate semantic and instance segmentation to unified video panoptic segmentation follows a clear arc of increasing ambition:
- Semantic segmentation (2015+) solved per-pixel classification but could not distinguish individual objects.
- Instance segmentation (2017+) solved object-level detection and masking but ignored background regions.
- Panoptic segmentation (2019) unified both tasks with a clean formulation: every pixel gets a (class, instance) label. First-generation models bolted together existing semantic and instance pipelines with heuristic merging.
- Transformer-based unification (2020-2022) replaced the two-branch paradigm with learned queries that treat stuff and things identically. Mask2Former became the universal architecture.
- Open-vocabulary extension (2023+) replaced fixed classifiers with vision-language alignment, enabling segmentation of arbitrary categories through text prompts.
- Video extension (2020+) added temporal consistency, requiring models to track instance identities across frames – merging segmentation with multi-object tracking.
The current frontier is converging these advances: models that can perform open-vocabulary video panoptic segmentation in real time – understanding every pixel, every object, every frame, for any category you can name. This remains an open challenge, but the architectural foundations – transformer decoders with learned queries, vision-language backbones, and temporal attention – are firmly in place.
Key References
| Year | Paper | Contribution |
|---|---|---|
| 2015 | Long et al., “Fully Convolutional Networks for Semantic Segmentation” | Established FCN paradigm for dense per-pixel prediction |
| 2017 | He et al., “Mask R-CNN” | Added instance mask branch to Faster R-CNN; dominant instance segmentation approach for years |
| 2019 | Kirillov et al., “Panoptic Segmentation” | Defined the panoptic task, the stuff/things distinction, and the PQ metric |
| 2019 | Kirillov et al., “Panoptic Feature Pyramid Networks” | First competitive panoptic baseline with FPN backbone |
| 2020 | Carion et al., “End-to-End Object Detection with Transformers (DETR)” | Introduced query-based set prediction that would reshape all segmentation tasks |
| 2020 | Cheng et al., “Panoptic-DeepLab” | Box-free bottom-up panoptic segmentation |
| 2020 | Kim et al., “Video Panoptic Segmentation” | Defined the VPS task and VPQ metric |
| 2021 | Cheng et al., “Per-Pixel Classification is Not All You Need (MaskFormer)” | Reframed segmentation as mask classification; unified stuff and things architecturally |
| 2022 | Cheng et al., “Masked-attention Mask Transformer (Mask2Former)” | Masked cross-attention + multi-scale decoding; state-of-the-art across all segmentation tasks |
| 2023 | Xu et al., “Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models (ODISE)” | Leveraged diffusion model features for open-vocabulary panoptic segmentation |
| 2023 | Yu et al., “Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen CLIP (FC-CLIP)” | Simple, effective open-vocab panoptic seg with frozen ConvNeXt-CLIP |
| 2023 | Athar et al., “TarViS: A Unified Architecture for Target-based Video Segmentation” | Unified architecture for VPS, VIS, and VOS with task prompts |
| 2024 | Shin et al., “Video-kMaX” | Clip-level video panoptic segmentation with k-means cross-attention |