Panoptic Segmentation: A First-Principles Guide
March 29, 2026
Target audience: ML practitioners familiar with CNNs and transformers who want to understand how scene segmentation evolved from separate tasks into a unified framework – and where it’s heading next.
Table of Contents
- Overview
- Timeline & Evolution
- The Two Predecessors: Semantic and Instance Segmentation
- Panoptic Segmentation: Unifying the Picture
- The Panoptic Quality Metric
- First-Generation Architectures: Two Branches, One Merge
- The Transformer Revolution: From DETR to Mask2Former
- Open-Vocabulary Panoptic Segmentation
- Video Panoptic Segmentation
- Summary
- Key References
Key terms introduced in this post: semantic segmentation, instance segmentation, panoptic segmentation, stuff classes, thing classes, Panoptic Quality (PQ), Segmentation Quality (SQ), Recognition Quality (RQ), Feature Pyramid Network (FPN), heuristic merge, fully convolutional network (FCN), atrous convolutions, DETR, Hungarian matching, learnable queries, mask classification, MaskFormer, Mask2Former, masked attention, open-vocabulary segmentation, CLIP, vision-language alignment, Video Panoptic Quality (VPQ), segment tubes, query propagation
Overview
Every pixel in an image belongs to something. It might be part of a car, a person, or the road surface. For a machine to truly “understand” a scene, it needs to label every single pixel – identifying both what it is and, for countable objects, which one it is.
For years, the computer vision community attacked this problem with two separate tasks. Semantic segmentation assigned a class label to every pixel but could not distinguish between two people standing side by side. Instance segmentation detected and masked individual objects but ignored amorphous regions like sky, road, and grass. Neither task alone provided a complete scene description.
In 2019, Kirillov et al. introduced panoptic segmentation as a unified task that requires a model to label every pixel with both a class and an instance identity. This simple reframing triggered a wave of architectural innovation – from bolted-together two-branch systems, through transformer-based unified architectures, to today’s open-vocabulary models that can segment categories they have never seen during training and video models that track segments across time.
The field has undergone three major paradigm shifts:
- Separate tasks to unified task (2019): Panoptic segmentation defined a single benchmark that required both stuff and thing understanding.
- Two-branch heuristic merge to unified queries (2020-2022): Transformer decoders replaced the separate semantic/instance heads with a single set of learned queries.
- Closed-vocabulary to open-vocabulary (2023-present): Vision-language models like CLIP replaced fixed classification heads, enabling segmentation of arbitrary categories described in natural language.
This post walks through each of these shifts from first principles.
Timeline & Evolution
| Year | Paper / Method | Key Innovation |
|---|---|---|
| 2015 | FCN (Long et al.) | First end-to-end semantic segmentation with fully convolutional networks |
| 2017 | Mask R-CNN (He et al.) | Added a mask branch to Faster R-CNN for instance segmentation |
| 2017 | DeepLab v3+ (Chen et al.) | Atrous convolutions + encoder-decoder for strong semantic segmentation |
| 2019 | Panoptic Segmentation (Kirillov et al.) | Defined the panoptic task and the PQ metric |
| 2019 | Panoptic FPN (Kirillov et al.) | First strong baseline: FPN backbone with semantic + instance heads |
| 2019 | UPSNet (Xiong et al.) | Learnable panoptic head replacing heuristic merge |
| 2020 | DETR (Carion et al.) | Set prediction with transformers – eliminated hand-crafted components like NMS |
| 2020 | Panoptic-DeepLab (Cheng et al.) | Bottom-up panoptic segmentation without box detection |
| 2021 | MaskFormer (Cheng et al.) | Reframed semantic segmentation as mask classification |
| 2021 | K-Net (Zhang et al.) | Dynamic kernels as unified segment representations |
| 2022 | Mask2Former (Cheng et al.) | Masked attention + multi-scale features; unified architecture for all segmentation tasks |
| 2023 | ODISE (Xu et al.) | Open-vocabulary panoptic seg using diffusion model features + CLIP |
| 2023 | FC-CLIP (Yu et al.) | Simplified open-vocab panoptic seg with frozen CLIP backbone |
| 2023 | TarViS (Athar et al.) | Unified video segmentation across multiple tasks |
| 2024 | Video-kMaX (Shin et al.) | Clip-level video panoptic segmentation with k-means attention |
| 2024 | DVIS++ (Zhang et al.) | Decoupled video instance segmentation extended to video panoptic |
1. The Two Predecessors: Semantic and Instance Segmentation
To understand why panoptic segmentation exists, you first need to understand what it replaced – and why neither predecessor was sufficient on its own.
Semantic segmentation: label every pixel
Semantic segmentation assigns a class label to every pixel in an image. The output is a dense map with the same spatial dimensions as the input: H x W, where each entry is an integer class ID.
The breakthrough came in 2015 when Long et al. showed that a classification network (VGG, later ResNet) could be converted into a fully convolutional network (FCN) by replacing the final fully connected layers with convolutional ones. The key insight: if you remove the global average pooling, a CNN naturally produces a spatial map of class predictions – one per spatial location.
The problem is resolution. Successive pooling layers shrink the feature map, so the raw output is coarse (e.g., 1/32 of input resolution). The field spent years recovering fine spatial detail:
- Atrous (dilated) convolutions (DeepLab family): increase the receptive field without reducing resolution by inserting gaps in the convolution kernel
- Encoder-decoder architectures (U-Net, DeepLab v3+): a contracting path extracts context, an expanding path recovers spatial precision via skip connections
- Multi-scale feature fusion (FPN, PSPNet): combine features from multiple resolutions to capture both local detail and global context
The loss function is straightforward: per-pixel cross-entropy between the predicted class map and the ground truth.
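As a concrete sketch (numpy only; the shapes and toy values are illustrative, not from any particular model), the per-pixel loss is ordinary cross-entropy averaged over all H x W locations:

```python
import numpy as np

def per_pixel_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: float array of shape (H, W, C) -- raw class scores per pixel
    labels: int array of shape (H, W)      -- ground-truth class IDs
    """
    # Numerically stable log-softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # Pick out the log-probability of the true class at each pixel.
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()

# Toy 2x2 image, 3 classes; near-perfect logits give a loss near 0.
logits = np.full((2, 2, 3), -10.0)
labels = np.array([[0, 1], [2, 1]])
for i in range(2):
    for j in range(2):
        logits[i, j, labels[i, j]] = 10.0
loss = per_pixel_cross_entropy(logits, labels)
```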
Semantic segmentation handles stuff classes beautifully – sky, road, grass, water – amorphous regions that do not have countable instances. But it fundamentally cannot distinguish individual objects. If three cars are parked next to each other, every pixel in all three gets the label “car.” A self-driving system that cannot tell where one car ends and the next begins has a serious problem.
Instance segmentation: detect and mask individual objects
Instance segmentation takes the opposite approach. Rather than labeling every pixel, it detects individual object instances and generates a binary mask for each one.
The dominant paradigm is top-down (detect-then-segment): a detector like Faster R-CNN first proposes bounding boxes, and then a lightweight mask head predicts a binary mask within each box. Mask R-CNN (2017) is the canonical example – it adds a small FCN branch in parallel with the existing classification and box regression heads. For each detected box, it predicts a 28 x 28 binary mask, which is then resized to the box dimensions and pasted onto the image.
The key architectural insight of Mask R-CNN is decoupling classification from mask prediction. The mask branch predicts a class-agnostic binary mask per RoI (or one mask per class), while the classification head decides what the object is. This separation works because the mask task is mostly about spatial extent, not semantics.
But instance segmentation only covers things – countable objects with well-defined boundaries (cars, people, dogs). It says nothing about the 60-70% of pixels that belong to stuff classes. An instance segmentation model looking at a street scene will detect the cars and pedestrians but leave the road, sidewalk, buildings, and sky completely unlabeled.
This is the gap that panoptic segmentation fills: a complete scene description requires labeling every pixel (like semantic segmentation) while also distinguishing individual instances of countable objects (like instance segmentation).
2. Panoptic Segmentation: Unifying the Picture
In 2019, Kirillov et al. formalized the panoptic segmentation task. The definition is deceptively simple: assign every pixel in an image a pair (class_id, instance_id).
The task draws a clean line between two kinds of categories:
- Things (countable objects): cars, people, animals. Each instance gets a unique ID. Two cars in the same image are both class “car” but have different instance IDs.
- Stuff (amorphous regions): sky, road, grass, wall. These classes do not have countable instances – all “road” pixels share a single segment.
The panoptic map is encoded as a single integer per pixel:
panoptic_id = class_id * OFFSET + instance_id
where OFFSET is a constant larger than the maximum number of instances (typically 1000). This encoding packs class and instance into one value, making the output a single H x W integer map where every pixel is accounted for. There are no gaps (every pixel must belong to exactly one segment) and no overlaps (unlike instance segmentation, which allows overlapping masks).
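In code, the encoding and its inverse are one line each (assuming OFFSET = 1000 as in the text):

```python
# Pack (class_id, instance_id) into a single integer and unpack it again.
OFFSET = 1000  # must exceed the maximum instance count per class

def encode(class_id, instance_id):
    return class_id * OFFSET + instance_id

def decode(panoptic_id):
    return panoptic_id // OFFSET, panoptic_id % OFFSET

pid = encode(17, 3)  # e.g. the 4th instance (ID 3) of class 17
```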
Why “panoptic”?
The word comes from the Greek panoptikos – “all-seeing.” The name was chosen deliberately: the task demands a complete understanding of the scene, not just objects or just background, but everything.
The stuff-things distinction matters architecturally
This is not just a taxonomic detail – it has deep architectural implications. Stuff and things have fundamentally different statistical properties:
| Property | Stuff | Things |
|---|---|---|
| Shape | Amorphous, large | Well-defined, compact |
| Count | One region per class | Multiple instances per class |
| Best features | High-level context (large receptive field) | Fine boundaries + detection |
| Traditional approach | Per-pixel classification | Detect-then-segment |
This asymmetry is why the first generation of panoptic models used two separate branches – and why unifying them into a single architecture was a significant research challenge.
3. The Panoptic Quality Metric
You cannot optimize what you cannot measure. Kirillov et al. introduced Panoptic Quality (PQ), a metric that evaluates both the segmentation quality and the recognition quality in a single number.
PQ works by first matching predicted segments to ground truth segments. A prediction and a ground truth segment are considered a match (true positive) if their intersection-over-union (IoU) exceeds 0.5. This threshold is chosen carefully – at IoU > 0.5, each predicted segment can match at most one ground truth segment and vice versa, making the matching unique.
PQ decomposes cleanly into two factors:
PQ = SQ * RQ
where:
- Segmentation Quality (SQ) = average IoU of matched segments. This measures how well the model delineates each segment.
- Recognition Quality (RQ) = TP / (TP + 0.5 * FP + 0.5 * FN). This is essentially an F1 score that measures how many segments are correctly found.
The variables:
- TP: true positives (matched pairs with IoU > 0.5)
- FP: false positives (predicted segments with no match)
- FN: false negatives (ground truth segments with no match)
PQ is computed per-class and then averaged. It is reported as PQ, PQ-Things, and PQ-Stuff to separate performance on the two category types.
The beauty of PQ is its interpretability. If PQ is low, you can look at SQ and RQ separately to diagnose whether the problem is poor mask quality (low SQ) or missing/hallucinated segments (low RQ). This is more informative than a single mIoU number that conflates both failure modes.
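The per-class computation can be written directly from the definitions above (the IoU values and error counts below are made up for illustration):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = SQ * RQ for a single class.

    matched_ious: IoUs of matched (TP) segment pairs, each > 0.5
    num_fp: predicted segments with no match
    num_fn: ground-truth segments with no match
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                     # mean IoU of matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)    # F1-style recognition
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
```

With three matches averaging 0.8 IoU, one false positive, and one false negative, this gives SQ = 0.8, RQ = 0.75, PQ = 0.6 – exactly the kind of decomposition that makes the failure mode readable.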
4. First-Generation Architectures: Two Branches, One Merge
The first panoptic segmentation models took the obvious approach: run an existing semantic segmentation model and an existing instance segmentation model, then merge their outputs.
Panoptic FPN (2019)
Panoptic FPN (Kirillov et al., 2019) extended Mask R-CNN with a semantic segmentation branch. The architecture shares a Feature Pyramid Network (FPN) backbone and adds two task-specific heads:
- Instance head: The standard Mask R-CNN head – region proposals, RoI pooling, box regression, classification, and mask prediction per instance.
- Semantic head: A lightweight FCN that takes FPN features from multiple levels, upsamples them to a common resolution, sums them, and predicts per-pixel class logits.
The two outputs must be merged into a single panoptic map. This heuristic merge procedure works as follows:
- Start with the instance predictions, sorted by confidence score.
- For each instance, paste its mask onto the panoptic canvas if it does not overlap too much with already-placed instances (IoU threshold, typically 0.5).
- For remaining unassigned pixels, fill in the semantic prediction, but only for stuff classes.
- Remove stuff segments smaller than a threshold area.
This merge is brittle. It relies on hand-tuned thresholds for overlap resolution, minimum area, and confidence filtering. Conflicts between the semantic and instance heads – where one says “road” and the other says “car” for the same pixel – are resolved by heuristic priority rules rather than learned reasoning.
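The merge steps above can be sketched as follows. This is a simplified version of the Panoptic FPN procedure; the threshold values and the toy example are illustrative, not the paper's exact settings:

```python
import numpy as np

def heuristic_merge(instance_masks, instance_scores, semantic_map,
                    stuff_classes, overlap_thresh=0.5, min_stuff_area=64):
    """Merge instance and semantic predictions into one panoptic map."""
    h, w = semantic_map.shape
    panoptic = np.zeros((h, w), dtype=np.int64)  # 0 = unassigned
    next_id = 1
    # 1. Paste instances in descending confidence order.
    order = np.argsort(instance_scores)[::-1]
    for i in order:
        mask = instance_masks[i].astype(bool)
        if mask.sum() == 0:
            continue
        overlap = (panoptic[mask] != 0).sum()
        if overlap / mask.sum() > overlap_thresh:
            continue  # too much of this mask is already claimed
        free = mask & (panoptic == 0)
        panoptic[free] = next_id
        next_id += 1
    # 2. Fill remaining pixels with stuff predictions only.
    for cls in stuff_classes:
        region = (semantic_map == cls) & (panoptic == 0)
        if region.sum() >= min_stuff_area:  # 3. drop tiny stuff segments
            panoptic[region] = next_id
            next_id += 1
    return panoptic

# Toy example: one confident instance plus a stuff region.
sem = np.zeros((4, 4), dtype=int)
sem[:, 2:] = 5                     # class 5 ("road", say) on the right half
inst = np.zeros((1, 4, 4))
inst[0, :2, :2] = 1                # one instance in the top-left corner
pan = heuristic_merge(inst, np.array([0.9]), sem,
                      stuff_classes=[5], min_stuff_area=4)
```

Every constant here (overlap threshold, minimum area) is a hand-tuned knob – which is precisely the brittleness the text describes.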
UPSNet and Panoptic-DeepLab
UPSNet (Xiong et al., 2019) tried to improve on the heuristic merge by introducing a learnable panoptic head that takes both semantic and instance features and produces the final panoptic map via learned attention. This was an early attempt at making the merge differentiable.
Panoptic-DeepLab (Cheng et al., 2020) took a different path entirely – a bottom-up approach. Instead of detecting objects with bounding boxes, it predicted:
- Semantic segmentation (per-pixel class labels)
- Instance center heatmaps (where are the object centers?)
- Center offsets (for each pixel, a 2D vector pointing to its instance center)
Grouping pixels into instances is done by following the offset vectors to cluster pixels around predicted centers. This avoids the detector entirely, removing the need for anchor boxes, NMS, and RoI operations. The downside is that the grouping step itself can be fragile, especially for large or heavily occluded objects.
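A minimal sketch of that grouping step, assuming the center heatmap has already been reduced to a list of peak coordinates (function name and shapes are my own, not Panoptic-DeepLab's API):

```python
import numpy as np

def group_by_offsets(centers, offsets):
    """Assign each pixel to the predicted center nearest to (pixel + offset).

    centers: (K, 2) array of predicted instance-center (y, x) coordinates
    offsets: (H, W, 2) per-pixel 2D vectors pointing to the instance center
    Returns an (H, W) map of instance indices into `centers`.
    """
    h, w, _ = offsets.shape
    ys, xs = np.mgrid[0:h, 0:w]
    voted = np.stack([ys, xs], axis=-1) + offsets  # where each pixel points
    # Distance from every pixel's vote to every candidate center.
    d = np.linalg.norm(voted[:, :, None, :] - centers[None, None, :, :],
                       axis=-1)
    return d.argmin(axis=-1)  # nearest center wins

# Two centers on a 4x4 grid; zero offsets mean each pixel votes for itself.
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
offsets = np.zeros((4, 4, 2))
assign = group_by_offsets(centers, offsets)
```

The fragility mentioned above is visible even here: a pixel whose offset vector is slightly wrong simply snaps to the wrong center, with no learned mechanism to correct it.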
Despite their differences, all first-generation models shared a fundamental limitation: they treated stuff and things as separate problems requiring separate processing pipelines, with some reconciliation step at the end.
5. The Transformer Revolution: From DETR to Mask2Former
The biggest paradigm shift in panoptic segmentation came from an unexpected direction: treating segmentation as a set prediction problem.
DETR: the foundation (2020)
DETR (DEtection TRansformer, Carion et al., 2020) reimagined object detection as direct set prediction. Instead of generating thousands of candidate boxes and filtering them with NMS, DETR uses a transformer decoder with a fixed set of N learnable queries (typically N=100). Each query attends to the image features and outputs a prediction: either an object (class + bounding box) or “no object.”
The training uses Hungarian matching – a bipartite matching algorithm that finds the optimal one-to-one assignment between predictions and ground truth, then computes the loss only on matched pairs. This eliminates the need for NMS entirely.
DETR itself was an object detector, but the query-based paradigm opened a door. If each query can predict a bounding box, why not a mask?
MaskFormer: segmentation as mask classification (2021)
MaskFormer (Cheng et al., 2021) made a conceptual leap. Traditional semantic segmentation is framed as per-pixel classification – each pixel independently chooses a class. MaskFormer reframed it as mask classification – predict a set of masks, classify each mask, and compose them to get the final segmentation.
The architecture:
- A backbone (e.g., ResNet, Swin Transformer) extracts image features.
- A pixel decoder (e.g., FPN) produces high-resolution per-pixel embeddings.
- A transformer decoder takes N learnable queries and, through cross-attention with the image features, outputs N mask embeddings and N class predictions.
- Each mask is computed as a dot product between the mask embedding and the per-pixel embeddings, producing a binary mask at full resolution.
The final panoptic map is assembled by assigning each pixel to the highest-confidence mask that covers it.
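This assembly step can be sketched in a few lines of numpy (shapes follow the description above; the weighting scheme is a simplification of MaskFormer's actual inference, which also filters low-confidence and low-overlap masks):

```python
import numpy as np

def assemble_panoptic(mask_embeds, pixel_embeds, class_probs):
    """Compose per-query masks into a single segmentation map.

    mask_embeds: (N, D)    -- one embedding per query
    pixel_embeds: (H, W, D) -- per-pixel embeddings from the pixel decoder
    class_probs: (N,)       -- confidence of each query's predicted class
    """
    # Dot product -> one mask logit map per query, shape (N, H, W).
    mask_logits = np.einsum('nd,hwd->nhw', mask_embeds, pixel_embeds)
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoid
    # Weight each mask by its class confidence; each pixel goes to the winner.
    scores = class_probs[:, None, None] * mask_probs
    return scores.argmax(axis=0)  # (H, W) map of query indices

# Two queries, a 1x2 "image": each query claims the pixel aligned with it.
mask_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
pixel_embeds = np.array([[[5.0, 0.0], [0.0, 5.0]]])  # shape (1, 2, 2)
seg_map = assemble_panoptic(mask_embeds, pixel_embeds, np.array([1.0, 1.0]))
```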
The key insight: this architecture is task-agnostic. The same N queries can represent stuff segments or thing instances. There is no separate “semantic head” or “instance head” – every query works the same way, and the distinction between stuff and things only appears in the class labels. A query that predicts class “road” produces a stuff segment; a query that predicts class “car” produces a thing instance. The model itself does not need to know the difference.
Mask2Former: the current workhorse (2022)
Mask2Former (Cheng et al., 2022) refined MaskFormer with two crucial technical improvements:
- Masked attention: Instead of each query attending to all spatial locations (global cross-attention), each query only attends to the spatial region where its predicted mask from the previous decoder layer has high probability. This is more efficient and provides a strong inductive bias – a “car” query focuses on car-shaped regions, not the entire image.
- Multi-scale features: The transformer decoder processes features at multiple resolutions in sequence (1/32 -> 1/16 -> 1/8), progressively refining the masks from coarse to fine.
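Masked attention is a small change to standard cross-attention: attention logits outside the query's current mask estimate are set to negative infinity before the softmax. A self-contained numpy sketch (simplified single-head attention; the variable shapes are my own, and Mask2Former additionally falls back to full attention when a mask is empty):

```python
import numpy as np

def masked_attention(queries, keys, values, mask_probs, thresh=0.5):
    """Cross-attention where each query only attends inside its own mask.

    queries: (N, D); keys/values: (P, D) flattened image features
    mask_probs: (N, P) -- each query's predicted mask from the previous layer
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    # Block attention to locations outside the query's current mask.
    scores = np.where(mask_probs >= thresh, scores, -1e9)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ values

# One query, two locations. Location 1 has a much stronger key, but the
# query's mask covers only location 0 -- so the output stays at value 1.0.
q = np.array([[1.0]])
k = np.array([[1.0], [10.0]])
v = np.array([[1.0], [100.0]])
mp = np.array([[1.0, 0.0]])
out = masked_attention(q, k, v, mp)
```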
Mask2Former achieved state-of-the-art results on panoptic, instance, and semantic segmentation simultaneously – the first single architecture to top all three leaderboards. It established that the artificial separation between segmentation tasks was an artifact of our formulations, not a fundamental property of the problem.
Why did transformers win here?
The success of transformer-based panoptic models comes down to three properties:
- Set prediction naturally handles variable numbers of objects. The Hungarian matching loss allows the model to predict any number of segments (up to N) without requiring NMS or anchor boxes.
- Self-attention between queries enables competition. Queries can learn to specialize – one query takes the left car, another takes the right car – because they can “see” each other through self-attention.
- The query abstraction unifies stuff and things. A query is just a learned vector that gets refined through attention. It does not care whether it will end up representing a sky region or a person.
6. Open-Vocabulary Panoptic Segmentation
All models discussed so far operate in a closed-vocabulary setting: the set of recognizable classes is fixed at training time. A model trained on COCO’s 133 panoptic categories cannot segment a “skateboard ramp” or “solar panel” – it has no output neuron for those classes.
Open-vocabulary panoptic segmentation breaks this constraint by replacing the fixed classification head with a vision-language alignment mechanism.
The core idea
Instead of a learned weight matrix W of shape [num_classes, embed_dim] that maps mask embeddings to class logits, open-vocabulary models use a text encoder (typically from CLIP) to generate class embeddings on the fly:
Closed-vocab: logits = mask_embedding @ W.T (W is fixed, learned during training)
Open-vocab: logits = mask_embedding @ text_embeds.T (text_embeds come from any text)
At inference time, you can provide any list of class names as text prompts. The text encoder converts each name into an embedding vector, and classification becomes a dot product between the mask embedding and each text embedding. If the mask embedding for a segment is closest to “solar panel” in CLIP’s shared embedding space, that is its predicted class – even though the model never saw a “solar panel” during segmentation training.
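The classification step reduces to cosine similarity in the shared embedding space. A toy sketch with hand-made 3-dimensional embeddings (real CLIP embeddings are 512+ dimensional, and the temperature value here is illustrative):

```python
import numpy as np

def open_vocab_logits(mask_embedding, text_embeds, temperature=0.01):
    """Score one segment against arbitrary text prompts, CLIP-style.

    mask_embedding: (D,)   -- embedding of one predicted segment
    text_embeds: (K, D)    -- embeddings of K class-name prompts
    Both are L2-normalized so the dot product is cosine similarity.
    """
    m = mask_embedding / np.linalg.norm(mask_embedding)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (m @ t.T) / temperature  # higher = closer in the shared space

# Hypothetical embeddings: the segment lands closest to the second prompt.
seg = np.array([0.1, 0.9, 0.0])
prompts = np.array([[1.0, 0.0, 0.0],   # "car"
                    [0.0, 1.0, 0.0],   # "solar panel"
                    [0.0, 0.0, 1.0]])  # "sky"
pred = int(open_vocab_logits(seg, prompts).argmax())
```

Swapping the prompt list swaps the vocabulary – no retraining, no new output neurons.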
Key methods
ODISE (Xu et al., 2023) was one of the first strong open-vocabulary panoptic models. It leveraged an interesting insight: the internal features of a pre-trained text-to-image diffusion model (Stable Diffusion) contain rich semantic information about visual concepts. ODISE extracts features from the diffusion model’s UNet, combines them with CLIP embeddings, and feeds both into a Mask2Former-style decoder. The diffusion features capture fine-grained visual patterns, while CLIP provides the vision-language alignment needed for open-vocabulary classification.
FC-CLIP (Yu et al., 2023) showed that a much simpler approach also works well. It uses a single frozen CLIP backbone (ConvNeXt variant) for both feature extraction and classification, with a lightweight Mask2Former decoder on top. The frozen CLIP features already contain enough information for mask prediction – no diffusion model needed. FC-CLIP demonstrated that the key ingredient is a strong pre-trained vision-language backbone, not architectural complexity.
SAN (Xu et al., 2023) introduced side adapter networks that attach lightweight adapters to a frozen CLIP model, enabling it to produce dense predictions while preserving its open-vocabulary capabilities.
The training challenge
Open-vocabulary models face a bootstrapping problem: how do you train a segmentation model on a fixed dataset (e.g., COCO with 133 classes) and have it generalize to thousands of unseen classes?
The key strategies are:
- Freeze the CLIP backbone (or use very low learning rates) to preserve its open-vocabulary knowledge. If you fine-tune CLIP aggressively on 133 classes, it “forgets” the other concepts it learned from 400 million image-text pairs.
- Train the mask prediction and classification separately. The mask decoder learns to produce good spatial masks using standard segmentation data. The classification leverages CLIP’s zero-shot transfer. The two capabilities are combined at inference.
- Use large-vocabulary detection data (e.g., LVIS with 1203 classes, or Objects365) during training to expose the model to more categories, even if the panoptic annotations are limited.
Current open-vocabulary panoptic models achieve competitive performance on the training categories (within a few PQ points of closed-vocabulary specialists) while also performing well on novel categories – a significant advancement toward general-purpose scene understanding.
7. Video Panoptic Segmentation
Image-level panoptic segmentation assigns every pixel a (class, instance) label in a single frame. Video panoptic segmentation (VPS) extends this to video: the model must produce a panoptic map for every frame and maintain consistent instance IDs across time.
If person #3 appears in frame 10, the model must recognize that the same person is still person #3 in frame 50, even if they were temporarily occluded in frames 30-35. This temporal consistency requirement transforms the problem from pure segmentation into segmentation + tracking.
The VPQ metric
Kim et al. (2020) introduced Video Panoptic Quality (VPQ), which extends image-level PQ to the temporal domain. Instead of matching individual segment masks, VPQ matches segment tubes – sequences of masks belonging to the same instance across k consecutive frames.
For a predicted tube and a ground truth tube to match, their tube IoU must exceed 0.5:
tube_IoU = (sum of per-frame intersection) / (sum of per-frame union)
VPQ penalizes both poor spatial segmentation (low IoU within each frame) and ID switches (breaking a tube into multiple fragments or merging different objects). It is typically reported at multiple temporal window sizes (k = 5, 10, 15) to evaluate both short-term and long-term consistency.
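The tube IoU formula above translates directly to code; the trick is that intersections and unions are summed across frames before dividing, so dropping an instance in even one frame dilutes the whole tube's score:

```python
import numpy as np

def tube_iou(pred_masks, gt_masks):
    """IoU between two segment tubes over k frames.

    pred_masks, gt_masks: boolean arrays of shape (k, H, W); a frame where
    the instance is absent is simply an all-False mask.
    """
    inter = np.logical_and(pred_masks, gt_masks).sum()
    union = np.logical_or(pred_masks, gt_masks).sum()
    return inter / union if union > 0 else 0.0

# Two frames of a 2x2 image: perfect match in frame 0, but the prediction
# keeps one pixel in frame 1 after the ground-truth instance has left.
pred = np.zeros((2, 2, 2), dtype=bool)
gt = np.zeros((2, 2, 2), dtype=bool)
pred[0, 0, :] = True
gt[0, 0, :] = True
pred[1, 0, 0] = True   # spurious persistence in frame 1
iou = tube_iou(pred, gt)   # 2 / 3
```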
A supplementary metric, Segmentation and Tracking Quality (STQ) (Weber et al., 2021), provides an alternative decomposition that separates spatial segmentation accuracy from temporal association quality more cleanly.
Architectural approaches
Video panoptic architectures fall into three broad categories:
1. Online (frame-by-frame + association): Process each frame independently with an image-level panoptic model, then associate instances across frames using tracking. This is the simplest approach and the most memory-efficient, but temporal context is limited to the association mechanism.
- ViP-DeepLab (Qiao et al., 2021) extends Panoptic-DeepLab with a next-frame instance prediction head that directly estimates how instances move between consecutive frames.
- Tracking-by-attention methods use the transformer queries themselves as instance memory – the same query vector that represents “person #3” in frame t is carried forward and refined in frame t+1.
2. Near-online (clip-based): Process short clips (e.g., 4-8 frames) jointly to leverage temporal context within the clip, then link instances across clips.
- Video K-Net (Li et al., 2022) and Video-kMaX (Shin et al., 2024) extend the kernel/query-based framework to short clips. Instance queries attend to multi-frame features, producing temporally coherent masks within each clip.
- TarViS (Athar et al., 2023) proposes a unified architecture for multiple video segmentation tasks (VPS, VIS, VOS) using task-specific prompts with a shared transformer decoder.
3. Offline (full video): Process the entire video at once (or in long overlapping windows). This gives the most temporal context but is computationally expensive and impractical for real-time applications.
The tracking-segmentation tension
Video panoptic segmentation reveals a fundamental tension: segmentation wants per-frame accuracy (sharp boundaries, correct classification), while tracking wants temporal smoothness (consistent IDs, no flickering). These objectives can conflict – a model might produce better per-frame masks by re-segmenting from scratch each frame, but this risks ID inconsistency. Conversely, propagating masks temporally maintains IDs but can accumulate drift and boundary errors.
The best current methods address this by using query propagation with periodic re-detection – instance queries are propagated across frames for consistency, but the model also attends to fresh image features each frame to correct drift. This balance between memory (temporal propagation) and perception (per-frame extraction) is a key design axis.
Current state and open challenges
Video panoptic segmentation remains significantly harder than its image counterpart. On the Cityscapes-VPS and VIPSeg benchmarks, even the best methods leave substantial room for improvement, particularly for:
- Long-range re-identification: Recognizing an instance that was occluded for dozens or hundreds of frames
- Fast motion and deformation: Maintaining consistent IDs when objects move rapidly or change appearance (e.g., a person turning around)
- Stuff consistency: Ensuring that stuff regions like “road” maintain coherent boundaries across frames without flickering
- Efficiency: Real-time video panoptic segmentation at high resolution remains an open challenge for autonomous driving and robotics applications
Summary
The journey from separate semantic and instance segmentation to unified video panoptic segmentation follows a clear arc of increasing ambition:
- Semantic segmentation (2015+) solved per-pixel classification but could not distinguish individual objects.
- Instance segmentation (2017+) solved object-level detection and masking but ignored background regions.
- Panoptic segmentation (2019) unified both tasks with a clean formulation: every pixel gets a (class, instance) label. First-generation models bolted together existing semantic and instance pipelines with heuristic merging.
- Transformer-based unification (2020-2022) replaced the two-branch paradigm with learned queries that treat stuff and things identically. Mask2Former became the universal architecture.
- Open-vocabulary extension (2023+) replaced fixed classifiers with vision-language alignment, enabling segmentation of arbitrary categories through text prompts.
- Video extension (2020+) added temporal consistency, requiring models to track instance identities across frames – merging segmentation with multi-object tracking.
The current frontier is converging these advances: models that can perform open-vocabulary video panoptic segmentation in real time – understanding every pixel, every object, every frame, for any category you can name. This remains an open challenge, but the architectural foundations – transformer decoders with learned queries, vision-language backbones, and temporal attention – are firmly in place.
Key References
| Year | Paper | Contribution |
|---|---|---|
| 2015 | Long et al., “Fully Convolutional Networks for Semantic Segmentation” | Established FCN paradigm for dense per-pixel prediction |
| 2017 | He et al., “Mask R-CNN” | Added instance mask branch to Faster R-CNN; dominant instance segmentation approach for years |
| 2019 | Kirillov et al., “Panoptic Segmentation” | Defined the panoptic task, the stuff/things distinction, and the PQ metric |
| 2019 | Kirillov et al., “Panoptic Feature Pyramid Networks” | First competitive panoptic baseline with FPN backbone |
| 2020 | Carion et al., “End-to-End Object Detection with Transformers (DETR)” | Introduced query-based set prediction that would reshape all segmentation tasks |
| 2020 | Cheng et al., “Panoptic-DeepLab” | Box-free bottom-up panoptic segmentation |
| 2020 | Kim et al., “Video Panoptic Segmentation” | Defined the VPS task and VPQ metric |
| 2021 | Cheng et al., “Per-Pixel Classification is Not All You Need (MaskFormer)” | Reframed segmentation as mask classification; unified stuff and things architecturally |
| 2022 | Cheng et al., “Masked-attention Mask Transformer (Mask2Former)” | Masked cross-attention + multi-scale decoding; state-of-the-art across all segmentation tasks |
| 2023 | Xu et al., “Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models (ODISE)” | Leveraged diffusion model features for open-vocabulary panoptic segmentation |
| 2023 | Yu et al., “Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen CLIP (FC-CLIP)” | Simple, effective open-vocab panoptic seg with frozen ConvNeXt-CLIP |
| 2023 | Athar et al., “TarViS: A Unified Architecture for Target-based Video Segmentation” | Unified architecture for VPS, VIS, and VOS with task prompts |
| 2024 | Shin et al., “Video-kMaX” | Clip-level video panoptic segmentation with k-means cross-attention |