Video VLMs as Judges on Waymo's E2E Driving Set: A First-Principles Walkthrough
April 27, 2026
Target audience: ML practitioners with general transformer/VLM background who want to know whether off-the-shelf video vision-language models can sit in front of a human-rater queue on autonomous-vehicle scenario data — and what predictive uncertainty actually buys you in that role.
Table of Contents
- Why this question matters
- The data: Waymo Open Dataset E2E challenge
- Stitching 8 cameras into one VLM input
- The judge setup — three video VLMs, one prompt
- Analysis 1: greedy accuracy and variation under sampling
- The TU / AU / EU framework — what each one measures
- Analysis 2: per-clip predictive entropy from single-pass logits
- The triage funnel — VLM as a router for human raters
- Analysis 3: a real AU / EU split via prompt-paraphrase perturbation
- What we still can’t measure
- Key references
1. Why this question matters
Autonomous-vehicle perception teams generate driving log data faster than human annotators can label it. The standard pipeline looks like raw clip → human rater → training set, and the human is the bottleneck. A natural question: can a pretrained video VLM look at the clip first and decide what kind of scenario it shows — at minimum well enough to route clips to the right rater queue, or to flag the unusual ones for closer review?
This post walks through a small empirical study answering that question on Waymo’s End-to-End driving val set. We test three off-the-shelf video VLMs as zero-shot scenario classifiers, then ask not just are they accurate but do their confidence signals tell us anything useful. The headline finding is that only one of the three is calibrated well enough to be used as a triage signal, and getting to that conclusion requires distinguishing several different notions of “uncertainty” — which the post unpacks from first principles.
2. The data: Waymo Open Dataset E2E challenge
Waymo’s End-to-End driving dataset (WOD-E2E) is the camera-only benchmark from the 2024 challenge. Each sequence is 8 seconds of synchronized 8-camera video at 10 Hz, with ego pose and a small set of derived labels.
The label we care about for this experiment is the scenario cluster — a sequence-level tag drawn from a 10-class taxonomy that Waymo published with the challenge:
| Cluster | What it captures |
|---|---|
| Intersections (manifest typo: Interections) | ego approaches / traverses an intersection — original typo preserved everywhere downstream so labels match the published manifest |
| Foreign Object Debris | something on the road that shouldn’t be there |
| Cyclist | a cyclist is the safety-relevant agent |
| Pedestrian | a pedestrian is the safety-relevant agent |
| Multi-Lane Maneuvers | ego changes between two or more lanes |
| Single-Lane Maneuvers | within-lane behavior (slowing, stopping, smooth following) |
| Special Vehicles | emergency vehicle, school bus, etc. |
| Cut_ins | another vehicle cuts into ego’s lane |
| Construction | construction zone present |
| Others | catch-all |
Two facts about these labels matter for what follows:
- Sequence-level, not frame-level. A “Cyclist” label means somewhere in the 8 seconds, the cyclist is the safety-relevant agent. It does not mean the cyclist is visible in every frame.
- Sensor-suite-level, not single-camera-level. A “Cyclist” label does not say which camera the cyclist is in. We verified empirically (by playing back individual sequences) that cyclist sequences often have the cyclist visible only in the rear cameras (CAM_7 and CAM_8) for most of the clip, entering the front camera briefly when ego passes them. This becomes important in the next section.
We sampled 5 clips per cluster × 10 clusters = 50 stratified clips from the val set as our eval set.
3. Stitching 8 cameras into one VLM input
Most video VLMs accept a single video as input — not an arbitrary set of cameras. So we composite the 8 cameras into a single 2×4-tile video at 4 Hz, then hand that to the model:
Why we did not just use the front camera: a “Cyclist” sequence is labeled as such because across the full sensor suite, somewhere in the 8 seconds, a cyclist is the safety-relevant agent. That cyclist often appears first in the rear cameras (overtaking from behind), then in the side cameras, and only briefly in the front camera at the moment ego passes. Front-only judging would systematically miss most of the trajectory and would never see the cyclist on the clips where ego never overtakes.
Below is one of those exact cases — a cyclist visible in CAM_7 / CAM_8 (bottom-right tiles) for most of the clip, only entering the front camera near the end:
For comparison, here are three other clusters, all rendered as the same 8-camera composite:
Intersections — ego approaches an intersection:
Cut_ins — another vehicle cuts into ego’s lane:
Foreign Object Debris — object on the road:
These are the exact MP4 files passed into the VLMs in the experiments below.
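For readers who want to reproduce the tiling, here is a minimal sketch. It assumes each camera's frames are already decoded as equal-sized numpy arrays and resampled from 10 Hz to 4 Hz; `frames_by_cam`, `imageio`, and the helper names are illustrative stand-ins, not the post's actual pipeline.

```python
import imageio
import numpy as np

def composite_frame(cams: list[np.ndarray]) -> np.ndarray:
    """Tile 8 same-sized HxWx3 frames into a 2x4 grid (CAM_1..4 top, CAM_5..8 bottom)."""
    return np.vstack([np.hstack(cams[:4]), np.hstack(cams[4:])])

def write_composite(frames_by_cam: dict[int, list[np.ndarray]], path: str) -> None:
    """frames_by_cam: {camera_id: list of HxWx3 uint8 frames}, already resampled to 4 Hz."""
    n = min(len(frames) for frames in frames_by_cam.values())
    tiles = [
        composite_frame([frames_by_cam[cam][t] for cam in sorted(frames_by_cam)])
        for t in range(n)
    ]
    imageio.mimsave(path, tiles, fps=4)  # 8 s clip -> 32 composite frames
```

The sorted-camera ordering puts the rear cameras (CAM_7, CAM_8) in the bottom-right tiles, which is why the cyclist example above appears there.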
4. The judge setup — three video VLMs, one prompt
We test three off-the-shelf, open-weights video VLMs:
| Judge | Backbone | Why we picked it |
|---|---|---|
| Cosmos-Reason 2-2B | Qwen3-VL | NVIDIA’s reasoning-focused video VLM — native video input, small enough to run quickly |
| Video-LLaVA-7B | LanguageBind / Vicuna | Established baseline for video-language tasks |
| Molmo2-8B | Allen AI multimodal stack | Strong recent benchmark on video QA / grounding |
All three receive the same composite video and a prompt that lists the 10 clusters and asks for the dominant scenario. Ground truth is Waymo’s published cluster label.
One methodology bug worth naming up front
The first version of our eval set printed the cluster name in the title bar of every frame (`Cyclist | seq 0fff5ea6 | frame 5/32`). Both Cosmos and Video-LLaVA were reading the answer off the input, producing artificially high accuracy (62% and 70% respectively). After re-rendering the eval set without the title-bar text and re-running, accuracy collapsed to 20% and 10% — the leak was doing essentially all the work. The numbers reported below are all from the leak-free clean run; the leaked run is preserved as a forensic artifact.
This is the textbook eval-of-eval failure where the measurement instrument contaminates the measurement. The fix is straightforward (no signal correlated with the label appears in the input), but the right takeaway is general: when an off-the-shelf model gives suspiciously good zero-shot numbers on a domain it was not trained on, look for the leak first.
With the leak removed and a clean 50-clip eval set in hand, the next two sections walk through what each judge actually does on this data — first by looking at the answers themselves (Section 5), then by looking at how confident the model is in those answers (Sections 6 and 7).
5. Analysis 1: greedy accuracy and variation under sampling
Greedy top-1 (one forward pass per clip, deterministic)
| Judge | Top-1 vs Waymo | Distinct cluster strings predicted | Failure mode |
|---|---|---|---|
| Cosmos-Reason 2-2B | 20% (10/50) | 12 (case/spacing variants) | Defaults to single_lane_maneuvers on uncertain inputs |
| Video-LLaVA-7B | 10% (5/50) | 1 — Intersections × 50 | Constant function — copies the prompt’s example values |
| Molmo2-8B | 14% (7/50) | 3 — Intersections (32), Multi-Lane (17), Cyclist (1) | Binary default; ignores 7 of 10 categories |
Random baseline for 10-class classification is 10%. Cosmos is barely above chance, Video-LLaVA is exactly at chance, and Molmo collapses to a coarser binary than the taxonomy expects. In zero-shot terms, none of these models is usable as a labeling oracle on this dataset.
But the more interesting observation is that all three fail differently. They are not making the same mistake — Cosmos hallucinates scene reasoning, VL ignores the video and copies the prompt, Molmo bins everything into two classes. Three failures with no shared signal is qualitatively different from “three judges making correlated errors”, and it shapes what the rest of the pipeline can do (more on this in Section 8).
Variation across 10 runs at temperature 0.3
Greedy decoding gives one answer per clip. To see how much each judge wobbles when allowed to sample, we re-ran the same 50 clips with `temperature=0.3`, `do_sample=True`, and N=10 trials per clip, then looked at how often the 10 answers agreed:
| Judge | Clips where all 10 trials agree (unanimous) | Most-common modal-vote prediction |
|---|---|---|
| Cosmos-Reason 2-2B | 31 / 50 | matches greedy answer on those 31 clips |
| Molmo2-8B | 31 / 50 | matches greedy answer on those 31 clips |
| Video-LLaVA-7B | 8 / 50 | wobbles substantially on the other 42 |
Cosmos and Molmo are deterministic-by-default even under sampling: on roughly two-thirds of clips they say the same thing 10 times in a row. Video-LLaVA is the opposite — it is the most variable judge under sampling, despite being the model that produced a perfectly constant Intersections answer under greedy decoding. Sampling exposes that VL has substantial probability mass on alternative tokens that greedy decoding hides; the constant-function behavior is an artifact of the argmax, not of the underlying distribution being narrow.
This is enough to motivate the next question. We have a notion of “the model wobbled across 10 trials,” but it is sparse (binary “did it flip-flop or not” for most clips), and it does not distinguish the model is genuinely uncertain about a hard scene from the model is undertrained and randomly guessing. To separate those, we need real machinery.
6. The TU / AU / EU framework — what each one measures
When a probabilistic classifier hands us a distribution $p$ over classes, the scalar uncertainty in that distribution is its Shannon entropy, in bits:
\[H(p) = -\sum_{i=1}^{K} p_i \log_2 p_i\]
- $K$: number of classes (10 here)
- $p_i$: probability the classifier assigns to class $i$
- $H(p) \in [0, \log_2 K]$ — zero when one class has all the mass, $\log_2 K$ when the distribution is uniform
But “uncertainty” is not one thing. The standard decomposition (Houlsby et al. 2011 on BALD; Depeweg et al. 2018 on the AU/EU split) splits it into two qualitatively different sources:
\[\underbrace{H(\bar{p})}_{\text{TU}} \;=\; \underbrace{\frac{1}{N}\sum_{i=1}^{N} H(p_i)}_{\text{AU}} \;+\; \underbrace{H(\bar{p}) \;-\; \frac{1}{N}\sum_{i=1}^{N} H(p_i)}_{\text{EU}}\]
- $N$: number of stochastic forward passes (sampled completions, dropout masks, or ensemble members — whatever is producing the variability).
- $p_i$: the model’s predictive distribution on trial $i$ (a 10-dim probability vector here).
- $\bar{p} = \frac{1}{N}\sum_{i=1}^{N} p_i$: the mean predictive distribution across the $N$ trials.
- $H(p_i)$: Shannon entropy of one trial’s distribution, defined as in the equation above.
- Total uncertainty (TU) $= H(\bar{p})$. The entropy of the averaged distribution. It captures how much spread the predictive answers have in aggregate.
- Aleatoric uncertainty (AU) $= \frac{1}{N}\sum_i H(p_i)$. The average per-trial entropy. This is uncertainty that is intrinsic to the input — when even a single confident answer would have spread mass across multiple plausible classes, AU is high. AU is what you get when the scene itself is genuinely ambiguous (cyclist on the edge of a multi-lane maneuver — both labels are defensible).
- Epistemic uncertainty (EU) $= \text{TU} - \text{AU}$. By the math this is the mutual information between the prediction and the model’s randomness. It captures the model disagreeing with itself across trials — different trials give different confident answers — which is the signature of a model that lacks the knowledge to commit. EU is what you get when more training data would shrink the uncertainty.
The decomposition is useful because the two sources demand different responses: high AU means the labeler will probably disagree with themselves too — defer to consensus or accept that this clip is ambiguous; high EU means the model needs to learn more — collect more training data of this type.
There is a crucial caveat in our setup. The decomposition only works when each $p_i$ is a full distribution, not a single sampled class. Section 7 explains why this matters under the cheap single-method setup, and Section 9 then resolves it by running $N=8$ full-distribution forward passes per clip with prompt-paraphrase perturbation.
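For concreteness, here is a minimal NumPy sketch of the decomposition. Section 9 references a `decompose(p_list)` in `analysis/uncertainty.py`; this illustrative version computes the same three quantities but is not necessarily that exact implementation.

```python
import numpy as np

def decompose(p_list: np.ndarray) -> tuple[float, float, float]:
    """TU / AU / EU in bits. p_list has shape (N, K): one full K-class
    distribution per stochastic forward pass (full distributions, not one-hot samples)."""
    def ent(p):  # Shannon entropy in bits, averaged over any leading axes
        return float((-(p * np.log2(p + 1e-12)).sum(axis=-1)).mean())
    tu = ent(p_list.mean(axis=0))  # entropy of the mean distribution
    au = ent(p_list)               # mean of the per-trial entropies
    return tu, au, tu - au         # EU = TU - AU >= 0 by Jensen's inequality
```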
7. Analysis 2: per-clip predictive entropy from single-pass logits
There are two ways to get the $N$ stochastic predictions the decomposition needs. The first — sample-and-count — runs $N$ forward passes with temperature > 0 and takes the sampled token from each. The second — single-pass logits — runs one forward pass with greedy decoding and reads the model’s full softmax distribution at the answer position.
The two approaches are not interchangeable for the AU/EU split:
- In sample-and-count, each trial produces one sampled class. As a distribution, that single answer is one-hot — $[0, 0, 1, 0, \ldots, 0]$ — and the entropy of any one-hot distribution is zero. So $\mathbb{E}_i\!\left[H(p_i)\right] = 0$, $\text{AU} = 0$, and $\text{EU} = \text{TU}$ identically. We can measure TU, but the decomposition collapses.
- In single-pass logits, we get the model’s actual softmax distribution at the answer position from one forward pass. This is one full $p$ per clip — no $N$, no average. We can compute $H(p)$ directly as the per-clip predictive entropy, which is a real continuous signal.
Both methods give us something, but neither gives us a real AU/EU split from a single-method run. (To get that, you need $N$ forward passes and full per-trial distributions — for example, $N$ greedy passes with input perturbation, which is what Section 9 then runs.) For this section we ran the single-pass logit version because (a) it is roughly 16× cheaper in GPU time and (b) the per-clip $H(p)$ it produces is a denser, more usable signal than the sparse vote-distribution TU from $N=10$ sampling.
How we extracted the logits
```python
import numpy as np
import torch

out = model.generate(
    **inputs, max_new_tokens=1,
    output_scores=True, return_dict_in_generate=True,
    do_sample=False,  # greedy
)
logits = out.scores[0][0]                # full-vocab logits at the answer position
class_logits = logits[letter_token_ids]  # restrict to A..J (the 10 cluster letters)
p = torch.softmax(class_logits.float(), dim=0).cpu().numpy()
H = -(p * np.log2(p + 1e-12)).sum()      # bits, ∈ [0, log2(10) ≈ 3.32]
```
The prompt is reformulated as multi-choice (each cluster gets a letter A-J, the model is asked to answer with one letter) so that the answer position is a single token. We restrict the vocab-sized logit vector to the 10 letter-token IDs and softmax to get a clean 10-class distribution.
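The `letter_token_ids` used above can be built once per model. A sketch, where `tokenizer` is whichever HF tokenizer the judge ships with; the single-token assertion matters, since some tokenizers split letters or prepend spaces, which this assumption does not handle:

```python
# Build once per model: map answer letters "A".."J" to single-token vocab IDs.
letter_token_ids = []
for ch in [chr(ord("A") + i) for i in range(10)]:
    ids = tokenizer.encode(ch, add_special_tokens=False)
    assert len(ids) == 1, f"{ch!r} does not encode to a single token"
    letter_token_ids.append(ids[0])
```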
Headline numbers (50 clips × 3 judges, single-pass)
| Judge | Top-1 acc | Mean $H(p)$ | Mean top-class prob | $H(p)$ on correct | $H(p)$ on wrong | Escalation $\Delta$ |
|---|---|---|---|---|---|---|
| Cosmos-Reason 2-2B | 20% | 1.700 bits | 0.566 | 1.437 | 1.765 | +0.329 ✓ informative |
| Video-LLaVA-7B | 10% | 0.374 bits | 0.949 | 0.349 | 0.377 | +0.027 ≈ noise |
| Molmo2-8B | 14% | 1.208 bits | 0.718 | 1.352 | 1.185 | −0.167 ✗ anti-informative |
The escalation signal $\Delta$ is mean H(p) on wrong predictions − mean H(p) on correct predictions. If positive, the model is more uncertain when it is wrong — useful as a “flag for human review” signal. If negative, the model is more confident when wrong — actively misleading as a confidence proxy.
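As code, the signal is one subtraction, assuming hypothetical per-clip arrays `per_clip_entropy` and `is_correct` collected over the 50 clips:

```python
import numpy as np

H_bits = np.asarray(per_clip_entropy)         # shape (50,): H(p) from the snippet above
correct = np.asarray(is_correct, dtype=bool)  # shape (50,): top-1 matches Waymo label
delta = H_bits[~correct].mean() - H_bits[correct].mean()
# delta > 0: the judge is more uncertain when wrong, i.e. usable as a review flag
```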
Per-cluster and distributional view




Three observations worth knowing
Cosmos’s $H(p)$ is a real continuous escalation signal. The +0.329-bit gap between wrong and correct predictions is comfortably above noise — wrong predictions sit on a noticeably wider distribution than correct ones (see middle panel above). And critically, the $H(p)$ from single-pass logits is continuous: every clip gets a real-valued entropy, so production code can do `if H > 1.6: send to human` rather than the binary `did_it_flip_flop_under_sampling` you would get from a vote-distribution proxy on $N$ sampled trials. A Cosmos-based learned evaluator can use $H(p)$ as a graded confidence score that actually orders clips by risk.
Video-LLaVA is severely overconfident. Mean top-class probability is 0.95 with top-1 accuracy of 10%. The reliability diagram makes this concrete — VL’s predictions live almost entirely in the 0.9-1.0 confidence bin, where the empirical accuracy is also ~10%. Any downstream pipeline that gated on top-class probability would over-trust this judge by an order of magnitude. The escalation signal $\Delta = +0.027$ is essentially noise — VL’s wrong and correct predictions are equally low-entropy.
Molmo’s $H(p)$ goes the wrong way. Mean $H(p)$ on correct clips (1.35) is higher than mean $H(p)$ on wrong clips (1.19). Two readings, both bad for production use: either the correct answers happen on hard clips where Molmo is correctly uncertain (and just gets lucky on the argmax), or the multi-choice letter-mapping interacts with Molmo’s tokenizer in a way that distorts the softmax. Either way, Molmo’s $H(p)$ cannot be used as an escalation signal — gating on H > threshold would systematically suppress correct answers.
The general lesson: per-judge calibration must be measured, not assumed. Three judges, three completely different relationships between predicted confidence and empirical accuracy. A learned-eval framework that batched these 3 models behind a generic “if confidence high, accept” rule would silently produce a bias-amplifying pipeline.
8. The triage funnel — VLM as a router for human raters
The point of running this experiment is not to replace human raters with VLMs — none of the three judges is accurate enough for that, and even if they were, the AV labeling problem is too high-stakes to hand to a 20%-accurate model. The point is to use the VLM as a router that splits the incoming clip stream into the right downstream queue:
The three branches map to three different downstream costs:
- Auto-bin — low $H(p)$ AND all 3 judges agree → batch-accept the cluster label without human review. The cheapest route. From our 50-clip sample this would catch the easy intersections and the unambiguous cyclists, freeing rater time for the genuinely hard clips. (In our run, 0 of 50 clips had all-3-agree-and-correct, but that’s a function of how bad these particular zero-shot judges are; a fine-tuned Cosmos would dramatically improve this.)
- Human review — high $H(p)$ OR judges disagree → send to a rater. Standard cost. This is the case where the VLMs admit uncertainty (or contradict each other), which is exactly when human judgment is most valuable.
- Novel-scenario candidate — high $H(p)$ AND none of the judges is confident AND the predicted distribution is roughly uniform across multiple non-default classes → flag as potentially out-of-taxonomy. This is the most interesting bucket. If the VLMs collectively give up — none of them lock onto a confident prediction and the class distribution looks like the model is groping — it is a hint that the clip might not fit any of the existing 10 categories cleanly. Routing those clips to a taxonomy-review queue (rather than to a regular rater) lets the dataset evolve.
The calibration finding from Section 7 is the gating constraint on this design. Only Cosmos’s $H(p)$ is monotonic with correctness. Video-LLaVA’s confidence is meaningless (95% confident at 10% accuracy), and Molmo’s $H(p)$ goes the wrong way. So the practical funnel uses Cosmos’s $H(p)$ as the primary uncertainty gate, with the other two judges contributing as cross-judge agreement checks rather than as confidence sources. Routing rules cannot be uniform across judges — they must be derived per judge from a calibration set.
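A toy sketch of that three-way rule, gating on Cosmos's entropy only (the thresholds are illustrative placeholders, not calibrated values; in practice they come from a per-judge calibration set, per Section 7):

```python
def route(h_cosmos: float, preds: dict[str, str],
          h_gate: float = 1.6, near_uniform: float = 3.0) -> str:
    """h_cosmos: Cosmos per-clip H(p) in bits; preds: judge name -> predicted cluster."""
    judges_agree = len(set(preds.values())) == 1
    if h_cosmos < h_gate and judges_agree:
        return "auto-bin"                  # batch-accept without human review
    if h_cosmos > near_uniform and not judges_agree:
        return "novel-scenario-candidate"  # near-uniform H(p): possibly out-of-taxonomy
    return "human-review"                  # default: regular rater queue
```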
9. Analysis 3: a real AU / EU split via prompt-paraphrase perturbation
Section 7 produced one full softmax distribution per clip — enough for a per-clip $H(p)$, but not enough for the AU / EU decomposition (which needs $N$ distributions). To get a real split we need multiple full distributions per clip — same model, same letter-mapping, different something. The natural choices are: perturb the visual input ($N$ different temporal samplings of the video, or $N$ camera-subsets) or perturb the language input ($N$ paraphrases of the question). We picked prompt paraphrasing because it is uniform across all 3 judges, requires zero video work, and is honestly scoped — it measures language-side sensitivity of the prediction. Visual-side perturbation is a follow-up.
Setup
For each clip × judge × N=8 prompt phrasings (each a different framing of the same multi-choice question, with the same A-J letter→cluster mapping appended verbatim), we run one greedy forward pass and capture the full 10-class softmax distribution. We then compute the decomposition above on that list of 8 distributions.
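Concretely, the per-clip loop looks like the sketch below. `paraphrases` and `build_inputs` are hypothetical stand-ins; the logit extraction is the Section 7 snippet, and `decompose` is the function verified next.

```python
import numpy as np
import torch

p_list = []
for question in paraphrases:                # 8 phrasings, same A-J mapping appended
    inputs = build_inputs(video, question)  # hypothetical video + prompt packing helper
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False,
                         output_scores=True, return_dict_in_generate=True)
    class_logits = out.scores[0][0][letter_token_ids]
    p_list.append(torch.softmax(class_logits.float(), dim=0).cpu().numpy())
tu, au, eu = decompose(np.stack(p_list))    # (8, 10) -> the real AU / EU split
```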
Verification gate
Before applying to real data, the decomposition function (decompose(p_list) in analysis/uncertainty.py) is gated by 9 new tests added to the existing 21-test suite — covering unanimous one-hots ($\text{TU} = \text{AU} = \text{EU} = 0$), consistent uniform ($\text{TU} = \text{AU} = \log_2 K$, $\text{EU} = 0$ — pure aleatoric), disagreeing one-hots ($\text{AU} = 0$, $\text{EU} = \text{TU}$ — pure epistemic), the identity $\text{TU} = \text{AU} + \text{EU}$ on random Dirichlet samples, and Jensen’s inequality $\text{AU} \le \text{TU}$. All 30 tests pass before this section’s numbers exist.
On real data: identity holds to within $10^{-9}$ for all 50 × 3 = 150 decompositions, and AU is strictly positive on all 150 clips (sanity-checks that paraphrasing produced meaningful per-trial variation).
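Two of those gate cases, sketched against the `decompose` signature above (illustrative, not the repo's exact tests):

```python
import numpy as np

def test_disagreeing_one_hots_are_pure_epistemic():
    # Each trial fully confident on a different class: AU = 0, EU = TU.
    tu, au, eu = decompose(np.eye(10))
    assert np.isclose(au, 0.0, atol=1e-9) and np.isclose(eu, tu)

def test_consistent_uniform_is_pure_aleatoric():
    # Every trial uniform over K = 10 classes: TU = AU = log2(10), EU = 0.
    tu, au, eu = decompose(np.full((8, 10), 0.1))
    assert np.isclose(tu, np.log2(10), atol=1e-6) and np.isclose(eu, 0.0, atol=1e-9)
```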
Headline numbers (50 clips × 3 judges, N=8 paraphrases per clip)
| Judge | Modal acc | Mean TU | Mean AU | Mean EU | AU escalation $\Delta$ | EU escalation $\Delta$ |
|---|---|---|---|---|---|---|
| Cosmos-Reason 2-2B | 22% | 2.152 bits | 2.065 | 0.087 | +0.243 | +0.011 |
| Video-LLaVA-7B | 10% | 0.410 bits | 0.391 | 0.019 | +0.045 | +0.004 |
| Molmo2-8B | 12% | 1.591 bits | 1.449 | 0.141 | +0.111 | −0.030 |
Two facts jump out and shape the rest of the section:
- AU dominates EU by roughly 10× for all three judges. Almost all of the per-clip uncertainty under prompt paraphrasing is the model spreading mass within each phrasing’s distribution, not phrasings disagreeing with each other.
- Cosmos’s AU is huge — 2.07 bits, which is 62% of the maximum possible entropy $\log_2 10 \approx 3.32$ bits. The model is genuinely hedging on each individual answer.
What this means: the model is consistent, but consistently uncertain
The two regimes that the AU / EU decomposition is designed to separate look like this:
- Low EU = the model gives the same kind of distribution regardless of which phrasing you use. The mean across 8 phrasings doesn’t differ much from any single phrasing. Operationally: the model is robust to prompt choice.
- High AU = each individual phrasing’s distribution is itself spread across multiple classes. The model is hedging within every single forward pass. Operationally: the model thinks the scene supports multiple labels.
Putting those together: all 3 judges are consistent across phrasings (low EU), but Cosmos and Molmo are individually uncertain (high AU); Video-LLaVA is individually narrow (low AU) and consistently narrow across phrasings (low EU) — i.e., consistently confidently-wrong. This corroborates the Section 7 finding (VL was 95% confident at 10% accuracy) through a completely different lens: VL’s narrowness is not an artifact of greedy decoding, and it is not paraphrasing-sensitive — the model just locks in.
Per-cluster picture

For Cosmos, mean AU is high across nearly every cluster — the model doesn’t have one “easy” cluster type and one “hard” type, it’s broadly hedged. EU is tiny everywhere. Same shape for Molmo (just at lower magnitude). VL is the odd one out — both AU and EU are near zero on every cluster, which is the calibration story re-told.
AU vs EU per clip — the operational quadrant

The (AU, EU) quadrant has four operational meanings (high or low for each axis). Empirically these three judges live in only two of the four:
- Cosmos and Molmo: high AU, low EU — “the model is consistently saying ‘this scene is ambiguous to me’.” Under prompt paraphrasing, this is the dominant regime.
- Video-LLaVA: low AU, low EU — “the model is consistently saying ‘I am sure’.” Combined with 10% accuracy, this is the worst possible calibration.
- (High EU, low AU — “the model is confident on each pass but they disagree” — is empty under prompt paraphrasing for these models. To populate that quadrant we would need visual perturbation, where we expect EU to grow.)
AU and EU as escalation signals — which is the better wrong-answer flag?

The AU row has signal — Cosmos’s wrong predictions sit on a noticeably wider distribution than its correct ones (+0.243 bits, the strongest escalation $\Delta$ in this whole study). Molmo also shows a positive AU $\Delta$ (+0.111 bits), so Molmo’s AU is more usable as an escalation signal than its Section 7 $H(p)$ was (which actually went the wrong way at −0.167). The EU row is essentially noise for all three judges, which makes operational sense given the paraphrase-consistency finding.
How this changes the production routing rule
In Section 8 we routed clips by Cosmos’s $H(p)$ alone because it was the only signal that monotonically tracked correctness. Analysis 3 lets us do better: route by AU, with EU as a secondary “did the model give different confident answers under paraphrasing?” check that flags clips for prompt-robustness review specifically. The routing rule from Section 8 becomes:
| Bucket | Trigger | Downstream |
|---|---|---|
| Auto-bin | low AU AND all 3 judges’ modal predictions agree | batch-accept the cluster label |
| Standard human review | high AU OR judges disagree | regular rater queue |
| Prompt-robustness review | high EU (rare under paraphrasing; expected to be more common once visual perturbation is added) | prompt-design audit + senior rater |
| Novel-scenario candidate | high AU AND no judge confident AND $\bar{p}$ is roughly uniform | taxonomy-review queue |
The Section 8 funnel still applies; Analysis 3 just splits the “high-uncertainty” branch into AU-driven and EU-driven sub-branches, giving the human rater more information about why the VLM is uncertain.
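The updated rule as a sketch, with the same caveats as the Section 8 version (thresholds illustrative, to be fit per judge on a calibration set):

```python
def route_v2(au: float, eu: float, preds: dict[str, str],
             au_gate: float = 1.6, eu_gate: float = 0.3,
             near_uniform: float = 3.0) -> str:
    """au / eu: Cosmos's per-clip decomposition in bits; preds: judge -> modal prediction."""
    judges_agree = len(set(preds.values())) == 1
    if eu > eu_gate:
        return "prompt-robustness-review"   # rare under paraphrasing (table above)
    if au < au_gate and judges_agree:
        return "auto-bin"
    if au > near_uniform and not judges_agree:
        return "novel-scenario-candidate"
    return "standard-human-review"
```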
What this round explicitly doesn’t measure
This round measures language-side uncertainty only. The follow-up is visual-side perturbation: re-render each clip with $N$ different temporal samplings (different 8-second windows of the same scene) or $N$ different camera dropouts (front-only, rear-only, side-only). Under visual perturbation we expect EU to grow — the prediction may flip when the cyclist drops out of CAM_7+CAM_8 — and we can compare that EU against the AU we measured here. The clean version of the operational story is “AU from prompt paraphrasing + EU from visual perturbation”, and this round only delivers half of that.
10. What we still can’t measure
Two known gaps remain after Analysis 3:
Cross-judge ensemble calibration. Section 8’s design uses three judges as cross-checks, but we have not measured whether ensemble agreement (e.g. all-3-confidently-agree) is itself a calibrated signal. It probably isn’t on this small sample — and given that all 3 judges fail differently, an ensemble may be no better calibrated than any single one. Worth running on a ≥500-clip set before deploying anything.
No fine-tuning baseline. All three judges are zero-shot. The natural comparison is Cosmos with LoRA fine-tuning on a 200-clip Waymo-labeled subset — based on similar literature, expect a 30-50 percentage-point lift on top-1 accuracy. The interesting question is whether fine-tuning also improves the calibration of $H(p)$ and AU, or whether the model just gets more confidently wrong.
11. Key references
| Year | Paper / Resource | Relevance |
|---|---|---|
| 2011 | Houlsby et al., Bayesian Active Learning by Disagreement (BALD) | The mutual-information formulation of epistemic uncertainty used in the TU = AU + EU decomposition |
| 2017 | Kendall & Gal, What Uncertainties Do We Need in Bayesian Deep Learning? | Canonical reference for the aleatoric / epistemic split in deep models |
| 2017 | Guo et al., On Calibration of Modern Neural Networks | Background on reliability diagrams and the kind of post-hoc recalibration (Platt scaling, isotonic) that would be the next step for any of these three judges before production use |
| 2018 | Depeweg et al., Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-Sensitive Learning | The exact AU/EU decomposition we used |
| 2024 | Waymo, End-to-End Driving Dataset / Challenge | The dataset and the 10-cluster taxonomy |
| 2024 | LanguageBind / Lin et al., Video-LLaVA | One of the three judges |
| 2024 | NVIDIA, Cosmos-Reason 2 | One of the three judges (Qwen3-VL-based) |
| 2024 | Allen AI, Molmo / Molmo2 | One of the three judges |