Video VLMs as Judges on Waymo's E2E Driving Set: A First-Principles Walkthrough

Target audience: ML practitioners with general transformer/VLM background who want to know whether off-the-shelf video vision-language models can sit in front of a human-rater queue on autonomous-vehicle scenario data — and what predictive uncertainty actually buys you in that role.


Table of Contents

  1. Why this question matters
  2. The data: Waymo Open Dataset E2E challenge
  3. Stitching 8 cameras into one VLM input
  4. The judge setup — three video VLMs, one prompt
  5. Analysis 1: greedy accuracy and variation under sampling
  6. The TU / AU / EU framework — what each one measures
  7. Analysis 2: per-clip predictive entropy from single-pass logits
  8. The triage funnel — VLM as a router for human raters
  9. Analysis 3: a real AU / EU split via prompt-paraphrase perturbation
  10. What we still can’t measure
  11. Key references

1. Why this question matters

Autonomous-vehicle perception teams generate driving log data faster than human annotators can label it. The standard pipeline looks like raw clip → human rater → training set, and the human is the bottleneck. A natural question: can a pretrained video VLM look at the clip first and decide what kind of scenario it shows — at minimum well enough to route clips to the right rater queue, or to flag the unusual ones for closer review?

This post walks through a small empirical study answering that question on Waymo’s End-to-End driving val set. We test three off-the-shelf video VLMs as zero-shot scenario classifiers, then ask not just are they accurate but do their confidence signals tell us anything useful. The headline finding is that only one of the three is calibrated well enough to be used as a triage signal, and getting to that conclusion requires distinguishing several different notions of “uncertainty” — which the post unpacks from first principles.


2. The data: Waymo Open Dataset E2E challenge

Waymo’s End-to-End driving dataset (WOD-E2E) is the camera-only benchmark from the 2024 challenge. Each sequence is 8 seconds of synchronized 8-camera video at 10 Hz, with ego pose and a small set of derived labels.

The label we care about for this experiment is the scenario cluster — a sequence-level tag drawn from a 10-class taxonomy that Waymo published with the challenge:

| Cluster | What it captures |
|---|---|
| Intersections (manifest typo: Interections) | ego approaches / traverses an intersection — original typo preserved everywhere downstream so labels match the published manifest |
| Foreign Object Debris | something on the road that shouldn’t be there |
| Cyclist | a cyclist is the safety-relevant agent |
| Pedestrian | a pedestrian is the safety-relevant agent |
| Multi-Lane Maneuvers | ego changes between two or more lanes |
| Single-Lane Maneuvers | within-lane behavior (slowing, stopping, smooth following) |
| Special Vehicles | emergency vehicle, school bus, etc. |
| Cut_ins | another vehicle cuts into ego’s lane |
| Construction | construction zone present |
| Others | catch-all |

Two facts about these labels matter for what follows:

  1. Sequence-level, not frame-level. A “Cyclist” label means somewhere in the 8 seconds, the cyclist is the safety-relevant agent. It does not mean the cyclist is visible in every frame.
  2. Sensor-suite-level, not single-camera-level. A “Cyclist” label does not say which camera the cyclist is in. We verified empirically (by playing back individual sequences) that cyclist sequences often have the cyclist visible only in the rear cameras (CAM_7 and CAM_8) for most of the clip, entering the front camera briefly when ego passes them. This becomes important in the next section.

We sampled 5 clips per cluster × 10 clusters = 50 stratified clips from the val set as our eval set.
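The stratified draw is simple enough to pin down in code. A sketch, assuming the val manifest has been grouped into per-cluster lists of clip IDs (the helper name and seed are placeholders, not part of any Waymo tooling):

```python
import random

def stratified_sample(clips_by_cluster, per_cluster=5, seed=0):
    """Draw a fixed number of clips from each cluster pool.
    Hypothetical helper; the actual clip IDs come from the WOD-E2E val manifest."""
    rng = random.Random(seed)
    return {cluster: sorted(rng.sample(clips, per_cluster))
            for cluster, clips in clips_by_cluster.items()}
```

Fixing the seed keeps the 50-clip eval set stable across the re-renders described later.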


3. Stitching 8 cameras into one VLM input

Most video VLMs accept a single video as input — not an arbitrary set of cameras. So we composite the 8 cameras into a single 2×4-tile video at 4 Hz, then hand that to the model:

8-camera composite layout showing four forward-facing tiles in row 1 (FRONT_LEFT, FRONT, FRONT_RIGHT, SIDE_LEFT) and four mixed tiles in row 2 (SIDE_RIGHT, CAM_6, CAM_7 rear, CAM_8 rear), with a dashed cyclist trajectory traced from CAM_7 to CAM_8 to FRONT and a small ego-vehicle marker in the center.

Why we did not just use the front camera: a “Cyclist” sequence is labeled as such because across the full sensor suite, somewhere in the 8 seconds, a cyclist is the safety-relevant agent. That cyclist often appears first in the rear cameras (overtaking from behind), then in the side cameras, and only briefly in the front camera at the moment ego passes. Front-only judging would systematically miss most of the trajectory and would never see the cyclist on the clips where ego never overtakes.
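The tiling step itself is a few lines of array work. A sketch under stated assumptions: all cameras are resized to a common frame size first, and the camera order (row-major, front cameras in row 1) is illustrative; the real pipeline also downsamples to 4 Hz and re-encodes to MP4:

```python
import numpy as np

def composite_frame(frames, rows=2, cols=4):
    """Tile same-sized camera frames of shape (H, W, 3) into one rows x cols mosaic.
    Camera order determines tile position: row-major, front cameras first."""
    assert len(frames) == rows * cols
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid
```

Applied per timestep across the 8-second clip, this produces the single video stream the VLMs consume.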

Below is one of those exact cases — a cyclist visible in CAM_7 / CAM_8 (bottom-right tiles) for most of the clip, only entering the front camera near the end:

For comparison, here are three other clusters, all rendered as the same 8-camera composite:

Intersections — ego approaches an intersection:

Cut_ins — another vehicle cuts into ego’s lane:

Foreign Object Debris — object on the road:

These are the exact MP4 files passed into the VLMs in the experiments below.


4. The judge setup — three video VLMs, one prompt

We test three off-the-shelf, open-weights video VLMs:

| Judge | Backbone | Why we picked it |
|---|---|---|
| Cosmos-Reason 2-2B | Qwen3-VL | NVIDIA’s reasoning-focused video VLM — native video input, small enough to run quickly |
| Video-LLaVA-7B | LanguageBind / Vicuna | Established baseline for video-language tasks |
| Molmo2-8B | Allen AI multimodal stack | Strong recent benchmarks on video QA / grounding |

All three receive the same composite video and a prompt that lists the 10 clusters and asks for the dominant scenario. Ground truth is Waymo’s published cluster label.

Single-pass logit-extraction pipeline: composite 8-camera video flows into the VLM, the multi-choice prompt is appended, the model's logits at the answer position are restricted to the 10 letter-token IDs A-J, softmaxed into a 10-class distribution, then summarized as a per-clip predictive entropy H of p.

One methodology bug worth naming up front

The first version of our eval set printed the cluster name in the title bar of every frame (Cyclist | seq 0fff5ea6 | frame 5/32). Both Cosmos and Video-LLaVA were reading the answer off the input, producing artificially high accuracy (62% and 70% respectively). After re-rendering the eval set without the title-bar text and re-running, accuracy collapsed to 20% and 10% — the leak was doing essentially all the work. The numbers reported below are all from the leak-free clean run; the leaked run is preserved as a forensic artifact.

This is the textbook eval-of-eval failure where the measurement instrument contaminates the measurement. The fix is straightforward (no signal correlated with the label appears in the input), but the right takeaway is general: when an off-the-shelf model gives suspiciously good zero-shot numbers on a domain it was not trained on, look for the leak first.

With the leak removed and a clean 50-clip eval set in hand, the next two sections walk through what each judge actually does on this data — first by looking at the answers themselves (Section 5), then by looking at how confident the model is in those answers (Sections 6 and 7).


5. Analysis 1: greedy accuracy and variation under sampling

Greedy top-1 (one forward pass per clip, deterministic)

| Judge | Top-1 vs Waymo | Distinct cluster strings predicted | Failure mode |
|---|---|---|---|
| Cosmos-Reason 2-2B | 20% (10/50) | 12 (case/spacing variants) | Defaults to single_lane_maneuvers on uncertain inputs |
| Video-LLaVA-7B | 10% (5/50) | 1 — Intersections × 50 | Constant function — copies the prompt’s example values |
| Molmo2-8B | 14% (7/50) | 3 — Intersections (32), Multi-Lane (17), Cyclist (1) | Binary default; ignores 7 of 10 categories |

Random baseline for 10-class classification is 10%. Cosmos is barely above chance, Video-LLaVA is chance, Molmo collapses to a coarser binary than the taxonomy expects. In zero-shot terms, none of these models is usable as a labeling oracle on this dataset.

But the more interesting observation is that all three fail differently. They are not making the same mistake — Cosmos hallucinates scene reasoning, VL ignores the video and copies the prompt, Molmo bins everything into two classes. Three failures with no shared signal is qualitatively different from “three judges making correlated errors”, and it shapes what the rest of the pipeline can do (more on this in Section 8).

Variation across 10 runs at temperature 0.3

Greedy decoding gives one answer per clip. To see how much each judge wobbles when allowed to sample, we re-ran the same 50 clips with temperature=0.3, do_sample=True, N=10 trials per clip, and looked at how often the 10 answers agreed:

| Judge | Unanimous clips (all 10 trials agree) | Modal-vote behavior |
|---|---|---|
| Cosmos-Reason 2-2B | 31 / 50 | matches the greedy answer on those 31 clips |
| Molmo2-8B | 31 / 50 | matches the greedy answer on those 31 clips |
| Video-LLaVA-7B | 8 / 50 | wobbles substantially on the other 42 |

Cosmos and Molmo are deterministic-by-default even under sampling: on roughly two-thirds of clips they say the same thing 10 times in a row. Video-LLaVA is the opposite — it is the most variable judge under sampling, despite being the model that produced a perfectly constant Intersections answer under greedy decoding. Sampling exposes that VL has substantial probability mass on alternative tokens that greedy decoding hides; the constant-function behavior is an artifact of the argmax, not of the underlying distribution being narrow.
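The wobble statistic is just a modal-vote count over the N sampled answers per clip. A minimal sketch:

```python
from collections import Counter

def trial_agreement(answers):
    """Summarize N sampled answers for one clip: modal prediction,
    modal fraction, and whether all trials agreed."""
    votes = Counter(answers)
    modal, count = votes.most_common(1)[0]
    return {"modal": modal,
            "modal_frac": count / len(answers),
            "unanimous": count == len(answers)}
```

Run per clip, `unanimous` gives the 31/50 vs 8/50 counts in the table, and `modal` gives the modal-vote prediction compared against greedy.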

This is enough to motivate the next question. We have a notion of “the model wobbled across 10 trials,” but it is sparse (binary “did it flip-flop or not” for most clips), and it does not distinguish the model is genuinely uncertain about a hard scene from the model is undertrained and randomly guessing. To separate those, we need real machinery.


6. The TU / AU / EU framework — what each one measures

When a probabilistic classifier hands us a distribution $p$ over classes, the scalar uncertainty in that distribution is its Shannon entropy, in bits:

\[H(p) = -\sum_{i=1}^{K} p_i \log_2 p_i\]

But “uncertainty” is not one thing. The standard decomposition (Houlsby et al. 2011 on BALD; Depeweg et al. 2018 on the AU/EU split) splits it into two qualitatively different sources:

\[\underbrace{H(\bar{p})}_{\text{TU}} \;=\; \underbrace{\frac{1}{N}\sum_{i=1}^{N} H(p_i)}_{\text{AU}} \;+\; \underbrace{H(\bar{p}) \;-\; \frac{1}{N}\sum_{i=1}^{N} H(p_i)}_{\text{EU}}\]

The decomposition is useful because the two sources demand different responses: high AU means the labeler will probably disagree with themselves too — defer to consensus or accept that this clip is ambiguous; high EU means the model needs to learn more — collect more training data of this type.

There is a crucial caveat in our setup. The decomposition only works when each $p_i$ is a full distribution, not a single sampled class. Section 7 explains why this matters under the cheap single-method setup, and Section 9 then resolves it by running $N=8$ full-distribution forward passes per clip with prompt-paraphrase perturbation.
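A minimal sketch of the decomposition over $N$ per-trial distributions, in bits (the function name mirrors the decompose(p_list) helper exercised later in Section 9; this standalone version assumes each $p_i$ is a length-$K$ probability vector):

```python
import numpy as np

def decompose(p_list):
    """TU/AU/EU (bits) from N full K-class distributions, one per trial.
    TU = H(mean p), AU = mean of per-trial H(p_i), EU = TU - AU."""
    P = np.asarray(p_list, dtype=float)                        # shape (N, K)
    H = lambda q: float(-(q * np.log2(q + 1e-12)).sum(axis=-1))
    tu = H(P.mean(axis=0))
    au = float(np.mean([H(p) for p in P]))
    return tu, au, tu - au
```

The edge cases match the intuition: identical one-hots give TU = AU = EU = 0; disagreeing one-hots give AU = 0 and EU = TU (pure epistemic); a consistently uniform model gives AU = TU (pure aleatoric).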


7. Analysis 2: per-clip predictive entropy from single-pass logits

There are two ways to get the $N$ stochastic predictions the decomposition needs. The first — sample-and-count — runs $N$ forward passes with temperature > 0 and takes the sampled token from each. The second — single-pass logits — runs one forward pass with greedy decoding and reads the model’s full softmax distribution at the answer position.

Side-by-side comparison: sample-and-count (left, blue) shows 10 one-hot trial bars stacked vertically with their aggregate vote distribution and the formulas TU = H of p-bar, AU = expectation of H of p_i = 0 highlighted in red, EU = TU; single-pass logits (right, green) shows one continuous 10-class softmax distribution and the formula H(p) per clip.

The two approaches are not interchangeable for the AU/EU split. Sample-and-count yields $N$ one-hot per-trial “distributions”: each $H(p_i) = 0$, so the decomposition degenerates to AU = 0 and EU = TU, and TU itself is just the entropy of the vote histogram. Single-pass logits yield one full distribution, but only one, so there is no across-trial spread to decompose at all.

Both methods give us something, but neither gives us a real AU/EU split from a single-method run. (To get that, you need $N$ forward passes each returning a full per-trial distribution — for example, $N$ greedy passes under input perturbation, which is what Section 9 runs.) For this section we ran the single-pass logit version because (a) it is roughly 16× cheaper in GPU time and (b) the per-clip $H(p)$ it produces is a denser, more usable signal than the sparse vote-distribution TU from $N=10$ sampling.
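For concreteness, here is what the sample-and-count side actually yields — a TU over the vote histogram, with AU pinned at zero by construction (a sketch with class labels as integers 0-9):

```python
from collections import Counter
import numpy as np

def vote_tu(sampled_answers, k=10):
    """TU from N sampled answers: entropy (bits) of the empirical vote
    histogram. Each trial is a one-hot, so AU = 0 and EU = TU."""
    counts = Counter(sampled_answers)
    p = np.array([counts.get(i, 0) for i in range(k)], dtype=float)
    p /= p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())
```

With $N=10$ trials this can only take a handful of discrete values, which is why we call the vote-distribution TU a sparse signal.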

How we extracted the logits

```python
import numpy as np
import torch

out = model.generate(
    **inputs, max_new_tokens=1,
    output_scores=True, return_dict_in_generate=True,
    do_sample=False,                          # greedy decoding
)
logits = out.scores[0][0]                     # full-vocab logits at the answer position
class_logits = logits[letter_token_ids]       # restrict to A..J (the 10 cluster letters)
p = torch.softmax(class_logits.float(), dim=0).cpu().numpy()
H = -(p * np.log2(p + 1e-12)).sum()           # bits, in [0, log2(10) ≈ 3.32]
```

The prompt is reformulated as multi-choice (each cluster gets a letter A-J, the model is asked to answer with one letter) so that the answer position is a single token. We restrict the vocab-sized logit vector to the 10 letter-token IDs and softmax to get a clean 10-class distribution.
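One subtlety worth guarding against: the answer-position trick is only valid if each answer letter tokenizes to exactly one token. A hedged helper (the `encode` callable is an assumption standing in for the judge’s tokenizer; real HF tokenizers differ on leading-space variants, so this must be checked per model):

```python
def letter_ids(encode, k=10):
    """Return the single token id for each answer letter A, B, ... (k letters).
    `encode` maps a string to a list of token ids; assert each letter is
    exactly one token, otherwise the single-position logit read is invalid."""
    ids = []
    for i in range(k):
        letter = chr(ord("A") + i)
        toks = encode(letter)
        assert len(toks) == 1, f"letter {letter} is not a single token"
        ids.append(toks[0])
    return ids
```

If a letter splits into multiple tokens, the fix is to pick a different answer alphabet or read the logits at the first sub-token only, with the caveat noted explicitly.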

Headline numbers (50 clips × 3 judges, single-pass)

| Judge | Top-1 acc | Mean $H(p)$ | Mean top-class prob | $H(p)$ on correct | $H(p)$ on wrong | Escalation $\Delta$ |
|---|---|---|---|---|---|---|
| Cosmos-Reason 2-2B | 20% | 1.700 bits | 0.566 | 1.437 | 1.765 | +0.329 ✓ informative |
| Video-LLaVA-7B | 10% | 0.374 bits | 0.949 | 0.349 | 0.377 | +0.027 ≈ noise |
| Molmo2-8B | 14% | 1.208 bits | 0.718 | 1.352 | 1.185 | −0.167 ✗ anti-informative |

The escalation signal $\Delta$ is mean H(p) on wrong predictions − mean H(p) on correct predictions. If positive, the model is more uncertain when it is wrong — useful as a “flag for human review” signal. If negative, the model is more confident when wrong — actively misleading as a confidence proxy.

Per-cluster and distributional view

Grouped bar chart showing mean H(p) per cluster for each of the three judges across all 10 Waymo clusters, with a dashed reference line at the maximum entropy log2 of 10 ≈ 3.32 bits. Cosmos and Molmo bars are noticeably taller across most clusters than Video-LLaVA, which sits near zero on most.

Three-panel histogram showing H(p) split by correct (green) versus wrong (red) predictions for each judge. Cosmos shows wrong predictions clustering at higher entropy than correct ones (delta = +0.33 bits). Video-LLaVA shows almost all clips at low entropy regardless of correctness. Molmo shows correct predictions at higher entropy than wrong ones (delta = -0.17).

Three-panel histogram of the per-judge H(p) distribution across all 50 clips. Cosmos has a broad distribution centered around 1.7 bits, Video-LLaVA is sharply concentrated near zero, Molmo is roughly uniformly spread between 0.5 and 2.0 bits.

Reliability diagram (3 panels, one per judge) plotting top-class probability bin (x-axis) against empirical accuracy in that bin (y-axis), with a dashed perfect-calibration y = x diagonal. Cosmos sits below the diagonal across the prob range. Video-LLaVA's points are concentrated in the 0.9-1.0 confidence bin with empirical accuracy near 0.1 — extreme overconfidence. Molmo is bimodal with no monotonic relationship.

Three observations worth knowing

Cosmos’s $H(p)$ is a real continuous escalation signal. The +0.329 bit gap between wrong and correct predictions is comfortably above noise — wrong predictions sit on a noticeably wider distribution than correct ones (see middle panel above). And critically the $H(p)$ from single-pass logits is continuous: every clip gets a real-valued entropy, so production code can do if H > 1.6: send to human rather than the binary did_it_flip_flop_under_sampling you would get from a vote-distribution proxy on $N$ sampled trials. A Cosmos-based learned evaluator can use $H(p)$ as a graded confidence score that actually orders clips by risk.

Video-LLaVA is severely overconfident. Mean top-class probability is 0.95 with top-1 accuracy of 10%. The reliability diagram makes this concrete — VL’s predictions live almost entirely in the 0.9-1.0 confidence bin, where the empirical accuracy is also ~10%. Any downstream pipeline that gated on top-class probability would over-trust this judge by an order of magnitude. The escalation signal $\Delta = +0.027$ is essentially noise — VL’s wrong and correct predictions are equally low-entropy.

Molmo’s $H(p)$ goes the wrong way. Mean $H(p)$ on correct clips (1.35) is higher than mean $H(p)$ on wrong clips (1.19). Two readings, both bad for production use: either the correct answers happen on hard clips where Molmo is correctly uncertain (and just gets lucky on the argmax), or the multi-choice letter-mapping interacts with Molmo’s tokenizer in a way that distorts the softmax. Either way, Molmo’s $H(p)$ cannot be used as an escalation signal — gating on H > threshold would systematically suppress correct answers.

The general lesson: per-judge calibration must be measured, not assumed. Three judges, three completely different relationships between predicted confidence and empirical accuracy. A learned-eval framework that batched these 3 models behind a generic “if confidence high, accept” rule would silently produce a bias-amplifying pipeline.
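The reliability diagrams above can be reproduced from (top-class probability, correctness) pairs with a simple binning pass. A sketch, with the bin count as a free parameter:

```python
import numpy as np

def reliability_bins(top_probs, correct, n_bins=10):
    """Per-bin (mean confidence, empirical accuracy, count) tuples for a
    reliability diagram; perfectly calibrated points lie on y = x."""
    p = np.asarray(top_probs, dtype=float)
    c = np.asarray(correct, dtype=bool)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((float(p[mask].mean()), float(c[mask].mean()), int(mask.sum())))
    return rows
```

On Video-LLaVA this collapses to essentially one populated bin near confidence 1.0 with accuracy near 0.1, which is the overconfidence finding in one data structure.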


8. The triage funnel — VLM as a router for human raters

The point of running this experiment is not to replace human raters with VLMs — none of the three judges is accurate enough for that, and even if they were, the AV labeling problem is too high-stakes to hand to a 20%-accurate model. The point is to use the VLM as a router that splits the incoming clip stream into the right downstream queue:

Triage funnel diagram: incoming clip stream (blue) flows into three VLM judges (orange center node), then fans out into three outcome routes — auto-bin in green for clips where all 3 judges agree and H(p) is low, human review in orange for clips where H(p) is high or judges disagree, and novel-scenario candidate in purple for clips where H(p) is high and the distribution is uniform-ish. A dashed callout below the central node notes that only Cosmos's H(p) is calibrated — per-judge gating is required.

The three branches map to three different downstream costs:

  1. Auto-bin — low $H(p)$ AND all 3 judges agree → batch-accept the cluster label without human review. The cheapest route. From our 50-clip sample this would catch the easy intersections and the unambiguous cyclists, freeing rater time for the genuinely hard clips. (In our run, 0 of 50 clips had all-3-agree-and-correct, but that’s a function of how bad these particular zero-shot judges are; a fine-tuned Cosmos would dramatically improve this.)
  2. Human review — high $H(p)$ OR judges disagree → send to a rater. Standard cost. This is the case where the VLMs admit uncertainty (or contradict each other), which is exactly when human judgment is most valuable.
  3. Novel-scenario candidate — high $H(p)$ AND none of the judges is confident AND the predicted distribution is roughly uniform across multiple non-default classes → flag as potentially out-of-taxonomy. This is the most interesting bucket. If the VLMs collectively give up — none of them lock onto a confident prediction and the class distribution looks like the model is groping — it is a hint that the clip might not fit any of the existing 10 categories cleanly. Routing those clips to a taxonomy-review queue (rather than to a regular rater) lets the dataset evolve.

The calibration finding from Section 7 is the gating constraint on this design. Only Cosmos’s $H(p)$ is monotonic with correctness. Video-LLaVA’s confidence is meaningless (95% confident at 10% accuracy), and Molmo’s $H(p)$ goes the wrong way. So the practical funnel uses Cosmos’s $H(p)$ as the primary uncertainty gate, with the other two judges contributing as cross-judge agreement checks rather than as confidence sources. Routing rules cannot be uniform across judges — they must be derived per judge from a calibration set.


9. Analysis 3: a real AU / EU split via prompt-paraphrase perturbation

Section 7 produced one full softmax distribution per clip — enough for a per-clip $H(p)$, but not enough for the AU / EU decomposition (which needs $N$ distributions). To get a real split we need multiple full distributions per clip — same model, same letter-mapping, different something. The natural choices are: perturb the visual input ($N$ different temporal samplings of the video, or $N$ camera-subsets) or perturb the language input ($N$ paraphrases of the question). We picked prompt paraphrasing because it is uniform across all 3 judges, requires zero video work, and is honestly scoped — it measures language-side sensitivity of the prediction. Visual-side perturbation is a follow-up.

Setup

For each clip × judge × N=8 prompt phrasings (each a different framing of the same multi-choice question, with the same A-J letter→cluster mapping appended verbatim), we run one greedy forward pass and capture the full 10-class softmax distribution. We then compute the decomposition above on that list of 8 distributions.

Verification gate

Before applying to real data, the decomposition function (decompose(p_list) in analysis/uncertainty.py) is gated by 9 new tests added to the existing 21-test suite — covering unanimous one-hots ($\text{TU} = \text{AU} = \text{EU} = 0$), consistent uniform ($\text{TU} = \text{AU} = \log_2 K$, $\text{EU} = 0$ — pure aleatoric), disagreeing one-hots ($\text{AU} = 0$, $\text{EU} = \text{TU}$ — pure epistemic), the identity $\text{TU} = \text{AU} + \text{EU}$ on random Dirichlet samples, and Jensen’s inequality $\text{AU} \le \text{TU}$. All 30 tests pass before this section’s numbers exist.

On real data: the identity holds to within $10^{-9}$ for all 50 × 3 = 150 decompositions, and AU is strictly positive in every one — a sanity check that the per-trial outputs are genuine full softmax distributions rather than degenerate one-hots.

Headline numbers (50 clips × 3 judges, N=8 paraphrases per clip)

| Judge | Modal acc | Mean TU | Mean AU | Mean EU | AU escalation $\Delta$ | EU escalation $\Delta$ |
|---|---|---|---|---|---|---|
| Cosmos-Reason 2-2B | 22% | 2.152 bits | 2.065 | 0.087 | +0.243 | +0.011 |
| Video-LLaVA-7B | 10% | 0.410 bits | 0.391 | 0.019 | +0.045 | +0.004 |
| Molmo2-8B | 12% | 1.591 bits | 1.449 | 0.141 | +0.111 | −0.030 |

Two facts jump out and shape the rest of the section:

  1. AU dominates EU by an order of magnitude or more for all three judges. Almost all of the per-clip uncertainty under prompt paraphrasing is the model spreading mass within each phrasing’s distribution, not phrasings disagreeing with each other.
  2. Cosmos’s AU is huge — 2.07 bits, which is 62% of the maximum possible entropy $\log_2 10 = 3.32$. The model is genuinely hedging on each individual answer.

What this means: the model is consistent, but consistently uncertain

The AU / EU decomposition is designed to separate two regimes. High AU with low EU means every phrasing returns the same wide distribution: the model hedges, and hedges consistently. Low AU with high EU means each phrasing returns a narrow distribution, but different phrasings lock onto different answers: the model is confident yet inconsistent.

Putting those together: all 3 judges are consistent across phrasings (low EU), but Cosmos and Molmo are individually uncertain (high AU); Video-LLaVA is individually narrow (low AU) and consistently narrow across phrasings (low EU) — i.e., consistently confidently-wrong. This corroborates the Section 7 finding (VL was 95% confident at 10% accuracy) through a completely different lens: VL’s narrowness is not an artifact of greedy decoding, and it is not paraphrasing-sensitive — the model just locks in.

Per-cluster picture

Two stacked grouped-bar charts. Top panel: per-cluster mean AU for the three judges across all 10 Waymo clusters; Cosmos's bars dominate, with Multi-Lane Maneuvers and Single-Lane Maneuvers near the maximum entropy ceiling. Bottom panel: per-cluster mean EU, with all three judges sitting near zero across the board.

For Cosmos, mean AU is high across nearly every cluster — the model doesn’t have one “easy” cluster type and one “hard” type, it’s broadly hedged. EU is tiny everywhere. Same shape for Molmo (just at lower magnitude). VL is the odd one out — both AU and EU are near zero on every cluster, which is the calibration story re-told.

AU vs EU per clip — the operational quadrant

Three-panel scatter plot, one per judge, showing each clip's AU on the x-axis and EU on the y-axis with dotted guide lines at 0.5. Cosmos points are clustered along a vertical strip at high AU (~2 bits) and low EU (~0.1 bits). Video-LLaVA points are clustered tightly near the origin. Molmo points are spread along a similar vertical strip at moderate AU (~1.5 bits) and low EU. Correct predictions are colored green, wrong predictions red, with no obvious horizontal separation between them in any panel.

The (AU, EU) quadrant has four operational meanings (high or low on each axis). Empirically these three judges occupy only two of the four: Cosmos and Molmo sit in the high-AU / low-EU quadrant (consistently hedging), and Video-LLaVA sits in the low-AU / low-EU quadrant (consistently confident, usually wrong). Both high-EU quadrants are empty under prompt paraphrasing.

AU and EU as escalation signals — which is the better wrong-answer flag?

Six-panel histogram (3 judges × 2 metrics) showing each metric's distribution split by correct (green) versus wrong (red) predictions. Top row is AU: Cosmos shows a clear right-shift of the wrong-prediction histogram by +0.243 bits, Molmo by +0.111 bits, Video-LLaVA shows a small +0.045 shift. Bottom row is EU: all three judges show essentially overlapping distributions for correct and wrong, with Δ values close to zero.

The AU row has signal — Cosmos’s wrong predictions sit on a noticeably wider distribution than its correct ones (+0.243 bits, the strongest escalation $\Delta$ in this whole study). Molmo also shows a positive AU $\Delta$ (+0.111 bits), so Molmo’s AU is more usable as an escalation signal than its Section 7 $H(p)$ was (which actually went the wrong way at −0.167). The EU row is essentially noise for all three judges, which makes operational sense given the paraphrase-consistency finding.

How this changes the production routing rule

In Section 8 we routed clips by Cosmos’s $H(p)$ alone because it was the only signal that monotonically tracked correctness. The paraphrase run lets us do better: route by AU, with EU as a secondary “did the model give different confident answers under paraphrasing?” check that flags clips for prompt-robustness review specifically. The routing rule from Section 8 becomes:

| Bucket | Trigger | Downstream |
|---|---|---|
| Auto-bin | low AU AND all 3 judges’ modal predictions agree | batch-accept the cluster label |
| Standard human review | high AU OR judges disagree | regular rater queue |
| Prompt-robustness review | high EU (rare under paraphrasing, common under visual perturbation when we add it) | prompt-design audit + senior rater |
| Novel-scenario candidate | high AU AND no judge confident AND $\bar{p}$ roughly uniform | taxonomy-review queue |

The Section 8 funnel still applies; the decomposition just splits the “high-uncertainty” branch into AU-driven and EU-driven sub-branches, giving the human rater more information about why the VLM is uncertain.
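Putting the routing table into code, under loudly illustrative assumptions — every threshold below is a placeholder, and real cutoffs must come from a per-judge calibration set as Section 7 argues:

```python
import numpy as np

def route_clip(au, eu, modal_preds, p_bar, top_conf,
               au_hi=1.6, eu_hi=0.5, conf_lo=0.4, uniform_bits=3.0):
    """Four-bucket routing sketch. All thresholds are hypothetical.
    modal_preds: the 3 judges' modal predictions for this clip.
    p_bar: the calibrated judge's mean 10-class distribution.
    top_conf: that judge's top-class probability."""
    p = np.asarray(p_bar, dtype=float)
    h_bar = float(-(p * np.log2(p + 1e-12)).sum())       # entropy of mean dist
    if eu > eu_hi:
        return "prompt-robustness review"
    if au > au_hi and top_conf < conf_lo and h_bar > uniform_bits:
        return "novel-scenario candidate"
    if au <= au_hi and len(set(modal_preds)) == 1:
        return "auto-bin"
    return "standard human review"
```

Checking the rare-but-expensive buckets first (prompt-robustness, then novel-scenario) keeps the cheap auto-bin route from swallowing clips that deserve a closer look.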

What this round explicitly doesn’t measure

This round measures language-side uncertainty only. The follow-up is visual-side perturbation: re-render each clip with $N$ different temporal samplings (different 8-second windows of the same scene) or $N$ different camera dropouts (front-only, rear-only, side-only). Under visual perturbation we expect EU to grow — the prediction may flip when the cyclist drops out of CAM_7+CAM_8 — and we can compare that EU against the AU we measured here. The clean version of the operational story is “AU from prompt paraphrasing + EU from visual perturbation”, and this round only delivers half of that.


10. What we still can’t measure

Two known gaps remain after the paraphrase round:

Cross-judge ensemble calibration. Section 8’s design uses three judges as cross-checks, but we have not measured whether ensemble agreement (e.g. all-3-confidently-agree) is itself a calibrated signal. It probably isn’t on this small sample — and given that all 3 judges fail differently, an ensemble may be no better calibrated than any single one. Worth running on a ≥500-clip set before deploying anything.

No fine-tuning baseline. All three judges are zero-shot. The natural comparison is Cosmos with LoRA fine-tuning on a 200-clip Waymo-labeled subset — based on similar literature, expect a 30-50 percentage-point lift on top-1 accuracy. The interesting question is whether fine-tuning also improves the calibration of $H(p)$ and AU, or whether the model just gets more confidently wrong.


11. Key references

| Year | Paper / Resource | Relevance |
|---|---|---|
| 2011 | Houlsby et al., Bayesian Active Learning by Disagreement (BALD) | The mutual-information formulation of epistemic uncertainty used in the TU = AU + EU decomposition |
| 2017 | Kendall & Gal, What Uncertainties Do We Need in Bayesian Deep Learning? | Canonical reference for the aleatoric / epistemic split in deep models |
| 2017 | Guo et al., On Calibration of Modern Neural Networks | Background on reliability diagrams and the kind of post-hoc recalibration (Platt scaling, isotonic) that would be the next step for any of these three judges before production use |
| 2018 | Depeweg et al., Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-Sensitive Learning | The exact AU/EU decomposition we used |
| 2024 | Waymo, End-to-End Driving Dataset / Challenge | The dataset and the 10-cluster taxonomy |
| 2024 | LanguageBind / Lin et al., Video-LLaVA | One of the three judges |
| 2024 | NVIDIA, Cosmos-Reason 2 | One of the three judges (Qwen3-VL-based) |
| 2024 | Allen AI, Molmo / Molmo2 | One of the three judges |