ML Evaluation Engineer Interview: A Self-Study Question Bank
April 26, 2026
Last updated: April 2026
These are study questions I drew up while preparing for a senior ML evaluation engineering interview in the autonomous-vehicles space. Sharing in case they’re useful to anyone working through similar material.
The questions are written in second person — they’re the questions I asked myself. Each one has a collapsible suggested answer so you can self-quiz first, then check.
Topics covered:
- Production inference optimization
- VLM evaluation & self-supervised data pipelines
- Rule-based to LLM-agent migration
- LLM-as-a-Judge & golden sets
- Annotation pipelines & vendor management
- Production ML monitoring
- VLM fine-tuning for video hazard detection
- Sim-to-real robot manipulation with VLA models
- Evaluation methodology deep dives
- Domain curveballs
Sections 7 and 8 reference my published project blogs — SafetyGuardian (VLM hazard detection) and Sim-to-Real Robot Manipulation. Read those first if a question feels unmoored from context.
1. Production Inference Optimization
Q1. Decomposing a P50 latency reduction across graph compilation (e.g. PyTorch AOTI), low-precision quantization (FP8), and GPU-profiling-driven kernel/batching fixes — which lever typically moves the needle most, and which is a dead end?
Suggested answer
- The headline number is usually cumulative across all three levers; per-lever attribution depends heavily on the workload. Segmentation models, large VLMs, and diffusion transformers all respond differently.
- Profiling is the enabler — it tells you where AOTI compilation and FP8 quantization should be applied. AOTI/FP8 are the executors.
- Cost framing: latency wins translate to GPU savings only if autoscaling can capture them. A 40% P50 win at the same RPS means fewer replicas only if cold-start and tail-latency constraints permit.
Q2. Why FP8 over INT8 or INT4 on a large VLM? What’s the calibration set, how do you verify accuracy didn’t regress, and what tends to break first when you turn it on?
Suggested answer
- FP8 preserves dynamic range better for attention activations than INT8; INT8 post-training quantization tends to clip the activation outliers that long-tail tokens produce.
- Process: calibrate on a representative sample of production prompts, then re-run the visual-grounding eval suite to confirm no regression.
- Failure-mode escape hatch: mixed-precision — keep attention in FP16, FP8 the FFN.
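To make the process concrete, a minimal sketch of the calibrate-then-verify loop. `quantize_fp8` and `eval_suite` are hypothetical stand-ins for your quantization toolkit (TensorRT Model Optimizer, torchao, etc.) and your frozen grounding suite; the workflow, not the API, is the point.

```python
REGRESSION_BUDGET = 0.005  # max tolerated accuracy drop per eval slice (0.5pp)

def calibrate_and_verify(model, calib_prompts, eval_suite, quantize_fp8):
    """quantize_fp8 and eval_suite are hypothetical hooks -- substitute your
    quantizer of choice and your frozen visual-grounding eval suite."""
    baseline = eval_suite.run(model)  # per-slice accuracy, frozen before quantization
    qmodel = quantize_fp8(model, calib_data=calib_prompts)
    quantized = eval_suite.run(qmodel)
    for slice_name, base_acc in baseline.items():
        if base_acc - quantized[slice_name] > REGRESSION_BUDGET:
            # escape hatch from above: keep attention in FP16, FP8 only the FFN
            return quantize_fp8(model, calib_data=calib_prompts,
                                skip_modules=("attn",))
    return qmodel
```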
Q3. PyTorch AOTI vs. TensorRT vs. torch.compile — when do you reach for AOTI? What workloads does it not help on, and which endpoints stay in eager?
Suggested answer
- AOTI compiles the model ahead of time, removing Python overhead at serve time; it wins when input shapes are stable and you want predictable latency.
- TensorRT: use where vendor kernels dominate; AOTI: use where you want flexibility in the model code.
- Stays eager: highly dynamic input shapes, control-flow-heavy preprocessing, low-volume endpoints where compile time doesn’t pay back.
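For concreteness, the export path as a sketch. The AOTI packaging entry points have moved across PyTorch releases, so treat the exact calls below (roughly PyTorch 2.5+) as an assumption and check your version's docs:

```python
import torch

class Head(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.softmax(x @ x.transpose(-1, -2), dim=-1)

model = Head().eval()
example = (torch.randn(8, 128, 64),)  # stable shapes are what make AOTI pay off

ep = torch.export.export(model, example)            # capture the graph
pkg = torch._inductor.aoti_compile_and_package(ep)  # AOT-compile to a .pt2 artifact
runner = torch._inductor.aoti_load_package(pkg)     # also loadable from C++, Python-free
out = runner(*example)
```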
Q4. Multi-endpoint serving stack at scale — what’s the autoscaling policy? How do you handle cold starts on a multi-billion-parameter VLM, and what’s a reasonable tail-latency SLO?
Suggested answer
- Stack typically: Kubernetes + ArgoCD + Helm + Terraform.
- Cold-start on a multi-billion-param VLM is the hard problem: keep min-replicas > 0, and warm each replica (weights loaded, one inference run) before the readiness probe admits traffic.
- Tail latency: set the SLO on P99, not P50. At-scale eval has the same shape of problem: keep throughput high without paying for idle GPUs.
Q5. Research-to-production parity: what’s the canonical bug class that infra-as-code (Terraform-managed infra, container-pinned envs) prevents? What goes wrong before parity is enforced?
Suggested answer
- Bug class: research notebook works, production endpoint diverges due to preprocessing/postprocessing drift, version skew on the model artifact, library version mismatch.
- Terraform-managed infra + container-pinned envs is what closes the loop.
- Same problem class shows up as research-to-eval-to-on-vehicle parity in any safety-critical deployment.
2. VLM Evaluation & Self-Supervised Data Pipelines
Q6. VLM visual grounding evaluation methodology: walk through the metric. How do you handle ambiguous prompts? What’s the eval set size, how is it constructed, and what’s the acceptable failure mode?
Suggested answer
- Build a structured eval set across scene types — diversity is the metric’s defense against averaging away failures.
- Handle ambiguous prompts via multi-region acceptable answers + weighted scoring.
- The hard part is defining what correct means before measuring it. The load-bearing skill is structured eval-set construction.
- Open metric choice: IoU on bbox, pointing accuracy, top-1 region selection — depends on the use case.
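A minimal sketch of the multi-region handling, with IoU-on-bbox as the correctness test. Box format, threshold, and weights are illustrative assumptions; substitute your eval set's conventions:

```python
def area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_score(pred_box, acceptable_regions, threshold=0.5):
    """Ambiguous prompts carry several acceptable regions, each weighted;
    the score is the best weighted match, not all-or-nothing on one gold box."""
    return max((w for box, w in acceptable_regions if iou(pred_box, box) >= threshold),
               default=0.0)

# e.g. "the car on the left" with two defensible referents:
regions = [((10, 40, 120, 150), 1.0), ((15, 200, 110, 300), 0.5)]
print(grounding_score((12, 38, 118, 149), regions))  # 1.0
```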
Q7. VLM-as-a-Judge + captioning self-supervised pipeline — how do you prevent the judge from rubber-stamping the captioner’s mistakes (judge/generator correlation)? How do you measure that the expanded data actually improved the model rather than just inflating the dataset?
Suggested answer
- Risk to anticipate: judge and captioner share blind spots → the loop confirms its own bias.
- Mitigations: judge uses a different base model than the captioner; periodic human spot-check of judge agreement; track data-diversity metrics, not just volume.
- Measurement: held-out eval set frozen before the expansion ran; compare model trained on original vs. expanded.
Q8. Self-supervised expansion vs. human annotation — where’s the cost/quality crossover where you’d still pay for humans?
Suggested answer
- Self-supervised wins when judge precision is high and failure modes correlate with already-covered data slices.
- Humans still required for: novel scene types, edge cases, the judge calibration set itself.
- Frame as budget allocation: humans on calibration set + long-tail; pipeline on the bulk.
3. Rule-Based to LLM-Agent Migration
Q9. Rule-based → LLM-agent transition at production scale (millions of messages/day) — what’s the riskiest user-facing failure mode to control during rollout, and how do you ramp traffic safely?
Suggested answer
- Gradual traffic ramp via experimentation framework; rule-based fallback retained for high-confidence intents during transition.
- Riskiest failure mode: tool-use hallucination (calling the wrong tool with confident wrong arguments) — gated behind golden-set tool-use accuracy thresholds before each ramp step.
Q10. SFT with implicit feedback (acceptance signal): what’s the bias risk of training on accepted suggestions only? How would you correct for selection bias / counterfactual logging (IPS)?
Suggested answer
- Real risk: training only on accepted suggestions amplifies the existing distribution.
- Mitigations: log all suggestions (accepted + rejected); use rejection signal as negative; periodic exploration injection.
- Stronger correction: counterfactual logging with IPS reweighting.
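A sketch of the IPS correction, assuming the serving policy's propensity to surface each suggestion was logged (without logged propensities there is nothing to invert):

```python
def ips_weight(propensity, clip=10.0):
    """Inverse-propensity weight, clipped to keep the variance bounded."""
    return min(1.0 / max(propensity, 1e-6), clip)

def ips_loss(examples, loss_fn):
    """examples: (suggestion, accepted, propensity) tuples from counterfactual logs.
    Accepted and rejected suggestions both contribute; the rejections are the
    negative signal that acceptance-only SFT throws away."""
    total, weight_sum = 0.0, 0.0
    for suggestion, accepted, propensity in examples:
        w = ips_weight(propensity)
        total += w * loss_fn(suggestion, accepted)
        weight_sum += w
    return total / weight_sum  # self-normalized IPS estimate
```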
4. LLM-as-a-Judge & Golden Sets
Q11. Tool-use accuracy — how do you decompose this metric? Per-tool precision/recall on selection vs. argument correctness vs. end-to-end task success? Where do the metrics disagree?
Suggested answer
- Decomposition: per-tool precision/recall on tool selection + per-tool argument correctness + end-to-end task success.
- Disagreement pattern: high per-tool precision, low end-to-end success → planning/orchestration is the bottleneck, not individual tools.
- The mental model transfers to evaluating any complex multi-step agent.
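The decomposition in code, over logged agent episodes. Field names are illustrative:

```python
from collections import defaultdict

def tool_use_report(episodes):
    """episodes: dicts with gold_tool, pred_tool, args_correct, task_success."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    args_ok = defaultdict(list)
    e2e = []
    for ep in episodes:
        gold, pred = ep["gold_tool"], ep["pred_tool"]
        if pred == gold:
            tp[gold] += 1
            args_ok[gold].append(ep["args_correct"])  # argument correctness given selection
        else:
            fp[pred] += 1
            fn[gold] += 1
        e2e.append(ep["task_success"])
    report = {}
    for tool in set(tp) | set(fp) | set(fn):
        p = tp[tool] / (tp[tool] + fp[tool]) if tp[tool] + fp[tool] else 0.0
        r = tp[tool] / (tp[tool] + fn[tool]) if tp[tool] + fn[tool] else 0.0
        a = sum(args_ok[tool]) / len(args_ok[tool]) if args_ok[tool] else 0.0
        report[tool] = {"precision": p, "recall": r, "arg_accuracy": a}
    report["end_to_end_success"] = sum(e2e) / len(e2e)  # the disagreement detector
    return report
```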
Q12. LLM-as-a-Judge validation: how do you validate each judge against the golden set — what precision/recall threshold gates deployment, and how often do judges drift and need recalibration?
Suggested answer
- Each judge validated against a golden set with precision/recall metrics before deployment.
- Recalibration trigger: judge precision on a rolling holdout slice falls below threshold.
- Cadence: event-driven on prod data drift, plus scheduled audits.
Q13. Golden-set construction: who labels it, how big, how do you handle label disagreement, and how often is it refreshed?
Suggested answer
- Sizing rule of thumb: small enough to be high-quality, large enough to give precision/recall stable to ~2 decimal places.
- Refresh: when the underlying capability set or tool catalog changes.
- Disagreement: SME adjudication; track inter-rater rate as a quality signal.
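The sizing rule can be made concrete with a normal-approximation confidence interval: choose n so the CI half-width on a proportion is below your tolerance. A sketch:

```python
import math

def required_n(half_width=0.005, p=0.5, z=1.96):
    """Smallest n whose 95% CI half-width on a proportion is <= half_width.
    p=0.5 is the worst case; half_width=0.005 is what 'stable to ~2 decimal
    places' actually costs."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

print(required_n())      # 38416 -- two decimals is expensive
print(required_n(0.02))  # 2401  -- +/-2pp is the more common budget
```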
Q14. Offline judge scores vs. A/B tests on adoption metrics — how do you bridge the gap when they disagree? Which do you trust?
Suggested answer
- Disagreement is the interesting case: offline judge can be over-strict (misses user tolerance) or over-loose (misses friction).
- Trust hierarchy: A/B on adoption metric is ground truth for user impact; offline judge is the fast iteration loop. Use offline to filter candidates, A/B to confirm.
5. Annotation Pipelines & Vendor Management
Q15. A large jump in F1 on a domain (e.g., 0.65 → 0.9) — was the gain from data quality, label quality, model architecture, or label-space redefinition?
Suggested answer
- Most often: data-centric improvement and label-space cleanup, not model architecture.
- Annotation pipelines that filter noisy public data against an internal product taxonomy can move the needle dramatically without touching the model.
Q16. Vendor annotation pipelines: how do you measure annotator quality and inter-rater agreement? What’s your QA loop?
Suggested answer
- Vendor management: per-batch acceptance criteria; sampled audits.
- IRR: Cohen’s kappa for two raters, Fleiss’ kappa or Krippendorff’s alpha for >2 raters; weighted kappa for ordinal labels.
- For more subjective domains (e.g., “comfort” labels in driving), formal kappa is non-optional — spot-check agreement isn’t enough.
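All of these have off-the-shelf implementations; a sketch using scikit-learn and statsmodels:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Two raters, categorical labels:
a = ["cut_in", "cut_in", "merge", "cut_in", "merge"]
b = ["cut_in", "merge",  "merge", "cut_in", "merge"]
print(cohen_kappa_score(a, b))

# Ordinal labels (e.g. comfort 1-5): weighted kappa penalizes big disagreements more.
print(cohen_kappa_score([1, 2, 4, 5], [1, 3, 4, 4], weights="quadratic"))

# More than two raters: items x raters matrix of category codes.
ratings = np.array([[0, 0, 1], [1, 1, 1], [0, 1, 0], [2, 2, 2]])
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table))
```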
6. Production ML Monitoring
Q17. Quantifying ML-driven savings (e.g., a fraud-savings number) — how is that number computed? Counterfactual baseline, A/B holdout, or pre/post comparison?
Suggested answer
- Default framing: counterfactual based on baseline rate × volume × intervention rate.
- Acknowledge limitations: selection bias in pre-period, drift in baseline.
- Strongest method: A/B holdout if business risk allows. Pre/post is the weakest because confounders are unbounded.
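The default framing in code, with placeholder numbers. Every input is an assumption to be defended, which is exactly what the question probes:

```python
# Counterfactual savings = fraud that would have occurred without the model.
monthly_volume      = 2_000_000   # transactions / month (placeholder)
baseline_fraud_rate = 0.003       # pre-period rate, or better: concurrent holdout rate
avg_fraud_loss      = 120.0       # $ per fraudulent transaction (placeholder)
catch_rate          = 0.75        # fraction of fraud the model intervenes on

savings = monthly_volume * baseline_fraud_rate * catch_rate * avg_fraud_loss
print(f"${savings:,.0f}/month")   # $540,000/month

# The weak joint: baseline_fraud_rate drifts. An A/B holdout estimates it
# concurrently instead of inheriting it from the pre-period.
```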
Q18. Time-sensitive model monitoring — what specifically do you monitor (input drift, output drift, label delay), what fires alerts, and what’s a reasonable false-alarm rate?
Suggested answer
- “Time-sensitive” implies temporal drift: input feature drift (e.g., transaction patterns) and label arrival delay (labels confirmed days later).
- Alert design: thresholds too tight → alert fatigue; too loose → missed events. Calibrate against historical drift events.
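One common pattern as a sketch: a two-sample KS test of a current feature window against a frozen reference window, thresholded on the statistic (effect size) rather than the p-value, since p-values saturate at production sample sizes:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference, current, stat_threshold=0.08):
    """Threshold on the KS statistic, tuned by replaying historical drift
    events until the false-alarm rate is acceptable."""
    stat, _ = ks_2samp(reference, current)
    return stat > stat_threshold, stat

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 50_000)  # e.g. normalized transaction amounts, reference week
cur = rng.normal(0.3, 1.0, 50_000)  # mean shift: drift
print(drift_alert(ref, cur))        # (True, ~0.12)
```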
7. VLM Fine-Tuning for Video Hazard Detection
Questions in this section are grounded in the SafetyGuardian project blog — read that first for context on dataset size, LoRA config, and the Cosmos-Reason fine-tune.
Q19. Optuna hyperparameter search: 20 trials across LR (1e-5 to 5e-4), batch size, gradient accumulation, and LoRA rank. Why a high learning rate for LoRA fine-tuning, and how confident can you be that 20 trials is enough to claim a global optimum on a few hundred samples?
Suggested answer
- High LR is OK for LoRA because the adapter update starts at zero (the B matrix is zero-initialized) and the base model is frozen; the effective gradient magnitude is much smaller than in full fine-tuning.
- Honest: 20 trials on a small dataset is enough to find a working config, not a global optimum. Multi-seed runs at top-3 trial configs would estimate variance.
- The structure is right (log-scale LR + joint search over batch / grad-accum) even when trial budget is tight.
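The search structure as an Optuna sketch. `run_finetune_and_eval` is a placeholder for the actual LoRA fine-tune returning val loss:

```python
import optuna

def run_finetune_and_eval(cfg):
    """Placeholder: run the LoRA fine-tune with cfg and return final val loss."""
    raise NotImplementedError

def objective(trial):
    cfg = {
        "lr": trial.suggest_float("lr", 1e-5, 5e-4, log=True),  # log-scale LR
        "batch_size": trial.suggest_categorical("batch_size", [1, 2, 4]),
        "grad_accum": trial.suggest_categorical("grad_accum", [2, 4, 8]),
        "lora_rank": trial.suggest_categorical("lora_rank", [16, 32, 64]),
    }
    return run_finetune_and_eval(cfg)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
# Caveat from above: re-run the top-3 configs over multiple seeds before
# trusting the ranking -- 20 trials finds a working config, not an optimum.
```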
Q20. LoRA targeting all 7 modules (q/k/v/o + gate/up/down) at high rank with no dropout. Why all-modules over attention-only, why a large rank for a few hundred samples, and why no dropout?
Suggested answer
- All 7 modules: visual-reasoning fine-tuning needs to adapt both attention (where the model attends) and FFN (what concepts it represents). Attention-only LoRA underfits multimodal tasks.
- Large rank justified by hyperparameter search; 1:1 alpha/rank ratio is conventional.
- Dropout 0: at small dataset size, regularization comes from the small effective rank itself plus early stopping. Adding dropout on top hurt val loss in trials.
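The configuration as a PEFT sketch. Rank and alpha here are illustrative; the actual values live in the project write-up:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=64,                # "large rank" -- justified by the hyperparameter search
    lora_alpha=64,       # 1:1 alpha/rank, the conventional scaling
    lora_dropout=0.0,    # regularization comes from rank + early stopping
    target_modules=[     # all 7 projections: attention + FFN
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_cfg)  # base_model: the frozen VLM
```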
Q21. Mean token accuracy plateauing at ~0.52 after convergence. That’s not high — how do you defend it as “good enough” for a safety-warning system, and what’s the relationship between token accuracy and end-to-end hazard-classification correctness?
Suggested answer
- Token accuracy is misleading for structured output. A format like `HAZARD: <type> | SEVERITY: <level> | ACTION: <instruction>` has tokens that don't matter (the literal "HAZARD:") and tokens that do (the type). 0.52 token accuracy still allows correct slot extraction.
- The right metric is slot-level F1 on hazard type and severity, not token accuracy. Token accuracy was a training-loop diagnostic, not the deployment metric.
- For a production safety system, report end-to-end correctness on the validation set against human-rated ground truth, not training token-acc.
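A sketch of the slot extraction that makes slot-level metrics computable (the regex is permissive on whitespace and case; the format string is the one above):

```python
import re

SLOT_RE = re.compile(
    r"HAZARD:\s*(?P<hazard>[^|]+?)\s*\|\s*"
    r"SEVERITY:\s*(?P<severity>[^|]+?)\s*\|\s*"
    r"ACTION:\s*(?P<action>.+)", re.IGNORECASE)

def extract_slots(text):
    m = SLOT_RE.search(text)
    return {k: v.strip().lower() for k, v in m.groupdict().items()} if m else None

def slot_accuracy(preds, golds, slot):
    """Exact-match accuracy on one slot; per-class F1 follows the same parse."""
    pairs = [(extract_slots(p), extract_slots(g)) for p, g in zip(preds, golds)]
    hits = sum(1 for p, g in pairs if p and g and p[slot] == g[slot])
    return hits / len(pairs)

pred = "HAZARD: ice patch | SEVERITY: high | ACTION: slow down and keep right"
gold = "HAZARD: Ice Patch | SEVERITY: high | ACTION: reduce speed"
print(slot_accuracy([pred], [gold], "hazard"))  # 1.0 -- token mismatch, slot match
```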
Q22. Filtering protocol with the same model family as judge of generator (zero-shot Cosmos-Reason as auto-reviewer for Cosmos-Predict-generated videos) plus human ratings. Doesn’t this create a self-confirming loop — and how is it different from the judge/generator-correlation problem in self-supervised data expansion (Q7)?
Suggested answer
- Acknowledge the risk directly: same family means shared inductive biases. Catching the worst issues requires human review as the loop-breaker.
- The auto-filter is a first pass — rejects egregious cases (totally blank frames, wrong scene). Human is the precision filter.
- Same principle as Q7: human spot-check of judge agreement is non-optional when judge and generator share a backbone.
- Better: use a different VLM family (e.g., Qwen-VL or LLaVA) for the auto-filter to break the family correlation.
Q23. Frame resize 1280×720 → 400×400 with max_pixels=160000 for ~3× token efficiency. What’s lost in visual fidelity, and how do you validate that the smaller frames don’t destroy fine-grained hazard detection (small ice patches, distant pedestrians)?
Suggested answer
- Pixel reduction ~5.8×. Lost: distant pedestrians (small bbox), thin ice patches (texture-level), license-plate-readable text.
- Mitigated by: hazard categories being coarse-grained (pedestrian present yes/no, not “exact pose”); 5 frames per video give temporal redundancy.
- Right validation: explicit resolution ablation. If you skipped it, say so honestly.
Q24. End-to-end latency around 1s (VLM inference ~0.65s + TTS) for an elderly pedestrian warning use case — is 1s acceptable? What’s the failure cost of a 1.5s warning vs. a 0.5s one, and how would you redesign for sub-500ms?
Suggested answer
- Honest framing: 1s is fine for an advisory warning (notify when a hazard appears in the next 5–20s window), not fine for reactive obstacle avoidance (needs <200ms).
- Use case: “look-ahead heads-up display,” not “emergency brake.”
- Sub-500ms redesign: smaller VLM (1B or distilled), streaming TTS overlap with inference, edge-deployed tiny detector in parallel for hard fail-safes.
- VLM inference dominates: attack it with vLLM batched serving, a lower `max_pixels`, or a distilled model.
Q25. On-device inference impractical (~56s/frame on a phone) — for a real product that means a network dependency. How do you reason about availability/safety when the network is the SPOF?
Suggested answer
- Real-product requirements: graceful degradation (smaller on-device classifier as fallback for the most critical hazards); buffered last-good warning; explicit “system unavailable” indicator so the user doesn’t assume “no warning = no hazard.”
- Honest scope: hackathon prototypes are typically cloud-only by design; production architecture is hybrid.
Q26. Stopping at 20 epochs with linear warmup and a 90/10 train/val split with no held-out test set. With ~27 val samples, your val-loss signal is noisy — how do you decide to stop at epoch 20 rather than 8 or 30, and how do you defend the no-test-set choice?
Suggested answer
- Stopping criterion: train/val loss curves plateau (visible in W&B), plus generation samples (logged every 2 epochs) showing format compliance.
- Small val set: signal is noisy — rely on directional trend across multiple epochs, not single-checkpoint eval.
- No held-out test set is a known weakness — production needs a frozen test set untouched during all hyperparameter search.
- Honest finish: Optuna trials selected on val loss → the “best” config is val-set-overfit by definition.
8. Sim-to-Real Robot Manipulation with VLA Models
Questions in this section are grounded in my Sim-to-Real GR00T project blog. Read that first for context on the policy architecture, training mix, and generalization tests.
Q27. GR00T N1.6 uses Cosmos-Reason-2B as the vision backbone with 32 DiT layers and action chunking over 16 timesteps. Why action chunking over autoregressive action prediction, and what’s the failure mode when the chunk horizon doesn’t align with task phases (e.g., grasp moment)?
Suggested answer
- Why chunk: autoregressive in action space at 30Hz means 30 sequential model calls per second of robot motion — latency-prohibitive for a 2B+ DiT model. Chunking amortizes inference.
- Failure mode: when the chunk crosses a phase boundary (approach → grasp), the predicted chunk may be inconsistent (“open gripper at step 8, close at step 12” is hard to plan if uncertainty spikes mid-chunk).
- Standard mitigation: receding-horizon — predict 16, execute first 8, replan.
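The mitigation as a sketch. `policy.predict_chunk` and the environment interface are assumptions, not the GR00T API:

```python
CHUNK = 16    # action chunk length the policy predicts
EXECUTE = 8   # receding horizon: commit only the first half, then replan

def receding_horizon_loop(policy, env, horizon_steps=300):
    obs = env.observe()
    step = 0
    while step < horizon_steps:
        chunk = policy.predict_chunk(obs, length=CHUNK)  # one model call
        for action in chunk[:EXECUTE]:                   # discard the stale tail
            env.apply(action)
            step += 1
        obs = env.observe()  # replan from a fresh observation, so phase
                             # boundaries (approach -> grasp) get revisited
                             # every EXECUTE steps
```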
Q28. Synthetic > real co-training (counterintuitive result): sim+70 Cosmos-augmented = 2/3 vials, sim+real co-training (5–50 episodes) = 1/3 vials. Why does synthetic augmentation outperform real co-training? Doesn’t this contradict the conventional wisdom that real data > synthetic?
Suggested answer
- Likely explanation: with only 5–50 real episodes, real co-training adds noise without coverage. 70 Cosmos episodes adds diversity at scale that the policy needs more than fidelity.
- Honest caveat: “real > synthetic” is a function of quantity at parity. At low real-data quantity, high-quality synthetic wins. The crossover point is the interesting question.
- Same lesson applies to scenario coverage in any domain — synthetic-generated rare scenarios may beat sparse real captures of the same scenarios.
Q29. Domain randomization axes: lighting, HDRI, camera pose, object positions, pre-placement probability. Sim-only got high sim success but poor real success — so the sim-to-real gap was the bottleneck, not sim performance. Which DR axis tends to move real-world transfer most, and which is a placebo?
Suggested answer
- Without leave-one-out ablations, attribution is qualitative.
- Strongest qualitative signal: lighting/HDRI randomization tends to give robustness to lighting variation.
- Probable placebo: extreme camera-pose offsets — too large hurts training (forces the policy to learn invariances at the cost of accuracy at the nominal pose).
- Right experiment: leave-one-out per DR axis.
Q30. Generalization tests: OOD instruction → jerky motion; novel object (yellow rack → blue cup) → robot still targeted the original location. The model didn’t actually use the language conditioning for object grounding — it memorized spatial priors. How do you diagnose that, and what would the next iteration change architecturally?
Suggested answer
- Diagnosis: novel-object swap — robot still went to original location, ignoring the new visual.
- Root cause: behavior cloning on a small dataset (75–145 episodes) can’t disentangle language → object identity without explicit grounding supervision. The model latches onto spatial priors because they’re more reliable than the language signal at that scale.
- Architectural fix: contrastive vision-language objective at training time, OR pretrained VLM grounding head with action decoder fine-tuned on top.
- Eval connection: this is exactly the kind of generalization failure a learned evaluator should catch — does the policy use its inputs, or pattern-match to the training distribution?
Q31. Uncertainty estimation on a chunked DiT policy: for a real deployment, how do you add uncertainty estimation — and how does that uncertainty feed an evaluation signal?
Suggested answer
- Sources of uncertainty: aleatoric (action noise from teleoperator inconsistency) vs. epistemic (out-of-distribution observation).
- Approach for a DiT chunked policy: ensemble at the chunk level — sample multiple action chunks at different diffusion seeds, measure variance. High variance signals epistemic uncertainty.
- Eval-of-eval connection: that variance becomes a feature for the learned evaluator — “policy was uncertain on this scenario, escalate for human review.”
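A sketch of the chunk-level ensemble. The sampling interface is an assumption; the idea is the diffusion-variance signal that Diff-DAgger (see Further Reading) operationalizes:

```python
import numpy as np

def chunk_uncertainty(policy, obs, n_samples=8):
    """Sample action chunks at different diffusion seeds; high cross-sample
    variance flags epistemic uncertainty (OOD observation), which becomes
    an escalation feature for the learned evaluator."""
    chunks = np.stack([policy.sample_chunk(obs, seed=s) for s in range(n_samples)])
    # chunks: (n_samples, chunk_len, action_dim)
    per_step_var = chunks.var(axis=0).mean(axis=-1)  # (chunk_len,)
    return per_step_var.max()                        # spike anywhere in the chunk

# if chunk_uncertainty(policy, obs) > THRESHOLD: escalate for human review
```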
9. Evaluation Methodology Deep Dives
Q32. Evaluation-of-evaluation: how would you design a system to know whether your learned evaluator is correct? What’s the meta-eval loop, and how do you avoid infinite regress?
Suggested answer
- Meta-eval loop: hold out a human-labeled gold subset that the learned evaluator never sees during development; measure evaluator precision/recall against it.
- Avoid infinite regress by: anchoring on humans at the bottom of the stack — humans evaluate the evaluator, evaluator evaluates the model.
- Operational: budget ~5% of human labeling for evaluator calibration permanently, not as a one-time project.
Q33. [CASE — 10–15 min] Golden-set framework for AV behavior eval: how would you bootstrap one? Who labels, what’s the taxonomy, how do you handle long-tail scenarios that are release-blocking but rare in the dataset?
Suggested answer (6-stage walkthrough)
- Problem framing. Discrete behavioral events (cut-ins, hard braking, lane-change-without-signal) vs. continuous quality (comfort, lane-keeping deviation). Get the taxonomy from the team — don’t assume which subtypes are release-blocking.
- Ground-truth source. Stratified sampling: every rule-flagged event in fleet logs becomes a golden positive (cheap labels). Augment with human-labeled long-tail mined from fleet replay (expensive, targeted). Ultra-rare scenarios from synthetic generation.
- Method. Per-scenario-type sub-golden-sets so we measure precision/recall by behavior category, not aggregate. Aggregate metrics hide tail failures. Each sub-set sized to give CI tight enough to detect a 2pp regression.
- Eval-of-eval. Hold out ~5% of human labels permanently for evaluator calibration — never seen during evaluator training. Measure evaluator precision/recall against that frozen set, not against the model under test.
- Production loop. Refresh quarterly OR on a drift trigger (fleet distribution shift, new ODD region). Track evaluator-vs-human agreement as a control chart over time.
- Failure modes. (a) Labeler bias if a single team owns the golden set → rotate labelers + measure inter-rater. (b) Synthetic edge cases drifting from real-world distribution → periodic real-fleet calibration. (c) Golden-set staleness as behavior changes across model versions → tie refresh schedule to model versions, not calendar.
Q34. LLM-as-a-Judge calibration drift: same prompts, same model version — judges still drift. What’s your monitoring + recalibration cadence, and at what signal do you re-train vs. re-prompt vs. replace?
Suggested answer
- Drift sources: model version updates, prompt template changes, distribution shift in inputs.
- Monitoring: rolling-window precision on a fixed calibration set, daily.
- Recalibration ladder: re-prompt (cheap) → few-shot example refresh (medium) → fine-tune judge (expensive) → replace judge architecture (rare).
- Trigger: judge precision on calibration slice drops > 3pp week-over-week.
Q35. [CASE — 10–15 min] Agentic workflow for evaluating a complex driving scenario (e.g., unprotected left turn): walk through how you’d chain VLM perception → retrieval → structured reasoning → metric. Where’s the human in the loop?
Suggested answer (6-stage walkthrough)
- Problem framing. Multiple correctness criteria — yielded correctly, accepted gap was safe, acceleration was comfortable, stayed in lane, completed in reasonable time. Single-score collapses information the release manager needs; per-criterion + aggregated is right.
- Ground-truth source. Human raters score per-criterion on a sampled set; that sample is the calibration set. Inter-rater agreement (Cohen’s kappa or Krippendorff’s alpha) measured before trusting it as ground truth.
- Method (the agentic chain). (a) VLM perception extracts agents/lanes/signals from the clip. (b) Retrieval over similar past clips gives a behavioral baseline. (c) Structured CoT with per-criterion prompt templates. (d) Per-criterion score + aggregated scenario score with explicit weights.
- Eval-of-eval. Per-criterion judge precision/recall vs. human labels on calibration. Watch judge-pair correlation across criteria — if all criteria correlate at >0.95, you’ve collapsed to a single dimension and the per-criterion structure adds no signal; simplify.
- Production loop. Log the CoT trace alongside the score so debugging “why did this clip block release” is possible. Sample-based human review of release-blocking verdicts before the metric actually gates a deployment.
- Failure modes. (a) VLM perception missing a subtle agent (motorcyclist in blind spot) → false-greenlight; mitigate with multi-frame, higher-res clip, redundant perception. (b) CoT criteria disagree on overall verdict → use disagreement itself as escalation. (c) Retrieval pool contaminated with the AV’s own past behavior → judge becomes self-reinforcing; exclude same-model clips from retrieval.
Q36. Inter-rater reliability on subjective driving labels (e.g., “this lane change was uncomfortable”): which agreement metric, and what’s the threshold to call a label trustworthy?
Suggested answer
- Categorical: Cohen’s kappa (2 raters), Fleiss’ kappa or Krippendorff’s alpha (>2 raters).
- Ordinal (comfort 1–5): weighted kappa or ICC.
- Threshold: kappa > 0.6 is "substantial," > 0.8 "almost perfect" on the Landis & Koch scale. Under 0.6 means the labeling rubric needs work, not the raters.
Q37. Block-or-greenlight metric design: what statistical guarantees do you need before a learned metric can gate a software release? How do you communicate a probabilistic eval signal to release managers used to deterministic pass/fail?
Suggested answer
- Need: estimated false-positive rate (greenlights a regression) and false-negative rate (blocks a good release).
- Approach: confidence intervals on the metric using bootstrap; require lower-bound > threshold for greenlight.
- Communication: translate to release-manager language — “this metric, at this CI, says we are 95% confident regression is below X%.”
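The gate as a sketch: per-clip scores in, greenlight decision out:

```python
import numpy as np

def greenlight(scores, threshold=0.97, n_boot=10_000, alpha=0.05, seed=0):
    """scores: per-clip pass/fail (or [0,1]) from the learned metric.
    Gate on the CI lower bound, not the point estimate: 'we are 95%
    confident the true pass rate exceeds the threshold'."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    lower = np.quantile(boots, alpha)  # one-sided lower bound
    return lower > threshold, lower
```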
Q38. Video understanding for at-scale eval: tradeoff between a frozen big VLM scoring per clip vs. a fine-tuned smaller model. What’s the cost/latency/accuracy frontier?
Suggested answer
- Frozen big: zero training cost, broad capability, expensive inference, drift-resistant on prompt change.
- Fine-tuned small: cheap inference, narrow capability, requires labeled data + retraining cadence.
- Frontier: hybrid — small model for high-volume bulk; frozen big for sampling-based audit + edge-case scoring.
Q39. [CASE — 10–15 min] Rule-based → learned eval transition: how do you sunset rules without losing the safety floor they provide? What’s the migration playbook?
Suggested answer (6-stage walkthrough)
- Problem framing. Rules: cheap, deterministic, known-incomplete (don’t catch behaviors they weren’t written for). Learned eval: expensive, probabilistic, known-broader. The migration must not lose the safety floor while extending coverage.
- Ground-truth source. Rule-based eval on the historical fleet is the initial ground truth on rule-coverage scenarios — every rule-flagged event is a known positive, every non-flagged event in rule-coverage scope is a known negative. Human review supplies ground truth for non-rule-coverage scenarios.
- Method. (a) Train learned eval to match rule-eval on rule-coverage scenarios — the consistency check. (b) Extend learned eval to non-rule-coverage scenarios — the value-add. (c) Validate non-rule-coverage extensions against human review on a stratified sample.
- Eval-of-eval. Set thresholds: rule-coverage agreement > X% (target ~99%) before learned eval is trusted in rule-coverage cases. Non-rule-coverage agreement with humans > Y% (lower bar, ~90%, because human IRR caps the ceiling).
- Production loop. Phased rollout — keep rules running in parallel as a “second opinion” for the first N months. Learned eval only blocks releases when both rule and learned eval agree on a regression, OR when learned eval flags a regression in non-rule-coverage scope. Disagreements escalate to human review. Sunset rules per-category as trust accrues, not all-at-once.
- Failure modes. (a) Learned eval overfits to the rule pattern and doesn’t generalize → measure on held-out non-rule-coverage scenarios, not just rule recovery. (b) Safety-critical rules sunsetted too early because aggregate agreement looked good but tail cases failed → sunset by category with explicit per-category trust thresholds. (c) Rule and learned eval drift apart over time → schedule periodic agreement audits tied to model versions.
10. Domain Curveballs
Q40. AV scenario-based evaluation: name the public-domain landmarks. What should you read?
Suggested answer
- Waymo’s safety-case methodology and behavioral catalog.
- NHTSA behavioral taxonomies.
- Recent learned-planner literature: MotionLM, Wayformer, VAD.
- Public datasets: nuScenes, Waymo Open Motion Dataset.
Q41. Define a “cut-in” precisely enough that two annotators would agree. What are the edge cases?
Suggested answer
- Working definition: an adjacent-lane vehicle entering the ego lane within a TTC threshold (e.g., < 3 sec) ahead of ego, without sufficient gap.
- Edge cases: zipper merges (intentional and protocol-following — not really a cut-in?); slow-speed urban (TTC threshold needs adjustment); cyclist/scooter cut-ins (different hazard class).
- Real production taxonomies have 5+ subtypes — start from the team’s existing definitions, don’t invent your own.
Q42. Custom agent vs. DSPy / LangChain / CrewAI for an eval-of-eval workflow: what would you pick and why?
Suggested answer
- Custom: full control over prompt templating, tool registry, retry/fallback logic.
- DSPy: treats the prompt as something to optimize against an eval metric — maps directly to learned-evaluation workflows. Strong fit when eval metrics drive the optimization.
- LangChain: ecosystem; con: abstraction overhead.
- CrewAI: multi-agent orchestration — overkill unless the eval workflow has multiple cooperating agents.
- For an eval-of-eval pipeline, DSPy is the natural starting point.
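A minimal sketch of why DSPy maps so directly: the judge is a module, and the golden set plus an agreement metric drive the prompt optimization. Model name and fields are illustrative; check the current dspy.ai docs for API drift:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported backend

class JudgeClip(dspy.Signature):
    """Score whether the driving clip's lane change was executed safely."""
    clip_summary: str = dspy.InputField()
    verdict: str = dspy.OutputField(desc="'pass' or 'fail'")

judge = dspy.ChainOfThought(JudgeClip)

def agreement(example, pred, trace=None):
    return example.verdict == pred.verdict  # match the human golden label

golden = [dspy.Example(clip_summary="...", verdict="fail").with_inputs("clip_summary")]
optimized = dspy.BootstrapFewShot(metric=agreement).compile(judge, trainset=golden)
```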
Further Reading
References organized by section. All links verified at time of writing.
Production inference (§1)
- PyTorch 2.2 release blog — AOTInductor announcement — official launch post for AOT graph compilation; pairs with the AOTInductor docs for the API.
- H100 Transformer Engine: FP8 explained — NVIDIA’s high-level pitch on FP8; for the technical deep-dive see Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training and Per-Tensor and Per-Block Scaling Strategies for FP8.
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM) — Kwon et al., SOSP’23. Foundation reading for autoscaling and KV-cache economics behind §1 Q4.
LLM-as-a-Judge & evaluation methodology (§2, §4, §9)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al., NeurIPS’23. The canonical paper; covers position bias, verbosity bias, and self-enhancement bias — which is the “judge rubber-stamps the captioner” failure mode in Q7.
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines — Khattab et al., ICLR’24. The framework paper.
- DSPy GitHub and dspy.ai docs — implementation and current API.
- DSPy applied — concrete cases:
- Optimizing Databricks LLM Pipelines with DSPy (JetBlue case study) — the link to read first if §10 Q42 felt abstract. Walks through how JetBlue replaced manual prompt tuning in a multi-stage RAG chatbot with DSPy + LLM-as-a-Judge metrics, and got 2× faster deployment than their prior LangChain pipeline.
- Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy — five applied use cases: guardrail enforcement, hallucination detection in code, code generation, routing agents, prompt evaluation.
- DSPy + MLflow for Automatically Optimizing LLM Programs — Databricks DevRel runnable notebook tying DSPy optimizers to MLflow tracking.
- AI and Testing: Evaluation Synthesis (Jeff Nyman, testerstories) — applied walkthrough of seven RAG/chatbot eval metrics (Faithfulness, Contextual Precision/Recall/Relevancy, G-Eval, Conversation Completeness, Conversational G-Eval). Most useful for §3–§4: the worked example shows a multi-turn chatbot scoring 1.0 on conversation completeness despite turn-level retrieval scores of 0.5–0.58, demonstrating why turn-level grounding and conversation-level goal achievement are distinct system properties that need separate metrics — directly relevant to constructing golden sets for conversational agents.
Annotation pipelines & inter-rater reliability (§5)
- The Kappa Statistic in Reliability Studies (NIH) — clean Cohen’s kappa primer with the Landis & Koch (1977) interpretation thresholds referenced in §9 Q36.
- Cohen, Fleiss & Krippendorff: IAA Metrics & Implementation — interactive tutorial with worked examples for each metric.
- Understanding Krippendorff’s Alpha (Encord) — practical guide for when alpha beats kappa (ordinal labels, missing data, >2 raters).
VLM fine-tuning & LoRA (§7)
- LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., ICLR’22. Reference for §7 Q19–Q20 on rank, target modules, learning rate behavior.
- microsoft/LoRA — original implementation.
Sim-to-real, VLA models, and action chunking (§8)
- Domain Randomization for Transferring DNNs from Simulation to the Real World — Tobin et al., IROS’17. The DR foundation paper for §8 Q29.
- Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT) — Zhao et al., RSS’23. Action Chunking Transformer; the chunking pattern referenced in §8 Q27.
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion — Chi et al., RSS’23. Receding-horizon control + diffusion action heads — the policy architecture pattern that GR00T N1’s DiT chunked policy inherits.
- Accelerate Generalist Humanoid Robot Development with NVIDIA Isaac GR00T N1 — NVIDIA technical blog on the GR00T N1 model and synthetic-data blueprint that §8 builds on.
- Cosmos World Foundation Models (NVIDIA) and Simplify End-to-End AV Development with Cosmos — the Cosmos Predict / Reason / Transfer stack referenced throughout §7 and §8.
Uncertainty in policies (§8 Q31)
- Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks — Wang et al., 2025. Benchmarks 12 uncertainty-estimation methods on in-distribution and OOD QA. Headline takeaways: information-based methods win ID; density-based methods and P(True) win OOD; semantic-consistency methods are the most robust across datasets. The most directly useful reference for §8 Q31 — gives you a comparison table for which method to reach for when.
- What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? — Kendall & Gal, NIPS’17. The foundational framework paper — formalizes the aleatoric vs. epistemic decomposition that the LLM benchmark above operationalizes.
- Diff-DAgger: Uncertainty Estimation with Diffusion Policy for Robotic Manipulation — applied to chunked diffusion policies; shows how chunk-level diffusion variance can serve as an OOD signal, which is the operationalization §8 Q31 sketches.
Autonomous-vehicle eval & planner literature (§10)
- Waymo’s Safety Case Blueprint (blog) — accessible entry point.
- Building a Credible Case for Safety: Waymo’s Approach — the formal write-up; introduces the Case Credibility Assessment framework relevant to §9 Q37 (release-gating with statistical guarantees).
- MotionLM: Multi-Agent Motion Forecasting as Language Modeling — Seff et al., ICCV’23 (Waymo).
- Wayformer: Motion Forecasting via Simple & Efficient Attention Networks — Nayakanti et al.
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving — Jiang et al., ICCV’23. End-to-end vectorized planner; pair with VADv2 for the probabilistic planning extension.
Closing Notes
If a question landed and you didn't have a clean mental model for the answer, that's the signal; the fastest way to get unstuck is usually to write out your own version of the answer first, then compare. The point of the toggle isn't to give you the answer; it's to let you check your own.
The two CASE questions (Q33, Q35, Q39) are deliberately framed as 10–15 minute structured answers. If you’re prepping for a senior-IC interview, practice them out loud against a timer — written answers feel different from spoken ones, and the case format rewards explicit stage-by-stage structure over free-form depth.