Autonomous Systems: Vision-Language-Action Models
March 28, 2026
Vision-Language-Action Models: A First-Principles Guide
A bottom-up explanation from visual perception to embodied action
Last updated: March 2026
1. From Vision-Language to Action
The progression
Computer vision has evolved through a clear sequence of capability expansions:
| Era | Models | What they do | Output |
|---|---|---|---|
| Vision-only | ResNet (2015), ViT (2020) | Classify images, detect objects | Class labels, bounding boxes |
| Vision-Language | CLIP (2021), LLaVA (2023) | Connect images to natural language | Text descriptions, visual Q&A |
| Vision-Language-Action | RT-2 (2023), Octo (2024) | Perceive, reason, and act in the physical world | Motor commands, trajectories |
Each step adds a new modality. A vision-only model like ViT takes a photograph and outputs a class label such as “cup”. A vision-language model (VLM) like LLaVA can answer “where is the red cup relative to the plate?” A vision-language-action model (VLA) can take the instruction “pick up the red cup” and output the sequence of motor commands to actually do it.
Grounding: from words to physics
Grounding is the process of connecting abstract language descriptions to concrete physical referents. The instruction “pick up the red cup” requires two kinds of grounding:
- Visual grounding — identify which pixels in the camera image correspond to the red cup (object recognition + spatial localisation)
- Action grounding — translate the concept of “pick up” into a sequence of motor commands: move end-effector above the cup → lower → close gripper → lift
VLMs already solve visual grounding — they can locate objects, describe spatial relations, and follow complex instructions. The missing piece is action grounding: mapping that understanding to physical control signals. This is exactly the gap that VLAs fill.
Why VLMs are a natural starting point
VLMs pre-trained on internet-scale data (billions of image-text pairs) arrive with an enormous library of world knowledge: they know what thousands of objects look like, how they relate spatially (“the fork is to the left of the plate”), and can parse compositional instructions (“put the green block on top of the red block, then move both to the corner”). This knowledge transfers directly to robotics. A VLA doesn’t need to learn what a “cup” is from scratch — it inherits that from the VLM backbone and only needs to learn the mapping from perception to motor output.
2. VLA Architecture
The canonical pipeline
Nearly all modern VLAs follow a three-stage pipeline:
Camera Image(s) ──→ [Vision Encoder] ──→ Visual Tokens ──┐
                                                         ├──→ [Language Model Backbone] ──→ Hidden States ──→ [Action Head] ──→ Robot Actions
Language Instruction ──→ [Tokenizer] ──→ Text Tokens ────┘
Each component has a distinct role:
Vision encoder
The vision encoder converts raw camera images into a sequence of visual tokens — dense vector representations that the language model can process alongside text. Most VLAs use a pre-trained Vision Transformer (ViT), typically frozen or lightly fine-tuned:
- SigLIP (used in OpenVLA, pi0): a sigmoid-loss variant of CLIP’s vision encoder that produces better-calibrated visual features
- DINOv2 (used in some Octo variants): self-supervised ViT that captures rich spatial features without requiring text supervision
Concretely, a 224×224 image is split into a 14×14 grid of 16×16-pixel patches, each embedded as a 1024-dimensional vector, yielding 196 visual tokens. These tokens are projected into the language model’s embedding space via a learned linear layer or MLP.
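As a rough sketch of this tokenisation step (the projection weights here are random stand-ins, not a trained encoder, and the 1024-dimensional width is illustrative):

```python
import numpy as np

# Sketch of ViT-style patch tokenisation. Real encoders such as SigLIP use a
# learned convolutional projection and add position embeddings; this only
# illustrates the shapes involved.
rng = np.random.default_rng(0)

image = rng.standard_normal((224, 224, 3))   # H x W x C camera image
patch = 16                                   # 16x16-pixel patches -> 14x14 grid
grid = 224 // patch                          # 14

# Split into non-overlapping patches and flatten each one to a vector.
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(grid * grid, patch * patch * 3)

# Learned linear projection into the token embedding space (random stand-in).
W = rng.standard_normal((patch * patch * 3, 1024)) * 0.02
visual_tokens = patches @ W

print(visual_tokens.shape)  # (196, 1024)
```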
Language model backbone
The language model backbone is typically a pre-trained large language model (LLM) — PaLM-E, Llama 2, Gemma, or similar. It processes a combined sequence of visual tokens interleaved with tokenised language instructions. The LLM provides:
- Instruction following: parsing “put the apple in the bowl” into a structured plan
- Reasoning: determining which object is the apple, where the bowl is, what grasp strategy to use
- World knowledge: knowing that apples are graspable, bowls are containers, and the apple should go inside the bowl
The LLM outputs a sequence of hidden state vectors, one per token position. The hidden states at designated output positions are passed to the action head.
Action head
The action head maps the LLM’s hidden states to robot actions. This is where the VLA’s output becomes physical. There are three main design choices:
| Approach | How it works | Pros | Cons |
|---|---|---|---|
| Direct regression | MLP maps hidden states to continuous action values (e.g., 7 floats for 7-DOF arm) | Simple, precise | Unimodal — can only predict one action, problematic when multiple valid actions exist |
| Discrete action bins | Each action dimension is quantised into bins (e.g., 256); LLM outputs bin indices as tokens | Leverages LLM’s token prediction; simple training | Quantisation error; resolution limited by bin count |
| Diffusion-based | Iteratively denoises random noise into action sequences, conditioned on hidden states | Handles multi-modal distributions; smooth trajectories | Slower inference (requires multiple denoising steps) |
Concrete walkthrough
Consider a robot arm facing a table with several objects. The input is:
- Image: camera view showing an apple, a banana, and a bowl on a table
- Instruction: “put the apple in the bowl”
The forward pass proceeds as:
- Vision encoder (SigLIP) converts the camera image into 196 visual tokens capturing the spatial layout of all objects
- Tokenizer converts “put the apple in the bowl” into text tokens
- LLM backbone (e.g., Llama 2 7B) processes the concatenated sequence [visual tokens, text tokens]. Its internal attention identifies the apple (red, round, left side of image), the bowl (concave, right side), and infers the required motion direction (left → right, then down)
- Action head takes the LLM’s final hidden states and outputs a sequence of end-effector poses:
[(x₁, y₁, z₁, roll₁, pitch₁, yaw₁, gripper₁), ..., (xₙ, yₙ, zₙ, rollₙ, pitchₙ, yawₙ, gripperₙ)]
The robot executes these actions, physically moving the apple into the bowl.
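The walkthrough above can be condensed into a structural sketch. Every component below (`encode_image`, `tokenize`, `llm_backbone`, `action_head`) is a toy stand-in with illustrative shapes, not any model's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy embedding width

def encode_image(img):                 # vision encoder -> 196 visual tokens
    return rng.standard_normal((196, D))

def tokenize(text):                    # tokenizer -> one embedding per word (toy)
    return rng.standard_normal((len(text.split()), D))

def llm_backbone(tokens):              # LLM -> one hidden state per token
    return tokens + 0.1 * rng.standard_normal(tokens.shape)

def action_head(hidden):               # map final hidden states to 7-DOF actions
    W = rng.standard_normal((D, 7)) * 0.02
    return hidden[-8:] @ W             # e.g., read out an 8-step action chunk

img = np.zeros((224, 224, 3))
tokens = np.concatenate([encode_image(img), tokenize("put the apple in the bowl")])
actions = action_head(llm_backbone(tokens))
print(actions.shape)  # (8, 7): 8 timesteps x (x, y, z, roll, pitch, yaw, gripper)
```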
3. Key Models
RT-2 (Robotics Transformer 2, Google DeepMind, 2023)
RT-2 was the first model to demonstrate that a large VLM can be directly fine-tuned into an effective robot controller.
| Aspect | Detail |
|---|---|
| Base model | PaLM-E (12B) or PaLI-X (55B) |
| Key insight | Represent robot actions as text tokens in the LLM’s existing vocabulary |
| Action format | 7-DOF actions encoded as integer strings: "1 128 91 241 1 128 0" where each number is a discretised action dimension |
| Training data | Robot episodes from Google’s fleet + internet-scale image-text data |
The elegance of RT-2 is that it requires no architectural changes to the VLM — actions are just another kind of text output. The model learns to “speak robot” the same way it learned to speak English.
Results: RT-2 achieves 2× improvement on unseen objects over its predecessor RT-1 (62% vs 32% success rate), demonstrating that VLM pre-training enables strong generalisation. It can manipulate objects it has never seen in robot training (e.g., picking up a specific toy figure) because the VLM backbone recognises those objects from internet data.
Octo (UC Berkeley, 2024)
Octo is an open-source generalist robot policy designed for multi-embodiment control.
| Aspect | Detail |
|---|---|
| Architecture | Transformer-based (not derived from an LLM) |
| Training data | 800K episodes from the Open X-Embodiment dataset |
| Action head | Diffusion-based — generates action chunks via iterative denoising |
| Multi-embodiment | Supports different robots via task-specific readout heads |
Octo’s diffusion action head is particularly important: it naturally handles multi-modal action distributions — situations where multiple valid actions exist (e.g., you can reach around either side of an obstacle). The readout heads are swappable, allowing the same backbone to control a WidowX arm, a Franka Panda, or other robots by changing only the final output layer.
OpenVLA (Stanford/Berkeley, 2024)
OpenVLA scales the VLA concept to 7 billion parameters, showing that bigger VLM backbones yield better robot controllers.
| Aspect | Detail |
|---|---|
| Base model | Llama 2 7B + SigLIP + DINOv2 dual vision encoders (via the Prismatic VLM backbone) |
| Training data | Open X-Embodiment dataset (970K robot episodes) |
| Action format | Each of 7 action dimensions discretised into 256 bins |
| Key finding | Scaling the VLM backbone improves manipulation success rate |
OpenVLA demonstrates a 16.5% absolute improvement over RT-2-X on WidowX manipulation tasks, providing evidence that the scaling laws observed in language models apply to robotic control as well. As an open-source release (weights, code, and data), it established a common baseline for VLA research.
pi0 (Physical Intelligence, 2024)
pi0 targets dexterous manipulation — tasks requiring fine motor control like folding laundry or assembling objects.
| Aspect | Detail |
|---|---|
| Architecture | VLM backbone + flow matching action head |
| Action generation | Flow matching (a continuous-time variant of diffusion) for smooth, precise trajectories |
| Pre-training | Internet-scale image-text data + large-scale robot data |
| Fine-tuning | Task-specific fine-tuning for dexterous manipulation |
The flow matching action head is a key distinction: instead of the iterative denoising steps of diffusion, flow matching learns a continuous vector field that transports samples from noise to actions along straight(er) paths. This yields faster inference and smoother generated trajectories — critical for dexterous tasks where jittery motions would cause failures.
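A toy sketch of flow-matching inference under strong simplifying assumptions: the velocity field below is the closed-form conditional field for a single known target, standing in for a trained network, and the action is a 3-vector rather than a full trajectory:

```python
import numpy as np

# Euler-integrate a velocity field from Gaussian noise to an action.
target = np.array([0.3, -0.1, 0.5])       # the action the "model" should produce

def velocity(x, t):                       # v(x, t) = (a1 - x) / (1 - t): the
    return (target - x) / (1.0 - t)       # field whose flow follows straight paths

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                # start from noise
steps = 10                                # flows need few, large steps compared
for i in range(steps):                    # with diffusion's long denoising chain
    t = i / steps
    x = x + (1.0 / steps) * velocity(x, t)

print(np.round(x, 3))                     # ≈ [0.3, -0.1, 0.5]
```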
4. Action Tokenisation
The fundamental challenge
Robot actions are continuous — joint torques, end-effector velocities, and gripper apertures are real-valued quantities with arbitrary precision. Language models operate on discrete tokens from a finite vocabulary. Bridging this gap is the core technical challenge of VLA design.
Discretising continuous actions
The simplest approach, pioneered by RT-2: bin each action dimension independently.
For a 7-DOF robot arm, each action is a vector $\mathbf{a} = (a_1, a_2, \ldots, a_7)$ where each $a_i \in [a_i^{\min}, a_i^{\max}]$. Discretisation maps each continuous value to one of $K$ bins:
\[\text{bin}(a_i) = \left\lfloor \frac{a_i - a_i^{\min}}{a_i^{\max} - a_i^{\min}} \cdot (K-1) \right\rfloor\]
With $K = 256$ bins (OpenVLA’s choice), the quantisation error is at most $\frac{a_i^{\max} - a_i^{\min}}{2 \times 255}$ per dimension — sub-millimetre when actions are per-step end-effector deltas spanning a few centimetres.
RT-2 maps these bin indices to existing integer tokens in the LLM vocabulary (the tokens for “1”, “128”, “91”, etc.), requiring no vocabulary modification.
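A quick numerical check of the error bound above, assuming actions are per-step end-effector deltas with an illustrative ±5 cm range:

```python
# Quantisation error for K = 256 bins over an assumed per-step action range.
K = 256
a_min, a_max = -0.05, 0.05        # metres per control step (assumed range)

bin_width = (a_max - a_min) / (K - 1)
max_error = bin_width / 2         # worst case: value sits midway between bins

print(f"max quantisation error = {max_error * 1000:.3f} mm")  # 0.196 mm
```

With a larger range, say a full 1 m workspace axis commanded in absolute coordinates, the same arithmetic gives roughly 2 mm, which is one reason delta-action parameterisations are common.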
Action chunking (ACT)
Action Chunking with Transformers (ACT) (Zhao et al., 2023) addresses a different problem: compounding errors.
When a policy predicts one action at a time, small errors accumulate. A 1° joint angle error per step becomes 10° after 10 steps. ACT predicts a chunk of $H$ future actions simultaneously:
\[\pi(\mathbf{o}_t) \rightarrow (\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+H-1})\]where $\mathbf{o}_t$ is the observation at time $t$ and $H$ is the chunk size (typically 10–100 steps).
Benefits:
- Reduced compounding error: the model plans further ahead, producing temporally coherent motion
- Smoother trajectories: consecutive actions within a chunk are jointly optimised
- Multi-modality via CVAE: ACT uses a Conditional Variational Autoencoder to sample different action chunks for the same observation, handling situations where multiple valid motions exist
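The chunk-and-ensemble execution scheme can be sketched as follows. The policy is a random stand-in, and the chunk size, horizon, and weight constant are illustrative; ACT's temporal ensembling averages all chunk predictions that cover the current step, weighting older predictions highest:

```python
import numpy as np

rng = np.random.default_rng(0)
H, DOF, T = 10, 7, 30                # chunk size, action dims, episode length

def policy(obs_t):                   # stand-in for pi(o_t) -> H future actions
    return rng.standard_normal((H, DOF))

buffer = {}                          # target step index -> list of predictions
executed = []
for t in range(T):
    chunk = policy(t)
    for h in range(H):               # file each prediction under its target step
        buffer.setdefault(t + h, []).append(chunk[h])
    preds = np.stack(buffer.pop(t))  # all predictions covering step t, oldest first
    w = np.exp(-0.1 * np.arange(len(preds)))   # oldest gets the largest weight
    executed.append((w[:, None] * preds).sum(0) / w.sum())

print(len(executed), executed[0].shape)  # 30 (7,)
```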
Diffusion-based action generation
Diffusion Policy (Chi et al., 2023) models the action distribution as a diffusion process. Starting from pure Gaussian noise $\mathbf{a}^{(T)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the model iteratively denoises to produce an action sequence:
\[\mathbf{a}^{(t-1)} = \text{denoise}_\theta(\mathbf{a}^{(t)}, \mathbf{o}, t) \quad \text{for } t = T, T-1, \ldots, 1\]where $\mathbf{o}$ is the observation (visual features + instruction embedding) and $\theta$ are learned parameters.
Concrete example: a robot needs to place a block that could go in either of two valid locations. A regression head would average the two locations (placing the block between them — a failure). A diffusion head naturally samples from the bimodal distribution, committing to one valid location per rollout.
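The bimodal-placement example can be made concrete with a toy one-dimensional demo. The "denoiser" here is a hand-written stand-in that steps each sample toward the nearer of two valid placements; a trained diffusion head would learn such a field from data:

```python
import numpy as np

modes = np.array([-1.0, 1.0])                 # two valid block placements (1-D)

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)           # a^(T) ~ N(0, 1)
for _ in range(50):                           # T denoising steps
    nearest = modes[np.abs(samples[:, None] - modes[None, :]).argmin(axis=1)]
    samples = samples + 0.2 * (nearest - samples)

regression_output = modes.mean()              # an MSE-trained head predicts the mean
print(regression_output)                      # 0.0: between the modes, a failure
print(np.abs(samples).mean())                 # ~1.0: every sample commits to -1 or +1
```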
Tradeoffs
| Method | Precision | Multi-modality | Speed | Training simplicity |
|---|---|---|---|---|
| Discrete bins | Limited by bin count | No (argmax over bins) | Fast (single forward pass) | Leverages LLM cross-entropy loss |
| Direct regression | High (continuous output) | No (unimodal) | Fast | Simple MSE loss |
| Diffusion / flow matching | High | Yes (samples from distribution) | Slow (10–100 denoising steps) | Requires noise schedule tuning |
| ACT (CVAE) | High | Yes (latent sampling) | Fast | Requires KL balancing |
5. Pretraining Recipes
VLAs are not trained from scratch. They follow a multi-stage recipe that reuses as much existing knowledge as possible.
Stage 1: Vision-language pre-training
The VLM backbone (e.g., Llama 2 + SigLIP) is pre-trained on billions of image-text pairs scraped from the internet. This stage teaches:
- Object recognition: what a cup, apple, screwdriver, etc. look like
- Spatial reasoning: understanding “on top of,” “to the left of,” “inside”
- Instruction parsing: following compositional natural language commands
- Common-sense physics: glasses are fragile, liquids spill, heavy objects require firm grasps
This is the most computationally expensive stage but is amortised across all downstream applications (not just robotics).
Stage 2: Robot data fine-tuning
The pre-trained VLM is fine-tuned on robot manipulation datasets. The primary dataset is Open X-Embodiment (2023):
| Statistic | Value |
|---|---|
| Total episodes | 1M+ |
| Robot embodiments | 22 |
| Tasks | 500+ manipulation skills |
| Data sources | 21 institutions worldwide |
During fine-tuning, the model learns to map visual understanding and instruction parsing to physical actions. Crucially, not all pre-trained knowledge needs to be re-learned — the VLM already understands “apple” and “bowl”; fine-tuning only teaches the action mapping.
The key insight: knowledge transfer
A VLA fine-tuned on Open X-Embodiment data has manipulated perhaps 100 distinct object categories in robot training. But through its VLM pre-training, it recognises thousands of objects. When encountering an object it has never physically manipulated (say, a rubber duck), the VLM backbone still identifies it correctly, and the action head can generalise the grasping strategy from similar objects seen during robot training.
Preventing catastrophic forgetting
A major risk during Stage 2 is catastrophic forgetting — the model “forgets” its VLM capabilities as it learns robot control. Mitigation strategies:
- Co-training: mix internet image-text data with robot data during fine-tuning (see Section 7)
- Low learning rate: fine-tune with 10–100× lower learning rate than pre-training
- Frozen encoder: keep the vision encoder weights frozen, only fine-tuning the LLM and action head
- LoRA: use parameter-efficient fine-tuning that modifies only a small fraction of weights
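The LoRA idea in the last bullet can be sketched numerically: freeze a pre-trained weight matrix $W$ and learn only a low-rank update $BA$, so a tiny fraction of parameters changes during robot fine-tuning. Shapes and the scaling convention below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 16                            # hidden width, LoRA rank

W = rng.standard_normal((d, d))            # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                       # trainable up-projection; zero-init
                                           # means no change at the start of training
def lora_forward(x, alpha=16):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

trainable = A.size + B.size
print(f"trainable fraction: {trainable / W.size:.3%}")  # 3.125% at rank 16
```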
6. Generalisation
Generalisation is the central promise of VLAs. A robot that only works on objects, environments, and instructions it has seen during training is not useful. VLAs generalise along three axes:
Unseen objects
The VLA can manipulate objects absent from its robot training data because the VLM backbone recognises them from internet pre-training.
Example: RT-2 was never trained to pick up a specific action figure in robot data, but the VLM backbone has seen millions of images of action figures online. When given the instruction “pick up the action figure,” RT-2 correctly identifies and grasps it.
Quantitative evidence:
- RT-2: 62% success on unseen objects vs 32% for RT-1 (which lacks VLM pre-training)
- OpenVLA: 16.5% absolute improvement over RT-2-X on WidowX manipulation tasks
Unseen environments
Real deployment means different backgrounds, lighting conditions, table heights, and camera angles from training. VLAs handle this through:
- VLM robustness: VLMs trained on diverse internet images are inherently robust to visual domain shift
- Domain randomisation: randomising colours, textures, lighting, and camera poses during training
- Spatial generalisation: the VLM’s spatial reasoning transfers across environments (understanding “left of” doesn’t depend on the specific table)
Unseen instructions
VLAs can follow novel natural language commands that recombine known concepts in new ways:
| Training instructions | Novel test instruction |
|---|---|
| “pick up the red block” | “stack the blocks by colour” |
| “put X in the bowl” | “sort the fruits into the two bowls” |
| “move X to the left” | “arrange the objects in a line from smallest to largest” |
This compositional generalisation comes from the LLM backbone, which understands language compositionality from pre-training. The model has never executed “stack by colour” as a single skill, but it can decompose it into known primitives.
7. Co-training with Internet Data and Robot Data
The data scarcity problem
Robot data is expensive. Even the largest robot dataset (Open X-Embodiment) contains roughly 1M episodes — orders of magnitude less than the billions of image-text pairs used for VLM pre-training. If you fine-tune a VLM exclusively on robot data, it tends to lose its broad visual and linguistic capabilities.
The co-training solution
Co-training interleaves robot episodes with internet image-text pairs during fine-tuning. In each training batch, some examples are robot manipulation sequences (image → action), and others are standard VLM tasks (image → text description, visual question answering, etc.).
Training batch:
[Robot] Image of table → "pick up cup" → action tokens: 1 128 91 241 1 128 0
[Internet] Photo of park → "A golden retriever catching a frisbee in a sunny park"
[Robot] Image of drawer → "open the top drawer" → action tokens: 0 64 180 128 1 90 1
[Internet] Diagram of solar system → "The third planet from the sun is Earth"
...
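A sampler in the spirit of RT-2's 50/50 mix can be sketched as follows; the dataset contents and batch size are placeholders:

```python
import random

# Each batch element is drawn from robot data or internet VQA/captioning data
# with a configurable mixing ratio.
robot_data = [("table.jpg", "pick up cup", "1 128 91 241 1 128 0")] * 100
internet_data = [("park.jpg", "A golden retriever catching a frisbee")] * 100

def sample_batch(batch_size=8, robot_fraction=0.5, rng=random.Random(0)):
    batch = []
    for _ in range(batch_size):
        if rng.random() < robot_fraction:
            batch.append(("robot", rng.choice(robot_data)))
        else:
            batch.append(("internet", rng.choice(internet_data)))
    return batch

batch = sample_batch()
print(sum(kind == "robot" for kind, _ in batch), "robot examples of", len(batch))
```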
Data ratio and loss weighting
Typical co-training mixes:
| Model | Robot data fraction | Internet data fraction |
|---|---|---|
| RT-2 | 50% | 50% |
| OpenVLA | ~100% (no explicit co-training) | ~0% (relies on the backbone’s VLM pre-training) |
| pi0 | Varies by stage | Internet data in pre-training, robot data in fine-tuning |
The data ratio matters: too much internet data and the model under-fits robot skills; too much robot data and it loses VLM knowledge. RT-2 found that equal mixing worked well, and critically, keeping internet data improved robot task success — not just language understanding. The hypothesis: internet data acts as a regulariser, preventing the model from over-fitting to the limited robot training distribution.
8. Embodiment-Agnostic Models
The vision
The ultimate goal is a foundation model for robotics — a single model that can control any physical embodiment (robot arms, quadrupeds, drones, humanoids) by learning shared representations of physical interaction. Just as GPT-4 handles English, French, and code with one model, an embodiment-agnostic VLA would handle a Franka Panda, a Boston Dynamics Spot, and a quadrotor with one model.
Octo’s approach
Octo addresses multi-embodiment control through modular tokenisation:
- Shared backbone: a single transformer processes all observations and generates shared hidden representations
- Embodiment-specific observation tokenisers: convert each robot’s sensor data (different camera configurations, proprioceptive states) into a common token format
- Embodiment-specific action readout heads: decode the shared hidden states into each robot’s native action space
Franka Panda (7-DOF arm) ──→ [Obs Tokeniser A] ──┐                           ┌──→ [Readout A] ──→ 7-DOF joint torques
                                                 ├──→ [Shared Transformer] ──┤
WidowX (5-DOF arm) ──→ [Obs Tokeniser B] ────────┘                           └──→ [Readout B] ──→ 5-DOF joint velocities
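Structurally, this amounts to one shared module with per-robot input and output adapters. The sketch below uses illustrative shapes and names, not Octo's actual API, and a linear layer stands in for the shared transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # shared token width

class Linear:
    def __init__(self, d_in, d_out):
        self.W = rng.standard_normal((d_in, d_out)) * 0.02
    def __call__(self, x):
        return x @ self.W

tokenisers = {"franka": Linear(14, D), "widowx": Linear(10, D)}  # obs dims differ
readouts   = {"franka": Linear(D, 7),  "widowx": Linear(D, 5)}   # action dims differ
backbone   = Linear(D, D)                                        # shared stand-in

def act(robot, obs):
    hidden = backbone(tokenisers[robot](obs))   # common representation
    return readouts[robot](hidden)              # robot-native action space

print(act("franka", np.ones(14)).shape)   # (7,)
print(act("widowx", np.ones(10)).shape)   # (5,)
```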
Open X-Embodiment dataset
The enabling dataset for embodiment-agnostic models is Open X-Embodiment (2023), a collaborative effort across 21 institutions:
| Property | Value |
|---|---|
| Robot types | 22 distinct embodiments |
| Data format | Standardised RLDS (Reinforcement Learning Datasets) format |
| Tasks | 500+ manipulation skills |
| Scale | 1M+ episodes |
Standardising the data format across robots was itself a significant engineering effort — different labs use different coordinate frames, control frequencies, camera setups, and action representations.
Current limitations
Embodiment-agnostic models remain far from matching specialist models:
- Action space mismatch: a 7-DOF arm and a quadruped have fundamentally different action spaces. Shared representations must abstract over these differences, losing embodiment-specific structure
- Negative transfer: training on data from dissimilar robots can hurt performance on any single robot. A quadruped’s locomotion data may not help (and may hinder) a tabletop manipulation model
- Observation heterogeneity: robots have different camera configurations (monocular, stereo, wrist-mounted), proprioceptive sensors, and tactile feedback. Unifying these into a common input format inevitably loses information
- Performance gap: Octo fine-tuned for a specific robot still underperforms specialist policies trained only on that robot’s data
9. Benchmarks
Simulation benchmarks
| Benchmark | Focus | Key features |
|---|---|---|
| SIMPLER (Li et al., 2024) | VLA evaluation in simulation | Google Robot + WidowX simulation with real-world-matched visual rendering; designed to predict real-world VLA performance |
| Language-Table | Language-conditioned tabletop manipulation | 2D pushing tasks with natural language instructions; 442K human demonstrations |
| CALVIN | Long-horizon manipulation | Sequences of 5+ sub-tasks; tests compositional task execution |
| RLBench | Diverse manipulation | 100 tasks with language variations; supports multiple observation types |
SIMPLER deserves special attention: it was specifically designed to evaluate VLAs by using visually realistic rendering that matches real-world conditions. Its key contribution is showing high correlation between simulation and real-world VLA performance, enabling cheaper evaluation loops.
Real-world evaluation
Real-world robot evaluation remains the gold standard but is fundamentally challenging:
- Protocol: typically 20–50 trials per task, measuring binary success rate
- Non-reproducible: exact conditions (lighting, object placement, calibration) vary between trials
- Expensive: each trial requires physical setup, execution, and reset — often manual
- Small sample sizes: statistical significance is hard to establish with 20–50 trials; a “5% improvement” may be within noise
Most VLA papers report real-world success rates across a curated set of tasks (e.g., “pick up X,” “put X in Y,” “open drawer”), with separate categories for seen vs unseen objects/instructions.
10. Connection to Autonomous Driving
The VLA framework extends naturally beyond tabletop manipulation to autonomous driving, where the “action” is a planned trajectory rather than a robot arm command.
EMMA as a driving VLA
EMMA (Waymo, 2024) is a direct instance of the VLA paradigm applied to driving:
| VLA Component | EMMA Implementation |
|---|---|
| Vision encoder | Gemini’s built-in visual encoder processing multi-camera images |
| Language backbone | Gemini 1.0 Nano-1 |
| Action output | Future trajectory waypoints $(x_t, y_t)$ in BEV space, encoded as text |
| Instruction input | Task-specific text prompts (e.g., “predict the ego trajectory”) |
EMMA uses chain-of-thought reasoning before action output: it first describes the scene, identifies critical objects, and states a high-level driving decision, then outputs trajectory waypoints. This mirrors VLA designs where the LLM “reasons” before generating actions.
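The shape of such an output, and how the trajectory would be parsed back out, can be illustrated as below. The scene text and waypoint values are entirely invented for illustration; they do not reproduce EMMA's actual prompt or output format:

```python
import re

response = """Scene: four-way intersection, light is green.
Critical objects: pedestrian at (3.2, 1.1) approaching the crosswalk.
Decision: proceed slowly, yield if the pedestrian enters the crosswalk.
Trajectory: (0.8, 0.0), (1.6, 0.1), (2.4, 0.1), (3.1, 0.2)"""

# Parse waypoints from the final (trajectory) line of the text output.
waypoints = [(float(x), float(y)) for x, y in
             re.findall(r"\(([-\d.]+), ([-\d.]+)\)", response.splitlines()[-1])]
print(waypoints)  # [(0.8, 0.0), (1.6, 0.1), (2.4, 0.1), (3.1, 0.2)]
```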
Other driving VLAs
| Model | Approach | Key Feature |
|---|---|---|
| DriveVLM (Tsinghua, 2024) | VLM + chain-of-thought + hierarchical planner | Scene description → critical object identification → decision → planning |
| LMDrive (2024) | Closed-loop LLM driving with language instructions | Takes natural language navigation instructions as input |
| DriveLM (OpenDriveLab, 2024) | Graph-structured visual QA for driving | Structures driving reasoning as a graph of perception → prediction → planning QA pairs |
The VLA4AD survey taxonomy
The VLA for Autonomous Driving (VLA4AD) survey organises driving VLAs into two paradigms:
- End-to-end VLA: a single model maps sensor inputs directly to driving actions (EMMA, LMDrive). The VLA handles perception, prediction, and planning in one forward pass
- Dual-system VLA: the VLM handles high-level reasoning and scene understanding, while a separate classical or learned planner handles trajectory optimisation (DriveVLM-Dual). The VLM acts as a “copilot” providing structured scene descriptions to a conventional planner
Action tokenisation in driving
The same action tokenisation challenges from robotics appear in driving, with domain-specific solutions:
| Approach | Model | How it works |
|---|---|---|
| Text-based trajectory | EMMA | Waypoint coordinates written as floating-point text: “(2.1, 0.3), (4.5, 0.8), …” |
| Discrete motion tokens | MotionLM (Waymo, 2023) | Quantise trajectory space into discrete tokens; model multi-agent motion forecasting as next-token prediction |
| Continuous regression | UniAD, VAD | Direct regression of BEV trajectory waypoints via MLP head |
MotionLM is particularly elegant: it treats multi-agent trajectory prediction as a language modelling problem, tokenising the 2D position space into a discrete vocabulary and predicting future positions autoregressively. This directly parallels RT-2’s approach of representing robot actions as text tokens.
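The tokenisation step can be sketched as follows. The grid size, displacement range, and trajectory values are illustrative, not MotionLM's actual vocabulary:

```python
# Quantise 2-D per-step displacements into a small discrete vocabulary so
# trajectory forecasting becomes next-token prediction.
GRID, LOW, HIGH = 13, -2.0, 2.0             # 13 x 13 = 169 motion tokens

def to_token(dx, dy):
    ix = min(max(round((dx - LOW) / (HIGH - LOW) * (GRID - 1)), 0), GRID - 1)
    iy = min(max(round((dy - LOW) / (HIGH - LOW) * (GRID - 1)), 0), GRID - 1)
    return ix * GRID + iy                   # single vocabulary index

def from_token(tok):
    ix, iy = divmod(tok, GRID)
    scale = (HIGH - LOW) / (GRID - 1)
    return LOW + ix * scale, LOW + iy * scale

traj = [(1.0, 0.3), (-0.4, 0.7)]            # per-step (dx, dy) displacements
tokens = [to_token(dx, dy) for dx, dy in traj]
recon = [from_token(t) for t in tokens]
print(tokens)                               # two indices into the 169-token vocab
```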
Robotics vs driving: key differences
| Dimension | Robotic Manipulation | Autonomous Driving |
|---|---|---|
| Action space | 6-7 DOF end-effector pose + gripper | 2D trajectory waypoints (x, y) or steering + acceleration |
| Temporal horizon | 1–10 seconds | 5–15 seconds (trajectory prediction) |
| Safety criticality | Low (tabletop tasks) | Extremely high (human lives at stake) |
| Multi-agent | Usually single-agent | Must predict and react to dozens of agents |
| Evaluation | Real-world trials (20–50) | Closed-loop simulation + real-world drives |
| Data scale | ~1M episodes | Millions of driving hours (Waymo, Tesla) |
Despite these differences, the core VLA insight transfers: pre-trained VLMs provide world knowledge that improves generalisation to novel scenarios, whether those scenarios involve unseen objects on a table or rare driving situations at unusual intersections.
Summary
Vision-Language-Action models represent the convergence of three research threads: visual perception (ViT, DINO), language understanding (LLMs), and embodied control (robot learning). The key architectural pattern — vision encoder → language model → action head — is simple, but its power comes from inheriting internet-scale knowledge through pre-trained VLM backbones.
The field’s central tensions remain:
- Discrete vs continuous actions: tokenising actions as text is elegant but lossy; diffusion/flow matching is expressive but slow
- Generalist vs specialist: embodiment-agnostic models are appealing but still underperform specialist policies
- Simulation vs real-world evaluation: simulation is cheap and reproducible; real-world is expensive but trustworthy
- Data scarcity: robot data remains orders of magnitude smaller than internet data, making co-training strategies essential
The extension to autonomous driving (EMMA, DriveVLM, MotionLM) shows that the VLA paradigm is not limited to tabletop manipulation — it is a general framework for any system that must perceive, reason, and act in the physical world.
References
- Brohan et al., “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” arXiv:2307.15818, 2023
- Octo Model Team, “Octo: An Open-Source Generalist Robot Policy,” arXiv:2405.12213, 2024
- Kim et al., “OpenVLA: An Open-Source Vision-Language-Action Model,” arXiv:2406.09246, 2024
- Black, Nakamoto et al., “pi0: A Vision-Language-Action Flow Model for General Robot Control,” Physical Intelligence, 2024
- Zhao et al., “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware” (ACT), RSS 2023
- Chi et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” RSS 2023
- Hwang et al., “EMMA: End-to-End Multimodal Model for Autonomous Driving,” arXiv 2024 (accepted at TMLR), 2024
- Seff et al., “MotionLM: Multi-Agent Motion Forecasting as Language Modeling,” ICCV 2023
- Tian et al., “DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models,” arXiv:2402.12289, 2024
- Shao et al., “LMDrive: Closed-Loop End-to-End Driving with Large Language Models,” CVPR 2024
- Ma et al., “DriveLM: Driving with Graph Visual Question Answering,” ECCV 2024
- Li et al., “SIMPLER: Simulated Manipulation Policy Evaluation for Real Robot Setups,” arXiv:2405.05941, 2024
- Open X-Embodiment Collaboration, “Open X-Embodiment: Robotic Learning Datasets and RT-X Models,” arXiv:2310.08864, 2023