Foundation Models for Robotics
For most of robotics history, a robot arm that could sort red blocks from blue ones could not transfer that skill to sorting vegetables from fruit. Every new task required a new model, a new dataset, and weeks of engineering effort. This brittleness was the central obstacle to deploying robots outside carefully structured factory environments. Then, in the early 2020s, something changed: researchers began asking whether the large foundation models transforming language and vision could be adapted to control physical robots as well. The results have been striking enough to define a new subfield.
What Is a Foundation Model?
A foundation model is a large neural network pretrained on massive, diverse datasets, typically in a self-supervised way. The name was coined by Stanford researchers in 2021 to describe models like GPT-3, CLIP, and DALL-E: systems so large and so broadly trained that they develop rich internal representations, which can be transferred to many downstream tasks by fine-tuning on small task-specific datasets.
The key properties of a foundation model are scale, generality, and transferability. Scale means billions of parameters trained on hundreds of billions of text tokens or billions of images. Generality means the internal representations capture structure useful across many tasks, not just the objective the model was trained on. Transferability means a small amount of task-specific data can adapt the model to a new application without training from scratch.
For robotics, this matters enormously. If a robot's perception and reasoning system is pretrained on billions of images and text descriptions from the internet, it already knows what a mug looks like, what the word 'grasp' means, and what a kitchen counter typically contains. A small amount of robot-specific training can then teach it to act on that knowledge, which is far more efficient than training everything from scratch.
The internet contains billions of images of objects in diverse contexts, millions of how-to videos of human hands manipulating things, and billions of sentences describing physical actions. A model pretrained on this data has already acquired a rich physical common sense about the world before it has seen a single robot training example. This prior knowledge is what lets foundation-model-based robots generalize to new objects and tasks far better than their narrowly trained predecessors.
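The transferability property described above can be illustrated with a toy sketch: a frozen "pretrained" backbone produces features, and only a small head is trained on a handful of labelled examples. Everything here is a stand-in chosen for illustration. `BACKBONE_W`, `backbone`, `fine_tune`, and the four data points are hypothetical, and a real foundation model would have billions of frozen or lightly adapted parameters rather than four.

```python
import math

# Stand-in for pretrained weights (hypothetical values, not a real model).
# They are never updated below, mimicking a frozen backbone.
BACKBONE_W = ((1.0, -1.0), (0.5, 0.5))

def backbone(x):
    """Map a raw 2-D input to a 2-D feature vector using the frozen weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in BACKBONE_W]

def predict(head, x):
    """Logistic-regression head applied to the frozen features."""
    feats = backbone(x)
    z = head["b"] + sum(w * f for w, f in zip(head["w"], feats))
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(data, steps=200, lr=0.5):
    """Train only the head with plain stochastic gradient descent."""
    head = {"w": [0.0, 0.0], "b": 0.0}
    for _ in range(steps):
        for x, y in data:
            err = predict(head, x) - y  # gradient of the log loss w.r.t. the logit
            feats = backbone(x)
            head["w"] = [w - lr * err * f for w, f in zip(head["w"], feats)]
            head["b"] -= lr * err
    return head

# Four labelled examples stand in for a small task-specific dataset.
data = [((2.0, 0.0), 1), ((1.5, 0.5), 1), ((0.0, 2.0), 0), ((0.5, 1.5), 0)]
head = fine_tune(data)
predictions = [round(predict(head, x)) for x, _ in data]
print(predictions)
```

The point of the design is that the expensive part (the backbone) is reused unchanged, while the part that is trained is small enough to fit from very little data.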
Vision-Language-Action Models
The most direct application of foundation models to robotics is the vision-language-action model, or VLA. A VLA takes as input a visual observation of the robot's environment, typically one or more camera frames, along with a natural-language instruction such as 'pick up the red apple and place it in the bowl.' It outputs a robot action, such as target joint positions, a gripper open or close command, or a trajectory to follow.
A prominent early step in this direction was SayCan (2022), from Google Robotics and Everyday Robots. SayCan used a large language model for high-level task planning and a separate value function trained in simulation to score which actions were physically feasible. It enabled a robot to follow complex multi-step instructions like 'I spilled my drink. Can you bring me something to clean it up?' by chaining actions that the LLM identified as semantically appropriate and the value function deemed physically executable.
RT-1 (Robotics Transformer 1), also from Google Robotics and published in 2022, went further. It trained a transformer model on over 130,000 real robot demonstrations collected across 13 robots over 17 months. RT-1 achieved 97% success on tasks from its training distribution and, crucially, 76% success on tasks never seen during training, a marked improvement over earlier robot-learning baselines. Its successor, RT-2 (2023), replaced RT-1's custom backbone with a large pretrained vision-language model (PaLI-X) and achieved further gains in generalization, along with emergent reasoning about novel objects.
OpenVLA (2024) brought similar capabilities to an open-source model with 7 billion parameters, trained on the Open X-Embodiment dataset, a collection of robot trajectories from over 20 research institutions. OpenVLA showed that an open model built on community-collected data could rival proprietary VLAs, broadening access to foundation models for robotics.
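The SayCan idea described above, choosing the next skill that is both semantically appropriate and physically executable, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: the `Skill` class, the skill names, and all scores are hypothetical, and multiplying the two scores is one simple way to combine them.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    llm_score: float    # hypothetical: how useful the LLM rates this skill for the instruction
    value_score: float  # hypothetical: value-function estimate that the skill can succeed right now

def combined_score(skill):
    """Usefulness times feasibility; a skill must do well on both to rank high."""
    return skill.llm_score * skill.value_score

def select_skill(skills):
    """Return the candidate skill with the highest combined score."""
    return max(skills, key=combined_score)

# Made-up scores for the instruction "bring me something to clean up a spill".
candidates = [
    Skill("find a sponge", llm_score=0.50, value_score=0.90),
    Skill("find a towel", llm_score=0.40, value_score=0.10),    # useful, but no towel in view
    Skill("go to the table", llm_score=0.10, value_score=0.95), # feasible, but not very useful
]

best = select_skill(candidates)
print(best.name)
```

In a multi-step plan, the selected skill would be executed, appended to the instruction history, and the scoring repeated until the planner chooses to stop.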
Challenges and Limitations
Foundation models for robotics are impressive, but they face significant challenges that limit how widely they can be deployed today.
Data distribution mismatch is the first. Internet images and videos mostly depict humans manipulating objects with human hands in human environments. Robot morphologies differ from human hands, and robot cameras are positioned differently from human eyes. The pretrained representations must bridge this embodiment gap, which is non-trivial.
Safety and reliability are the second. A large language model that hallucinates a wrong fact is an inconvenience. A VLA that hallucinates the wrong action while a robot arm is near a person can cause serious injury. The long tail of rare, dangerous situations in which the model might fail is poorly characterized, and certifying safety is a hard open problem.
Latency is the third. Large transformer models require substantial compute to run inference. Many manipulation tasks require control at 10-100 Hz (ten to one hundred decisions per second), which is far faster than current large VLAs can natively support. Researchers address this by using smaller distilled models for low-level control and larger models for high-level planning, but integrating the two cleanly is an active engineering challenge.
Finally, physical grounding remains limited. A VLA trained on videos and text has seen a lot, but it has not felt the resistance of a jar lid, the fragility of a raw egg, or the slip of wet fingers on a glass. Tactile and proprioceptive grounding, meaning learning from force, touch, and the robot's own kinesthetic experience, remains an underexplored frontier.
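The latency constraint described above comes down to simple arithmetic: a controller running at f Hz has 1000 / f milliseconds per decision. The sketch below works through that budget and shows how many low-level control steps elapse during one call to a slower planner. The function names and the inference times (300 ms for a large model, 8 ms for a small one) are hypothetical values chosen for illustration, not measurements.

```python
import math

def step_budget_ms(control_hz):
    """Time available per control decision, in milliseconds."""
    return 1000.0 / control_hz

def meets_rate(inference_ms, control_hz):
    """True if one inference call fits inside a single control step."""
    return inference_ms <= step_budget_ms(control_hz)

def control_steps_per_plan(planner_ms, control_hz):
    """Number of low-level control steps that pass while one planner call runs."""
    return math.ceil(planner_ms / step_budget_ms(control_hz))

LARGE_MODEL_MS = 300.0  # hypothetical latency of a large VLA
SMALL_MODEL_MS = 8.0    # hypothetical latency of a small distilled controller

for hz in (10, 50, 100):
    print(
        f"{hz:>3} Hz: budget {step_budget_ms(hz):.1f} ms, "
        f"large fits: {meets_rate(LARGE_MODEL_MS, hz)}, "
        f"small fits: {meets_rate(SMALL_MODEL_MS, hz)}, "
        f"control steps per plan: {control_steps_per_plan(LARGE_MODEL_MS, hz)}"
    )
```

Under these assumed numbers, the large model misses even the 10 Hz budget, which is why a two-rate design has the fast controller keep acting on the most recent plan while the slow planner computes the next one.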
RT-2 improved on RT-1 primarily by replacing a custom vision-language backbone with what?
Why is latency a particular challenge for deploying large vision-language-action models in real robot systems?
Design a VLA Training Pipeline
You are a robotics researcher tasked with training a vision-language-action model to help robots in a hospital pharmacy pick and pack medications.
- Step 1: Define the input modalities your VLA will use. What cameras? What other sensors? What language interface will staff use to issue instructions?
- Step 2: Describe the pretraining data you would use. Where does it come from? What does it cover? What are its limitations for this specific domain?
- Step 3: Describe the fine-tuning data you would collect. How many demonstrations? Who demonstrates? What tasks? How do you handle rare edge cases?
- Step 4: Identify the three most likely failure modes of your VLA in the pharmacy setting. For each, propose a mitigation strategy.
- Step 5: What human oversight mechanisms would you require during the deployment phase before the system operates autonomously?
Produce a structured one-page design document covering all five steps.