AI Foundations

⏱ About 20 min · 20 XP

Robotics and Embodied AI

Every AI system we have studied so far — language models, image classifiers, recommendation engines — lives in a world of numbers. Its inputs are data; its outputs are predictions or text. But some AI systems must interact with the physical world: they receive sensor inputs from cameras, microphones, and pressure sensors, and they output motor commands that move joints, wheels, or grippers. This is embodied AI. It is older than deep learning, and in many ways it remains the hardest problem in the field.

The Perception-Action Loop

A robot's fundamental architecture is a perception-action loop: sense the environment, interpret the sensory data, decide on an action, execute it, observe the result, repeat. Each stage poses distinct challenges that do not arise in purely digital domains.

Perception: a robot's sensors are noisy, limited, and physically constrained. A camera mounted on a robot arm does not see like a human eye: it has a fixed field of view, is affected by lighting changes, and may be occluded by the robot's own body. Depth estimation from a monocular camera requires learning to infer three-dimensional structure from two-dimensional images. Lidar provides accurate point clouds but is expensive and is degraded by rain, fog, and reflective surfaces. Sensor fusion, the combination of data from multiple sensor modalities, is a fundamental technique: a system that fuses camera, lidar, and radar is more robust than one relying on any single sensor, because the failures and conditions that degrade one sensor type often do not degrade the others.

State estimation: a robot must maintain a model of where it is and what is around it, even as it moves. Simultaneous Localization and Mapping (SLAM) is the class of algorithms that build a map of an unknown environment while tracking the robot's position within it. Classic SLAM uses probabilistic filters (particle filters, Kalman filters) to represent the robot's pose as a probability distribution rather than a single point, which accounts for sensor noise.

Planning: given a current state and a goal, what sequence of actions reaches the goal while avoiding obstacles? In a structured environment such as a warehouse, where the robot arm knows exactly where every box is and each box's dimensions are specified to the millimeter, planning is tractable. In an unstructured environment (a kitchen, a construction site, or a public sidewalk), the combinatorial explosion of possible configurations often makes classical planning alone insufficient, and reinforcement learning or learned motion policies are often used instead.
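The sensor fusion and state estimation ideas above can be made concrete with a one-dimensional Kalman filter. The sketch below is a minimal illustration, not a SLAM implementation: it tracks a single position along a line and assumes Gaussian noise throughout. The names `Belief`, `predict`, `update`, and `fuse_step`, and the camera and lidar readings with their variances, are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Belief:
    """Gaussian belief over a 1-D position: a mean and a variance."""
    mean: float
    var: float


def predict(belief: Belief, motion: float, motion_var: float) -> Belief:
    """Motion step: shift the mean by the commanded motion and grow the uncertainty."""
    return Belief(belief.mean + motion, belief.var + motion_var)


def update(belief: Belief, measurement: float, sensor_var: float) -> Belief:
    """Measurement step: blend the prediction with a reading, weighted by confidence."""
    gain = belief.var / (belief.var + sensor_var)  # Kalman gain, between 0 and 1
    mean = belief.mean + gain * (measurement - belief.mean)
    var = (1.0 - gain) * belief.var
    return Belief(mean, var)


def fuse_step(
    belief: Belief,
    motion: float,
    motion_var: float,
    readings: Iterable[Tuple[float, float]],
) -> Belief:
    """One loop iteration: predict, then fold in each (value, variance) reading."""
    belief = predict(belief, motion, motion_var)
    for value, sensor_var in readings:
        belief = update(belief, value, sensor_var)
    return belief


if __name__ == "__main__":
    start = Belief(mean=0.0, var=1.0)
    # Assumed readings: a noisy camera-based estimate and a more precise lidar range.
    readings = [(1.2, 0.5), (0.95, 0.05)]
    fused = fuse_step(start, motion=1.0, motion_var=0.1, readings=readings)
    print(fused)
```

In this setup the fused variance ends up smaller than the variance of either sensor alone, which is the quantitative version of the claim that fusing sensors is more robust than relying on any single one. The estimate is kept as a distribution (mean and variance) rather than a single point, mirroring how classic SLAM treats the robot's pose.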

Moravec's Paradox

In 1988, roboticist Hans Moravec observed that tasks that are difficult for humans — chess, calculus, medical diagnosis — are relatively easy for computers, while tasks trivially easy for humans — recognizing a cup by feel, walking on gravel, catching a thrown ball — are extraordinarily hard for robots. This paradox reflects the fact that skills humans find easy are often supported by millions of years of evolved sensorimotor machinery, while skills humans find hard are recent additions to cognition that are well-described by explicit rules.

The hardest part of embodied AI is often manipulation: getting a robot hand or gripper to interact with objects in an unstructured environment. A task as mundane as picking up a crumpled piece of paper from a table requires answering four questions: where is the paper (perception), how does it deform when touched (contact modeling), where should I grasp it (grasp planning), and how much force should I apply (force control). Humans solve these in milliseconds through proprioception and tactile feedback that robots only partially replicate.

Reinforcement learning has been applied to robot manipulation with notable successes. DeepMind's work on dexterous in-hand manipulation, OpenAI's robot that solved a Rubik's Cube one-handed (Dactyl, 2019), and more recently Google's RT-2 (Robotic Transformer 2, 2023), which used a vision-language model pre-trained on internet data to enable a robot to follow natural language manipulation instructions, represent genuine advances. But these systems require enormous amounts of training (often in simulation, then transferred to real hardware), and they remain brittle: an RT-2-class robot that competently picks up an apple on a table in the lab may fail on the same task with different lighting, a different apple variety, or a cluttered background.

The sim-to-real gap is one of the central problems of robot learning. Training in simulation is fast and safe, since you can run millions of trials, but the simulated physics and visual appearance differ from the real world. Techniques for closing this gap include domain randomization (training across many varied simulated conditions so the policy becomes robust to variation) and system identification (carefully measuring real-world physical parameters and incorporating them into the simulator).
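Domain randomization, as described above, can be illustrated with a deliberately tiny example. The sketch below is a toy, not a physics simulator and not a learned policy: the "policy" is a single grip force, and `run_episode` is a hypothetical stand-in that declares a grasp successful when friction times grip force can carry the object's weight. The parameter ranges and function names are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence

GRAVITY = 9.81  # m/s^2

# Hypothetical randomization ranges, chosen only for illustration.
FRICTION_RANGE = (0.4, 1.2)
MASS_RANGE_KG = (0.05, 0.5)


@dataclass
class SimParams:
    """Physical conditions for one simulated episode."""
    friction: float
    mass_kg: float


def sample_params(rng: random.Random) -> SimParams:
    """Draw a fresh set of simulated physical conditions for one episode."""
    return SimParams(
        friction=rng.uniform(*FRICTION_RANGE),
        mass_kg=rng.uniform(*MASS_RANGE_KG),
    )


def run_episode(params: SimParams, grip_force: float) -> bool:
    """Toy stand-in for a simulator: the grasp holds if friction can carry the weight."""
    return params.friction * grip_force >= params.mass_kg * GRAVITY


def success_rate(grip_force: float, rng: random.Random, episodes: int = 1000) -> float:
    """Fraction of randomized episodes in which the grasp holds."""
    wins = sum(run_episode(sample_params(rng), grip_force) for _ in range(episodes))
    return wins / episodes


def tune_grip_force(
    candidates: Sequence[float], rng: random.Random, target: float = 0.99
) -> Optional[float]:
    """Return the smallest candidate force that meets the target across randomized episodes."""
    for force in sorted(candidates):
        if success_rate(force, rng) >= target:
            return force
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    nominal = SimParams(friction=0.8, mass_kg=0.2)
    # A force tuned on one nominal condition works there...
    print("nominal success at 2.5 N:", run_episode(nominal, 2.5))
    # ...but does much worse once conditions vary.
    print("randomized success at 2.5 N:", success_rate(2.5, rng))
    print("robust choice:", tune_grip_force([2.5, 5.0, 7.5, 10.0, 13.0], rng))
```

A force chosen against a single nominal condition succeeds there but fails often once friction and mass vary, while a force selected across the randomized conditions holds up. Real domain randomization applies the same idea to a learned policy and to many more parameters, including visual ones such as lighting and textures.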

The Brittleness Problem

Current robot learning systems, including the most advanced manipulation policies, are brittle outside their training distribution. A policy that succeeds 95% of the time on one table may fail 50% of the time with a different tablecloth, different object placement, or different ambient lighting. This brittleness is not a minor engineering issue — it reflects a fundamental gap between the diversity of the real world and the coverage of any training distribution.

Fill in the correct terms.

A robot that builds a map while tracking its own position within it is performing ______, while the gap between a policy trained in simulation and its performance in the real world is called the ______ gap.

Why does reinforcement learning often occur in simulation rather than directly on physical robots?

Moravec's paradox suggests that AI should find which task relatively easy compared to a human?

Robot Task Complexity Analysis

  1. Choose a physical task that seems simple: making a sandwich, folding a towel, watering a plant, or loading a dishwasher.
  2. Break the task into every sub-step a robot would need to perform, from the moment it receives the instruction to the moment the task is complete.
  3. For each sub-step, identify: What sensory input does the robot need? What information must it estimate or infer? What could go wrong?
  4. Count the number of distinct sensing, estimation, and manipulation sub-problems.
  5. Reflect: before this exercise, did you expect this task to have this many sub-problems? What does that tell you about why household robotics is not yet a solved problem?
  6. Share your analysis with a classmate who chose a different task. Which task was more complex than expected?