Depth, 3D, and Point Clouds
A single camera image is a 2D projection of a 3D world. That projection discards exactly the information a robot most needs for physical interaction: distance. Two objects of very different sizes can project to the same number of pixels depending on how far they are from the camera. A robot reaching for an object cannot act on pixel coordinates alone — it needs to know where that object is in 3D space. Recovering or directly measuring depth is therefore one of the most fundamental requirements in robot perception.
How Depth Sensors Work
Three sensing modalities dominate 3D robot perception: lidar, stereo cameras, and structured light / time-of-flight depth cameras. Each makes a different physical measurement, has a different failure mode, and produces output at a different density, range, and rate. Lidar (Light Detection and Ranging) fires short pulses of laser light and measures the time each pulse takes to return after reflecting off a surface. Since light travels at a known speed, time-of-flight directly encodes distance. A rotating lidar, like the Velodyne HDL-64E used on early autonomous vehicles, fires 64 laser beams simultaneously across a 360-degree arc, producing up to 1.3 million range measurements per second. Each measurement becomes a point (x, y, z) in the sensor's coordinate frame. The result is a point cloud: a sparse but geometrically accurate 3D snapshot of the environment out to distances of 100+ meters. Lidar produces sparse, accurate, long-range data but cannot measure object color or texture. It also fails on highly reflective surfaces (mirrors, wet roads) and performs poorly in heavy rain or fog, where laser pulses scatter before returning. Stereo cameras place two cameras with a known baseline — the distance between their optical centers. An object at finite distance produces slightly different images in the two cameras: the horizontal offset between corresponding pixels is called disparity, and disparity is inversely proportional to distance. By computing disparity across the image (stereo matching), the system recovers a dense depth map. Stereo cameras produce texture-rich, dense depth but require good lighting and fail in featureless regions (blank walls, uniform floors) where corresponding pixels cannot be matched. Structured light depth cameras (like the Intel RealSense or original Kinect) project a known infrared pattern onto the scene and observe how that pattern deforms on surfaces. Time-of-flight depth cameras (like more recent ToF sensors) measure round-trip time for each pixel simultaneously. Both produce dense depth maps at short range (0.5 to 8 meters typically) and work indoors, but struggle in bright sunlight (which overwhelms the infrared signal) and at the ranges required for autonomous driving.
Lidar produces a sparse cloud — typically a few hundred thousand points covering the visible hemisphere, with significant gaps. Stereo and structured light produce dense maps — a depth value for (nearly) every pixel in the image. Sparse data is easier to process and more geometrically accurate; dense data contains more surface detail and can support texture-based tasks like grasping. Modern systems often fuse both.
Match each depth sensing modality to the property that best distinguishes it from the others.
Terms
Definitions
Drag terms onto their definitions, or click a term then click a definition to match.
Point Clouds: Representation and Processing
A point cloud is an unordered set of 3D points, each encoding a surface location in some coordinate frame. Unlike images (which have a regular grid structure), point clouds are irregular — points are denser where surfaces are close or perpendicular to the sensor, sparser elsewhere, and completely absent in occluded regions. This irregularity creates a processing challenge. Standard 2D convolution, which assumes a regular grid, cannot be applied directly to point clouds. Early approaches voxelized the point cloud — placed it in a 3D grid and filled cells with occupancy values — then applied 3D convolution. This is computationally expensive and loses information at cell boundaries. PointNet (Qi et al., 2017) was a landmark architecture that operated directly on unordered point sets. It processes each point independently through a shared multilayer perceptron (MLP), then applies a symmetric aggregation function (global max pooling) to produce a shape descriptor invariant to point ordering. This allows classification and segmentation of 3D shapes directly from point cloud input without voxelization. A practical processing pipeline for robot navigation: raw lidar scan → remove ground plane (RANSAC plane fitting) → cluster remaining points into candidate objects (DBSCAN or Euclidean cluster extraction) → classify each cluster using a 3D network or geometric heuristics → associate with tracked objects from previous frames. This pipeline runs in tens of milliseconds on embedded hardware, enabling real-time obstacle detection. Normal estimation is frequently applied to point clouds before downstream processing. A surface normal at point p is a vector perpendicular to the local surface. Normals are computed by fitting a local plane to p's neighbors and taking the plane's normal direction. Normals encode surface orientation — essential for grasp planning (the gripper must approach along the surface normal).
Complete the description of point cloud data structure.
Points in a lidar scan are expressed in the sensor's local coordinate frame. To reason about the world, all sensor data must be transformed into a common reference frame — typically the robot's base frame or a global map frame. This requires knowing the sensor's pose (position and orientation) relative to the robot, stored as a rigid-body transformation. Coordinate frame errors, even small ones, compound dramatically over distance.
A mobile robot is navigating a parking garage. Its lidar suddenly produces many spurious return points scattered throughout the scan where no surfaces exist. The most likely physical cause is:
A robot arm needs to grasp a cylindrical can on a table. The perception system provides a point cloud of the can's visible surface. Why is the visible surface point cloud alone insufficient for computing a stable grasp?
Depth Sensor Comparison: Choose the Right Sensor for the Job
- Three robots need depth sensing for different missions. For each robot, recommend the best primary depth sensing modality, justify your choice by comparing the modalities' properties, and identify the biggest risk your recommendation carries.
- Robot A: An autonomous vehicle navigating public roads at up to 60 mph. It must detect pedestrians, vehicles, and obstacles at ranges up to 80 meters in rain and fog.
- Robot B: A robotic arm in a factory that picks parts from a bin. Parts are within 0.5 to 1.5 meters of the sensor. The factory floor is brightly lit. The robot needs dense surface data to plan grasps.
- Robot C: A search-and-rescue drone flying through smoke-filled indoor corridors. It needs to avoid walls and debris at ranges up to 5 meters.
- For each robot: state your recommended modality, list two properties that make it appropriate, and identify the scenario most likely to cause it to fail. Then state whether a secondary modality would be worth adding and why.