Design a Perception Stack
You have studied every major layer of robot perception: signal capture, computer vision, depth sensing, state estimation, SLAM, sensor fusion, world modeling, and uncertainty handling. Now you will put it all together. In this lesson you act as a robotics engineer designing a complete perception stack from scratch — choosing sensors, algorithms, representations, and failure policies — and defending every decision against engineering constraints and failure scenarios.
What Is a Perception Stack?
A perception stack is the complete software and hardware pipeline that transforms raw sensor data into the world model a robot's planner consumes. It has four layers.
- Layer 1, Sensing: the physical sensors mounted on the robot. Each sensor is chosen for its coverage, accuracy, latency, power draw, cost, and failure modes. The sensing layer determines what information is available; it sets the ceiling for everything above it.
- Layer 2, Perception algorithms: the software that processes raw sensor data into geometric and semantic outputs, such as object detectors, segmentation networks, depth estimators, and feature extractors. This layer transforms signals into observations.
- Layer 3, State estimation and mapping: Kalman filters, particle filters, and SLAM algorithms. This layer fuses and integrates observations over time to maintain a best estimate of the robot's pose and a map of the environment.
- Layer 4, World model: the persistent, queryable representation of the environment (occupancy grids, semantic maps, scene graphs) that the planner reads from. This layer is the perception stack's output contract with the rest of the system.
Every design decision in one layer has implications for the layers above and below it. A sensor with high latency (Layer 1) forces buffering and temporal alignment when measurements are fused in Layer 3. A world model that stores only 2D geometry (Layer 4) limits the planner to 2D navigation. Understanding these vertical dependencies is the art of perception system design.
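To make the four layers concrete, here is a toy sketch of data flowing from sensing to a world model query. Everything in it is an illustrative assumption rather than a real robot API: the function names (`sense`, `detect_points`, `to_world`), the `Pose2D` and `OccupancyGrid` classes, the hard-coded scan, and the 0.5 m cell size. The pose is simply given here; in a real stack Layer 3 would estimate it.

```python
import math
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Pose2D:
    x: float
    y: float
    theta: float  # heading in radians


def sense() -> List[Tuple[float, float]]:
    """Layer 1 (sensing): stand-in for a range sensor, returns (bearing_rad, range_m)."""
    return [(0.0, 2.2), (math.pi / 2, 1.2), (math.pi, 9.0)]


def detect_points(scan, max_range_m=5.0):
    """Layer 2 (perception): turn raw returns into obstacle points in the robot frame."""
    return [(r * math.cos(b), r * math.sin(b)) for b, r in scan if r < max_range_m]


def to_world(points, pose):
    """Layer 3 (state estimation): place observations in the world frame using the pose."""
    c, s = math.cos(pose.theta), math.sin(pose.theta)
    return [(pose.x + c * px - s * py, pose.y + s * px + c * py) for px, py in points]


@dataclass
class OccupancyGrid:
    """Layer 4 (world model): a 2D set of occupied cells that the planner can query."""
    cell_size_m: float = 0.5
    occupied: Set[Tuple[int, int]] = field(default_factory=set)

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size_m), math.floor(y / self.cell_size_m))

    def insert(self, points):
        for x, y in points:
            self.occupied.add(self._cell(x, y))

    def is_occupied(self, x, y):
        return self._cell(x, y) in self.occupied


pose = Pose2D(x=1.1, y=1.1, theta=0.0)  # assumed known for this toy example
grid = OccupancyGrid()
grid.insert(to_world(detect_points(sense()), pose))
```

Note how the vertical dependencies show up even in a toy: the 9.0 m return is discarded in Layer 2, so no layer above can ever see it, and because the grid is 2D, the only question the planner can ask is `grid.is_occupied(x, y)`.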
A perception stack is not just implemented; it is specified. The planner needs to know: what outputs will the perception stack guarantee? At what rate? With what uncertainty? Under what conditions might it fail? These guarantees form the interface contract between perception and planning. A planner that assumes perfect obstacle detection, paired with a perception stack that admits 2% false negatives in the dark, is a system waiting for a collision.
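One way to keep such a contract honest is to write it down as data that both teams can inspect. The sketch below is a minimal illustration, not a standard format: the `PerceptionContract` class, its field names, the `planner_can_accept` helper, and all numbers except the 2% false-negative rate from the example above are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PerceptionContract:
    """What perception promises the planner (hypothetical fields for illustration)."""
    output_name: str
    rate_hz: float                 # guaranteed publish rate
    max_latency_ms: float          # sensor capture to published output
    false_negative_rate: float     # fraction of real obstacles that are missed
    valid_conditions: Tuple[str, ...]          # where the numbers above hold
    known_failure_conditions: Tuple[str, ...]  # where no guarantee is given


def planner_can_accept(contract, condition, max_tolerated_fn_rate):
    """True only if the contract covers `condition` and its miss rate is tolerable."""
    if condition not in contract.valid_conditions:
        return False
    return contract.false_negative_rate <= max_tolerated_fn_rate


night_obstacles = PerceptionContract(
    output_name="obstacle_boxes",
    rate_hz=30.0,
    max_latency_ms=80.0,               # assumed value
    false_negative_rate=0.02,          # the 2% miss rate in the dark
    valid_conditions=("dark",),
    known_failure_conditions=("heavy_snow",),  # assumed value
)
```

A planner that assumes perfect detection is effectively calling `planner_can_accept(night_obstacles, "dark", 0.0)`, which returns `False`. The mismatch is caught at design review instead of on the sidewalk.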
Every perception algorithm has a computational cost, measured in milliseconds per frame. A robot with a 10 Hz control loop has at most 100 ms per cycle, and perception has to share that budget with planning and control. A YOLOv8-large model on a CPU can take around 400 ms per frame, while an embedded GPU (such as an NVIDIA Jetson Orin) can run it in roughly 15 ms, depending on input resolution and numeric precision. Hardware selection and algorithm choice are inseparable. Always specify the target hardware before claiming an algorithm 'runs in real-time.'
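The budget arithmetic is simple enough to automate. The sketch below assumes the perception stages run one after another and reserves a fraction of each cycle for planning and control. The function names, the 20% reserve, and the depth and fusion latencies are hypothetical; only the 10 Hz loop and the approximate 400 ms and 15 ms detector figures come from the text above.

```python
def frame_budget_ms(loop_rate_hz):
    """Time available per control cycle, in milliseconds."""
    return 1000.0 / loop_rate_hz


def check_budget(stage_latencies_ms, loop_rate_hz, reserve_fraction=0.2):
    """Sum sequential stage latencies and compare against the usable budget.

    reserve_fraction is the share of each cycle kept back for planning and
    control (an assumed value, not a rule).
    Returns (total_ms, usable_budget_ms, fits).
    """
    usable = frame_budget_ms(loop_rate_hz) * (1.0 - reserve_fraction)
    total = sum(stage_latencies_ms.values())
    return total, usable, total <= usable


# Detector latencies from the text; the other stages are placeholder numbers.
on_cpu = {"detector": 400.0, "depth": 20.0, "fusion": 5.0}
on_embedded_gpu = {"detector": 15.0, "depth": 20.0, "fusion": 5.0}

cpu_total, budget, cpu_fits = check_budget(on_cpu, loop_rate_hz=10.0)
gpu_total, _, gpu_fits = check_budget(on_embedded_gpu, loop_rate_hz=10.0)
```

With these numbers the CPU configuration misses the 10 Hz budget by a wide margin and the embedded GPU configuration fits. Real pipelines often overlap stages across threads or accelerators, so treat the sequential sum as a conservative first check.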
Full Perception Stack Design Challenge
- This is the central task of this lesson. You will design a complete perception stack for one of the following three robotic systems. Choose one.
- Option A: AUTONOMOUS LAST-MILE DELIVERY ROBOT — A wheeled robot that navigates city sidewalks to deliver packages. It operates outdoors in all weather, 24 hours a day. It must avoid pedestrians, cyclists, and obstacles, read building addresses, and navigate to a specific door.
- Option B: SURGICAL ASSISTANT ROBOT — A robot arm that assists a surgeon by holding instruments and tracking surgical sites. It operates 30 centimeters from the patient. It must track the surgeon's hands, labeled instruments, and tissue geometry in real time. Failure consequences are life-critical.
- Option C: AUTONOMOUS ORCHARD HARVESTING ROBOT — A mobile robot arm that navigates between rows of fruit trees, detects and grasps ripe fruit, and avoids damaging branches. It operates outdoors in varying light, wind, and among organic, irregularly shaped objects.
- Your design document must address ALL of the following sections:
- SECTION 1. MISSION ANALYSIS (before choosing a single sensor or algorithm):
  - What decisions does the planner need to make? List at least 5 distinct planning decisions.
  - What information does each decision require from the perception stack?
  - What are the consequences of each type of perceptual failure (false positive vs. false negative) for the most critical detections?
- SECTION 2. SENSOR SELECTION:
  - List every sensor you will use. For each: state the sensor type, what it measures, why it is needed, and what its primary limitation is.
  - Identify which sensors are redundant (for fault tolerance) and which are complementary (for coverage).
  - State your target compute platform and justify the choice.
- SECTION 3. PERCEPTION ALGORITHM LAYER:
  - For each major perception task (object detection, depth estimation, segmentation, etc.), name the algorithm or model architecture you would use, its output format, its approximate latency on your chosen hardware, and why you chose it over alternatives.
- SECTION 4. STATE ESTIMATION AND MAPPING:
  - What state does the robot need to estimate? Write the state vector.
  - What filter or SLAM algorithm will you use? Justify the choice.
  - How will you handle the scenario where the primary localization sensor fails?
- SECTION 5. WORLD MODEL:
  - What representation will your world model use?
  - What queries must it answer for the planner, and does your chosen representation support all of them?
  - How will you handle dynamic elements?
- SECTION 6. UNCERTAINTY AND FAILURE POLICY:
  - Identify the three most dangerous perceptual failure modes for your system.
  - For each, specify: how will the system detect the failure? What will it do in response?
  - Define the operating envelope: under what conditions is this perception stack NOT safe to use?
- Present your design as a structured engineering document. Every claim must be justified. Every limitation must be acknowledged. This is the kind of document a robotics company would review before committing to a hardware build.
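For Section 6, it can help to write each failure mode as a detection rule plus a response, so the policy is testable rather than only described. The sketch below is one possible shape, not a prescribed answer: the `Response` levels, the `check_health` function, the two health signals, and both thresholds are hypothetical and would need to come from your own mission analysis.

```python
from enum import Enum


class Response(Enum):
    CONTINUE = "continue"
    SLOW_DOWN = "slow_down"
    SAFE_STOP = "safe_stop"


def check_health(sensor_age_ms, pose_std_m, max_age_ms=200.0, max_pose_std_m=0.5):
    """Map two example health signals to the most conservative response they trigger.

    sensor_age_ms: time since the last valid frame from the primary sensor.
    pose_std_m:    position standard deviation reported by the state estimator.
    Both thresholds are placeholder values, not recommendations.
    """
    if sensor_age_ms > max_age_ms:
        # Stale data: the world model may no longer match reality.
        return Response.SAFE_STOP
    if pose_std_m > max_pose_std_m:
        # Localization is uncertain: keep moving, but with more margin.
        return Response.SLOW_DOWN
    return Response.CONTINUE
```

For example, `check_health(sensor_age_ms=500.0, pose_std_m=0.1)` returns `Response.SAFE_STOP`. The checks are ordered from most to least severe so that a stale sensor is never masked by an acceptable pose estimate.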
Flashcards
A robotics team finishes implementing their perception stack and then hands it to the planning team with the message: 'We output a bounding box for every detected obstacle at 30 Hz.' Why is this an incomplete interface specification?
A perception stack designer chooses a 64-beam rotating lidar as the primary depth sensor for an outdoor robot. During winter testing, the lidar's point cloud becomes filled with spurious returns in snowfall, making the obstacle map unusable. This failure was foreseeable. What design step should have prevented it?