Robotics & Embodied AI

⏱ About 15 min · 15 XP

Human-Robot Interaction

When you meet a new person, you exchange information in dozens of ways simultaneously: words, tone of voice, eye contact, hand gestures, facial expressions, posture. You do this effortlessly, without thinking about the rules. Robots have no such effortless social instinct. Teaching a machine to communicate naturally with a human being is one of the richest and most difficult problems in all of robotics.

Human-Robot Interaction (HRI)

Human-robot interaction, often abbreviated HRI, is the field of research that studies how people and robots communicate, understand each other, and work together. It draws from robotics, psychology, cognitive science, and design.

Speech and Language

Speech is the most natural communication channel for humans, so robots that operate near people are often given voice interfaces. A voice interface has two components: automatic speech recognition (ASR), which converts spoken words into text, and natural language understanding (NLU), which figures out what the person meant by those words. Recognizing words is hard enough — accents, background noise, and rapid speech all create errors. Understanding intent is harder still. If you tell a robot helper in a grocery store 'I am looking for something sweet,' the robot needs to infer that you want a product recommendation, not a definition of the word sweet. Context, common sense, and conversational memory all play a role that simple keyword matching cannot handle. Modern robots increasingly use large language models as their language understanding backbone, giving them far more conversational ability than the hand-coded dialog systems of a decade ago.
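The two-stage pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fake_asr`, `keyword_nlu`, `pattern_nlu`, and the `Intent` type are hypothetical names, the recognizer is a lookup table standing in for a real ASR model, and the phrase patterns stand in for the learned language model a real robot would more likely use.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    name: str
    slots: dict


def fake_asr(clip_id: str) -> str:
    """Stand-in for speech recognition: maps a clip id to a transcript.

    A real recognizer would take audio samples; this table is an assumption
    made so the sketch can run without any audio.
    """
    transcripts = {
        "clip_sweet": "i am looking for something sweet",
        "clip_room": "i need to find room 412",
    }
    return transcripts[clip_id]


def keyword_nlu(text: str) -> Intent:
    """Naive keyword matching: reacts to one word and ignores its context."""
    if "sweet" in text.split():
        return Intent("define_word", {"word": "sweet"})
    return Intent("unknown", {})


def pattern_nlu(text: str) -> Intent:
    """Slightly better: looks at the phrase around the keyword."""
    if "looking for" in text:
        # Whatever follows "looking for" describes what the person wants.
        wanted = text.split("looking for", 1)[1].strip()
        return Intent("recommend_product", {"description": wanted})
    if "find room" in text:
        room = text.split("find room", 1)[1].strip()
        return Intent("navigate", {"room": room})
    return Intent("unknown", {})


text = fake_asr("clip_sweet")   # step 1: sounds to words
print(keyword_nlu(text))        # misreads the request as a definition lookup
print(pattern_nlu(text))        # step 2: words to intent
```

Even the second function is brittle: a shopper who says "got anything sweet?" would fall through to `unknown`, which is one reason hand-written rules have largely given way to learned models.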

Gestures and Gaze

People communicate extensively without words. A pointed finger, a head nod, a glance toward an object — these are meaningful signals that a socially aware robot must detect and interpret. Gesture recognition uses cameras and computer vision to identify hand and body movements. A robot that can follow a pointing gesture can understand 'put it over there' without needing the speaker to specify exact coordinates. Gaze tracking — determining where a person is looking — helps a robot judge whether it has a person's attention, whether the person is confused, and where the person is directing their focus. Some robots produce gestures themselves. A robot that tilts its head when uncertain, nods when acknowledging a request, or points to draw a human's attention to something is using nonverbal cues to make the interaction feel more natural and legible.
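One simple way to turn a pointing gesture or a gaze direction into something a robot can act on is to compare angles between 3D vectors. The sketch below assumes a vision system has already produced 3D positions for the person's elbow, hand, and head, plus positions of nearby objects; those inputs, the function names, and the angle thresholds are all illustrative assumptions rather than part of any particular robot's software.

```python
import math


def subtract(p, q):
    """Vector from point q to point p."""
    return tuple(a - b for a, b in zip(p, q))


def angle_between(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def resolve_pointing(elbow, hand, objects, max_angle_deg=15.0):
    """Return the object best aligned with the elbow-to-hand ray, or None."""
    ray = subtract(hand, elbow)
    best_name = None
    best_angle = math.radians(max_angle_deg)
    for name, position in objects.items():
        angle = angle_between(ray, subtract(position, hand))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name


def has_attention(head, gaze_direction, robot, cone_deg=10.0):
    """True if the gaze direction falls within a small cone around the robot."""
    to_robot = subtract(robot, head)
    return angle_between(gaze_direction, to_robot) <= math.radians(cone_deg)


objects = {"table": (2.0, 0.1, 1.0), "shelf": (0.5, 2.0, 1.0)}
target = resolve_pointing(elbow=(0.0, 0.0, 1.0), hand=(0.3, 0.0, 1.0), objects=objects)
print(target)  # the object "over there"
print(has_attention(head=(0.0, 0.0, 1.6), gaze_direction=(1.0, 0.0, 0.0), robot=(2.0, 0.0, 1.5)))
```

The sketch does not handle an object located exactly at the hand position, which would produce a zero-length vector. A real system would also need to cope with noisy joint estimates, typically by averaging over several camera frames.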

Facial Expression and Emotion

Some robots, particularly those used in education, eldercare, or therapy, are designed to recognize and express emotions. Robots like Pepper and NAO use cameras to analyze facial expressions and body language, classifying a person's apparent emotional state so they can adjust their responses. If a student looks confused, a tutoring robot might slow down and offer a different explanation. Robots can also produce expressions: screens, articulated faces, or carefully chosen movements signal emotional states to humans. HRI studies suggest that when a robot expresses emotions, even very simply, people treat it differently. They tend to be more patient with it, more forgiving of errors, and more likely to collaborate effectively. This raises a difficult question: if a robot can simulate emotions it does not feel, is displaying those emotions honest or manipulative? HRI researchers take this question seriously.
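The tutoring example can be sketched as a small decision rule. The code below assumes an upstream classifier that returns an emotion label and a confidence score between 0 and 1. Only the "confused" case comes from the text; the other labels, the action names, and the 0.6 threshold are hypothetical choices made for illustration.

```python
def choose_tutor_action(emotion: str, confidence: float, threshold: float = 0.6) -> str:
    """Pick a tutoring response from an apparent emotional state.

    If the classifier is unsure, the robot changes nothing, which avoids
    reacting to a single misread facial expression.
    """
    if confidence < threshold:
        return "continue"
    actions = {
        "confused": "slow_down_and_rephrase",   # the case described in the text
        "frustrated": "offer_hint",             # assumed for illustration
        "bored": "offer_harder_problem",        # assumed for illustration
    }
    return actions.get(emotion, "continue")


print(choose_tutor_action("confused", 0.85))
print(choose_tutor_action("confused", 0.30))
```

Because the classifier only sees an apparent state, a cautious design treats its output as a hint to check in with the learner, not as ground truth about how they feel.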

The Uncanny Valley

As robots look more human, people's comfort with them generally increases — until the robot becomes almost-but-not-quite human. At that point, familiarity flips to unease or revulsion. This dip in the comfort graph is called the uncanny valley. Many robot designers deliberately keep robots looking clearly robotic or clearly cartoon-like to avoid the valley.

A robot in a hospital hallway hears a visitor say 'I need to find room 412.' The robot must convert those sounds into words, then understand what help is being requested. Which two technologies handle these two steps?

Why do many robot designers deliberately make their robots look clearly robotic rather than highly realistic?

Decode the Interaction

  1. Watch a short clip (or read a description) of a person interacting with a voice assistant, robot, or chatbot.
  2. List every communication channel the human uses: spoken words, tone, gesture, facial expression, gaze.
  3. For each channel, note whether the robot or assistant successfully interpreted it, failed to interpret it, or ignored it.
  4. Identify the single biggest communication gap in the interaction: what did the human express that the robot completely missed?
  5. Propose one technical improvement that could close that gap. Be specific about what sensor or algorithm would help.
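If you prefer to record your observations in code rather than on paper, the channel-by-channel notes from the activity could be tallied with a short script like the one below. The channel and outcome names mirror the steps above; the function name `summarize` and the sample observations are made up for illustration.

```python
from collections import Counter

CHANNELS = ["spoken words", "tone", "gesture", "facial expression", "gaze"]
OUTCOMES = {"interpreted", "failed", "ignored"}


def summarize(observations: dict):
    """Count outcomes and list the channels the system did not handle."""
    for channel, outcome in observations.items():
        if channel not in CHANNELS:
            raise ValueError(f"unknown channel: {channel}")
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
    counts = Counter(observations.values())
    # Channels the human never used are simply absent from the notes.
    gaps = [c for c in CHANNELS if observations.get(c) in ("failed", "ignored")]
    return counts, gaps


# Sample notes from an imagined clip (not real data).
notes = {
    "spoken words": "interpreted",
    "tone": "ignored",
    "gesture": "failed",
    "gaze": "ignored",
}
counts, gaps = summarize(notes)
print(counts)
print(gaps)
```

The list of gaps is a starting point for choosing the single biggest gap and for deciding which sensor or algorithm would address it.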