AI Agents & Automation

Monitoring and Observability

Deploying an agent and hoping it works is not engineering — it is optimism. Real engineering requires the ability to see what the agent is doing, detect when something is wrong, and diagnose why. This requires observability: the property of a system that allows you to understand its internal state from its external outputs. For agents, achieving observability requires deliberate design decisions made before deployment, not instrumentation added after something breaks.

Observability vs. Monitoring

Monitoring and observability are related but distinct. Monitoring answers the question: 'Is the system healthy right now?' It tracks a predefined set of metrics — error rate, latency, success rate, cost per task — against defined thresholds, and triggers alerts when thresholds are breached. Monitoring is forward-looking: you decide in advance what could go wrong, measure it, and set alarms. Observability answers a broader question: 'What is the system doing, and why?' It provides the data necessary to investigate novel failures that you did not anticipate when setting up monitoring. Where monitoring tells you that something is wrong, observability tells you what happened in enough detail that you can diagnose the root cause and fix it. For agents, both are necessary. Monitoring catches known failure patterns quickly. Observability allows you to investigate the unknown failures that agents — with their stochastic, multi-step behavior — will inevitably produce.
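The monitoring half of this distinction can be sketched in a few lines: a fixed set of metrics, each compared against a threshold chosen in advance. This is a minimal illustration, not a production alerting system. The metric names, limits, and the `Threshold` and `check_thresholds` helpers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """Alert rule for one metric. All names and limits here are hypothetical."""
    metric: str
    limit: float
    direction: str  # "below": alert when value < limit; "above": alert when value > limit


def check_thresholds(snapshot: Dict[str, float], rules: List[Threshold]) -> List[str]:
    """Return one alert message per breached rule.

    Metrics missing from the snapshot are skipped in this sketch; a real
    system would probably treat a missing metric as an alert of its own.
    """
    alerts = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is None:
            continue
        breached = value < rule.limit if rule.direction == "below" else value > rule.limit
        if breached:
            alerts.append(f"{rule.metric}={value} is {rule.direction} limit {rule.limit}")
    return alerts


RULES = [
    Threshold("task_success_rate", 0.85, "below"),
    Threshold("error_rate", 0.05, "above"),
    Threshold("p95_latency_seconds", 60.0, "above"),
    Threshold("cost_per_task_usd", 0.50, "above"),
]

snapshot = {
    "task_success_rate": 0.71,
    "error_rate": 0.02,
    "p95_latency_seconds": 42.0,
    "cost_per_task_usd": 0.61,
}
alerts = check_thresholds(snapshot, RULES)
for message in alerts:
    print("ALERT:", message)
```

Every rule here encodes a failure someone anticipated. A failure that moves none of these four numbers passes silently, which is the gap observability data is meant to cover.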

The Three Pillars of Observability

The three foundational data types for observability are logs (structured records of individual events such as each tool call, each model response, and each decision point), metrics (numerical measurements over time such as latency, success rate, and token count per task), and traces (end-to-end records of a single agent execution that link all its steps into a coherent timeline). Together, these three let you answer most questions about agent behavior from recorded data rather than from guesswork.

Tracing Agent Executions

A trace is a structured record of every step in a single agent execution, from the initial task input to the final output or terminal error. A good trace captures: the timestamp and duration of each step, the tool called (or model queried) at each step and its input arguments, the response received, the agent's next decision given that response, and any errors or retries that occurred. Tracing is what allows a developer to reconstruct the exact sequence of events in a failed agent run, even days after the fact. Without a trace, a failure in a complex 20-step agent execution is nearly impossible to diagnose: the behavior is stochastic, the environment state has changed, and there is no record of what the agent actually did. With a trace, the developer can step through the recorded execution, identify exactly where the reasoning went wrong, and determine whether the failure was due to model error, environment mismatch, or a design flaw. Many agent frameworks and observability platforms provide tracing utilities out of the box. LangSmith (from the LangChain team), Langfuse (open source and framework-agnostic), and Weights & Biases (W&B Weave) are commonly used production tracing systems for agents. Each captures the structure of agent executions and provides a UI for browsing and filtering traces.
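To make the list of trace fields concrete, here is a minimal in-memory sketch of a trace recorder. It is not the API of LangSmith, Langfuse, or W&B. The `Step` and `Trace` classes, their field names, and the `search` and `flaky_lookup` stand-in tools are assumptions made for illustration.

```python
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Step:
    """One step of an agent run. Field names are illustrative, not a standard schema."""
    index: int
    tool: str
    arguments: Dict[str, Any]
    started_at: float                    # Unix timestamp when the step began
    duration_s: float                    # wall-clock duration, including retries
    response: Optional[Any] = None
    error: Optional[str] = None          # last error if every attempt failed
    retries: int = 0
    next_decision: Optional[str] = None  # set by the caller once the agent decides


@dataclass
class Trace:
    task_input: str
    steps: List[Step] = field(default_factory=list)

    def record(self, tool: str, fn: Callable[..., Any],
               arguments: Dict[str, Any], max_retries: int = 1) -> Step:
        """Call fn(**arguments), retrying on exceptions, and append the step."""
        started_at, t0 = time.time(), time.perf_counter()
        response, error, attempts = None, None, 0
        while True:
            attempts += 1
            try:
                response, error = fn(**arguments), None
                break
            except Exception as exc:
                error = f"{type(exc).__name__}: {exc}"
                if attempts > max_retries:
                    break
        step = Step(len(self.steps), tool, arguments, started_at,
                    time.perf_counter() - t0, response, error, attempts - 1)
        self.steps.append(step)
        return step

    def first_error(self) -> Optional[Step]:
        """Return the earliest step that ended in an error, if any."""
        return next((s for s in self.steps if s.error is not None), None)


def search(query: str) -> str:          # hypothetical stand-in tool
    return f"3 results for {query!r}"


def flaky_lookup(record_id: str) -> str:  # hypothetical tool that always fails
    raise TimeoutError("upstream timed out")


trace = Trace(task_input="Find the filing deadline")
step = trace.record("search", search, {"query": "filing deadline"})
step.next_decision = "open first result"
trace.record("lookup", flaky_lookup, {"record_id": "A-1"}, max_retries=2)

first_failure = trace.first_error()
exported = asdict(trace)  # plain dicts and lists, ready to serialize and store
```

Because each step stores its arguments, response, and error together, the question "where did this run first go wrong?" becomes a lookup over `trace.steps` instead of a reconstruction from memory.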

Metrics and Dashboards

The key agent metrics to track in production differ from traditional software metrics in important ways. For a web server, you track request latency and error rate. For an agent, you track:

Task success rate: the fraction of tasks that achieve the intended outcome, measured continuously as the agent operates in production. This is the primary health metric.
Step count per task: how many tool calls and model queries each task requires. A rising step count indicates the agent is becoming less efficient or is encountering new patterns it struggles with.
Cost per task: total token spend and API cost per completed task. A rising cost per task with constant success rate may indicate a regression in efficiency; a rising cost with declining success may indicate the agent is looping.
Time to completion: wall-clock time per task. Relevant both for user experience and for detecting loops: a task that usually takes 30 seconds but is now taking 5 minutes is likely in trouble.
Error type distribution: breakdown of failures by category (tool errors, model errors, timeout errors, scope errors). Changes in this distribution point directly to where the agent is breaking.
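As a rough sketch of how these five metrics could be derived from per-run records, the example below aggregates a small batch of runs. The `RunRecord` fields and the sample values are hypothetical. Cost is averaged over all runs here, which is one possible reading of "cost per task"; dividing total cost by successful runs only is another.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunRecord:
    """Summary of one finished agent run. Field names are illustrative."""
    succeeded: bool
    steps: int                        # tool calls plus model queries
    cost_usd: float
    duration_s: float
    error_type: Optional[str] = None  # e.g. "tool", "model", "timeout", "scope"


def summarize(runs: List[RunRecord]) -> Dict[str, object]:
    """Aggregate a batch of runs into the dashboard metrics described above."""
    if not runs:
        raise ValueError("no runs to summarize")
    failures = [r for r in runs if not r.succeeded]
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mean_steps_per_task": mean(r.steps for r in runs),
        # Assumption: averaged over all runs, successful or not.
        "mean_cost_per_task_usd": mean(r.cost_usd for r in runs),
        "mean_duration_s": mean(r.duration_s for r in runs),
        "error_type_distribution": dict(
            Counter(r.error_type or "unknown" for r in failures)
        ),
    }


runs = [
    RunRecord(True, 6, 0.10, 30.0),
    RunRecord(True, 8, 0.14, 41.0),
    RunRecord(False, 20, 0.40, 300.0, "timeout"),
    RunRecord(False, 5, 0.08, 25.0, "tool"),
]
summary = summarize(runs)
```

In this sample the timed-out run has the highest step count, cost, and duration of the batch, which is the looping signature the metric descriptions warn about.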

Anomaly detection applies statistical methods to these time-series metrics to automatically identify when behavior deviates from historical patterns. A simple baseline anomaly detector might flag any metric that moves more than two standard deviations from its 7-day rolling average. More sophisticated systems use learned baselines that account for time-of-day and day-of-week patterns, sudden-change detectors that catch step-changes faster than rolling averages, and multivariate detectors that look for correlated changes across multiple metrics simultaneously — which often indicate a systemic failure rather than a local one.
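The simple baseline detector described above can be sketched as follows. It assumes one metric value per day and compares each day against the mean and sample standard deviation of the preceding seven days. The function names and the sample series are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def is_anomalous(history: Sequence[float], value: float, k: float = 2.0) -> bool:
    """True when value is more than k sample standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change counts as a deviation
    return abs(value - mu) > k * sigma


def flag_anomalies(daily: Sequence[float], window: int = 7, k: float = 2.0) -> List[int]:
    """Return the indices whose value is anomalous relative to the preceding `window` values."""
    return [
        i for i in range(window, len(daily))
        if is_anomalous(daily[i - window:i], daily[i], k)
    ]


# Hypothetical daily task success rates; the last day drops sharply.
daily_success = [0.87, 0.88, 0.86, 0.87, 0.89, 0.86, 0.88, 0.87, 0.71]
anomalous_days = flag_anomalies(daily_success)
print(anomalous_days)  # [8]
```

The baseline window excludes the day being tested, so an outlier cannot inflate its own baseline. This sketch covers none of the more sophisticated variants mentioned above, such as seasonality-aware baselines, step-change detectors, or multivariate detectors.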

Log Decisions, Not Just Actions

A common mistake in agent logging is to record only the tools called and their outputs, omitting the agent's intermediate reasoning. The most valuable diagnostic information is often the model's chain-of-thought or step-level reasoning at the point just before a wrong action. Capturing structured reasoning logs — not just action logs — is what turns an agent trace from a sequence of events into a diagnostic tool.
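One way to capture decisions alongside actions is to emit a single structured log line per step that carries both. The sketch below uses only the Python standard library. The field names, the logger name, and the example reasoning text are assumptions, and how the reasoning text is obtained depends on the model and framework in use.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict

logger = logging.getLogger("agent.decisions")  # hypothetical logger name


def log_decision(run_id: str, step: int, reasoning: str,
                 action: str, arguments: Dict[str, Any]) -> str:
    """Emit one JSON log line pairing the chosen action with the reasoning behind it."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "step": step,
        "reasoning": reasoning,  # step-level reasoning text as produced by the model
        "action": action,
        "arguments": arguments,
    }
    line = json.dumps(entry, sort_keys=True)
    logger.info(line)
    return line


line = log_decision(
    run_id="run-123",
    step=8,
    reasoning="Search returned no form ID, so I will assume the default form.",
    action="submit_form",
    arguments={"form_id": "default"},
)
parsed = json.loads(line)
```

An action-only log would show `submit_form` called with `form_id="default"` and nothing else. The reasoning field is what reveals that the agent guessed after a failed search.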

Match each observability data type to the specific diagnostic question it is best equipped to answer.

Terms

Trace of a single agent execution
Time-series metric: step count per task (7-day chart)
Structured log entry: tool call arguments and response
Error type distribution dashboard
Anomaly alert: task success rate dropped 15 points vs. 7-day baseline

Definitions

Has the agent become less efficient over the past week, or is today's high step count a one-off?
What category of failure is most common this week, and is it changing over time?
What exact input did the agent pass to the search tool, and what did the tool return?
Has something changed in the last few hours that is making the agent fail more often than it normally does?
Why did this specific agent run fail at step 8 when all the previous steps looked correct?

A production agent's task success rate drops from 87% to 71% over a 48-hour period. An anomaly alert fires. What should the on-call engineer do first?

A developer notices that the traces for their agent only record which tools were called and their outputs, but not the agent's step-level reasoning. Why is this a significant gap in observability?

Design a Monitoring Dashboard

You are the reliability engineer for a deployed agent that helps users fill out government benefit applications. The agent collects information, validates it against form requirements, and submits completed applications on the user's behalf. Errors have serious real-world consequences: a wrong submission can delay benefits for weeks.

Step 1: Design the monitoring dashboard for this agent. List exactly which metrics you would display, the time range for each chart, and the alert threshold for each metric. Be specific, for example: 'Task success rate: 7-day rolling chart, alert if drops below 90% for 1 hour.'
Step 2: Define what information a trace for this agent must capture. Write a checklist of the fields every trace entry should include, with a brief justification for each field.
Step 3: Design one anomaly detection rule for this agent that would not be captured by simple threshold monitoring. Describe what multi-metric pattern it would look for, and what failure mode it is designed to detect early.
Step 4: A user reports that their application was submitted with incorrect information. Walk through the exact debugging process you would follow using your observability infrastructure. What would you look at first, second, and third?