Observability: Metrics & Tracing
Stage 3 · DevOps, Deployment & Operations · B.U.I.L.D. letter: D
You shipped it. Users are using it. Then one Tuesday morning your phone buzzes: "Hey, is the site down?" — and you have absolutely no idea. You don't know when it broke, which part is slow, or why it's happening to only some users. That gap between "it's deployed" and "I know what it's doing right now" is called observability, and closing that gap is what this lesson is about.
⚠️ The vibe trap
You built a beautiful front end, deployed it, and life is good — until a user DMs you that checkout has been slow "for like two weeks." You had no idea. The vibe coding part is done; the engineering part is just beginning.
The second trap is checking averages: your dashboard says average response time is 120 ms, so everything is fine — except the slowest 5% of users are waiting 4 seconds and silently leaving. Averages hide the suffering. Percentiles reveal it. If you're not measuring the right things in the right way, you are flying blind with a smile on your face.
🗂️ The Three Pillars of Observability
Every observable system is built on the same three signals. They answer different questions, and you need all three.
| Pillar | What it is | Question it answers |
|---|---|---|
| Logs | Timestamped text events | "What happened?" |
| Metrics | Numbers aggregated over time | "How much / how often?" |
| Traces | End-to-end journey of one request | "Where did the time go?" |
Think of it this way: logs are your diary, metrics are your vital signs, and traces are the security camera footage that shows you exactly where a suspect went.
Mental model: A user hits your /checkout endpoint. The metric tells you it's slow today. The log tells you there was a database timeout error at 14:32. The trace shows you that the slowness is all inside the payment-service, specifically in the validate_card span — not in your code at all, but in the third-party SDK it calls.
Why it matters: Without all three, you're guessing. Logs alone flood you with noise. Metrics alone can't tell you which user or which code path. Traces alone don't tell you system-wide trends. Together they give you the full picture.
Common mistake: Logging everything but never setting up metrics, so you have gigabytes of text you have to grep in a crisis. Start with structured metrics from day one.
┌───────────────────────────────────────────────────────┐
│              THE THREE PILLARS IN ACTION              │
│                                                       │
│  METRICS         LOGS              TRACES             │
│  ───────         ────              ──────             │
│  p99=4200ms      ERROR timeout     [checkout]──┐      │
│  errors=3.2%     at 2026-06-03     [auth] 20ms │      │
│  req/s=840       14:32:11 UTC      [payment]◄──┘      │
│                  payment-service   [validate] 3.8s    │
│                                                       │
│  "Are we slow?"  "What broke?"     "Where exactly?"   │
└───────────────────────────────────────────────────────┘
📏 Metrics: Numbers That Tell the Truth (If You Measure the Right Ones)
Two frameworks give you a shortlist of what actually matters.
RED (for services — request-driven things):
- Rate — requests per second
- Errors — error rate as a percentage
- Duration — how long requests take
USE (for resources — CPU, memory, queues):
- Utilization — what percentage of capacity is in use
- Saturation — how much work is queued/waiting
- Errors — hardware/resource errors
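Together they cover most of what you need to know about a service in production: RED describes the service as its callers experience it, and USE tells you whether the resources underneath are the reason.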
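As a rough illustration of the Rate and Errors parts of RED, the sketch below derives both from two snapshots of cumulative counters. The snapshot shape, the field names, and the numbers are assumptions made up for this example; a metrics backend normally does this subtraction for you.

```javascript
// Hypothetical snapshots of two cumulative counters, taken 60 seconds apart.
const earlier = { timestampMs: 0, requests: 120000, errors: 300 };
const later = { timestampMs: 60000, requests: 170400, errors: 1913 };

// Rate and Errors from RED, computed from counter deltas over the window.
function redFromSnapshots(a, b) {
  const seconds = (b.timestampMs - a.timestampMs) / 1000;
  const requests = b.requests - a.requests;
  const errors = b.errors - a.errors;
  return {
    ratePerSecond: requests / seconds,
    // Guard against dividing by zero when the window saw no traffic.
    errorPercent: requests === 0 ? 0 : (errors / requests) * 100,
  };
}

const red = redFromSnapshots(earlier, later);
console.log(`rate=${red.ratePerSecond} req/s, errors=${red.errorPercent.toFixed(1)}%`);
```

Duration, the third letter, does not reduce to a counter delta without collapsing into an average, which is why it is usually recorded as a histogram.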
Why averages lie (use percentiles instead): If 95 requests take 100 ms and 5 take 5,000 ms, the average is 345 ms, a number that represents no actual user's experience. The p50 (median) is 100 ms. The p95 sits right on the boundary: exactly 95% of requests are fast, so it still reads 100 ms, and a single extra slow request would push it to 5,000 ms. The p99 is 5,000 ms. Your SLA should be on p99, not the average, because that is the promise to your worst-served users.
Mental model: p99 latency = "What is the slowest one-in-a-hundred request doing?" If you serve a thousand requests per second, that's 10 requests every second having that experience.
Common mistake: Alerting on average latency. You will miss real problems that are hurting a meaningful slice of your users.
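To make the arithmetic concrete, here is a small sketch that computes the mean and a few percentiles for the 95-fast, 5-slow example. It assumes the nearest-rank definition of a percentile; other definitions interpolate and can give slightly different values. With this exact data set, nearest-rank p95 still lands on 100 ms because precisely 95% of the samples are fast, and the slow group shows up from p96 upward.

```javascript
// The example data: 95 requests at 100 ms and 5 requests at 5,000 ms.
const latenciesMs = [...Array(95).fill(100), ...Array(5).fill(5000)];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
};
console.log(summary);
```

The mean (345 ms) describes nobody, the median hides the slow group entirely, and only the high percentiles expose it.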
// Instrumenting metrics with OpenTelemetry (Node.js)
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';

// A MeterProvider needs a metric reader/exporter attached before anything
// reaches a backend; that wiring is omitted here.
// Newer 2.x SDK releases build the resource with resourceFromAttributes({...})
// instead of new Resource({...}).
const meterProvider = new MeterProvider({
  resource: new Resource({ 'service.name': 'checkout-service' }),
});
const meter = meterProvider.getMeter('checkout-service');

// Counter: total number of checkout attempts
const checkoutCounter = meter.createCounter('checkout.attempts', {
  description: 'Total number of checkout requests',
});

// Counter: failed checkouts only, so a failure is not counted as a second attempt
const checkoutFailures = meter.createCounter('checkout.failures', {
  description: 'Checkout requests that ended in an error',
});

// Histogram: stores durations in buckets; the backend estimates p50/p95/p99 from them
const checkoutLatency = meter.createHistogram('checkout.duration_ms', {
  description: 'Checkout request duration in milliseconds',
  unit: 'ms',
  // Bucket boundaries are passed as advice (recent @opentelemetry/api versions)
  advice: { explicitBucketBoundaries: [10, 50, 100, 250, 500, 1000, 2500, 5000] },
});

// In your route handler (processPayment is your own payment function):
export async function handleCheckout(req, res) {
  const start = Date.now();
  checkoutCounter.add(1, { 'checkout.currency': req.body.currency });
  try {
    await processPayment(req.body);
    res.json({ ok: true });
  } catch (err) {
    checkoutFailures.add(1, { 'checkout.currency': req.body.currency });
    res.status(500).json({ error: err.message });
  } finally {
    // One recording per request feeds every percentile the backend derives
    checkoutLatency.record(Date.now() - start, {
      'checkout.route': '/checkout',
    });
  }
}
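A histogram does not store individual durations, only a count per bucket, so any percentile read from it is an estimate. The sketch below shows one common way a backend can estimate a quantile: find the bucket containing the target rank and interpolate linearly inside it. The function names are hypothetical and the interpolation rule is an assumption for illustration, not a statement of what any particular backend does.

```javascript
// Bucket upper bounds in milliseconds, matching the histogram example.
const boundaries = [10, 50, 100, 250, 500, 1000, 2500, 5000];

// One count per bucket, plus a final overflow bucket for values above 5000.
function bucketize(values) {
  const counts = new Array(boundaries.length + 1).fill(0);
  for (const v of values) {
    let i = boundaries.findIndex((upper) => v <= upper);
    if (i === -1) i = boundaries.length;
    counts[i] += 1;
  }
  return counts;
}

// Estimate a quantile (0 < q <= 1) by linear interpolation inside the
// bucket that contains the target rank.
function estimateQuantile(counts, q) {
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) return NaN;
  const target = q * total;
  let cumulative = 0;
  for (let i = 0; i < counts.length; i++) {
    const previous = cumulative;
    cumulative += counts[i];
    if (cumulative >= target) {
      // The overflow bucket has no upper bound, so report the last boundary.
      if (i === boundaries.length) return boundaries[boundaries.length - 1];
      const lower = i === 0 ? 0 : boundaries[i - 1];
      const upper = boundaries[i];
      return lower + (upper - lower) * ((target - previous) / counts[i]);
    }
  }
  return boundaries[boundaries.length - 1];
}

// 95 requests at 100 ms and 5 requests at 5,000 ms.
const samples = [...Array(95).fill(100), ...Array(5).fill(5000)];
const counts = bucketize(samples);
const p50 = estimateQuantile(counts, 0.5);
const p99 = estimateQuantile(counts, 0.99);
console.log({ p50: Math.round(p50), p99: Math.round(p99) });
```

Here the estimated p50 comes out near 76 ms and the p99 near 4,500 ms, although the true values are 100 ms and 5,000 ms. Bucket boundaries set the resolution, so it helps to place them around the latencies you care about.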
📊 Dashboards & The Four Golden Signals
Google's Site Reliability Engineering book names four golden signals for any user-facing system: if you can only measure four things, measure these. When all four look healthy, the service is usually healthy from the user's point of view. Put them on one dashboard and check it first when something goes wrong.
- Latency — how long requests take (show p50, p95, p99 — never just the average)
- Traffic — requests per second (how much demand is on the system)
- Errors — rate of failed requests (5xx, timeouts, explicit failures)
- Saturation — how "full" your service is (CPU %, memory %, queue depth)
Mental model: These four signals are the vital signs of a hospital patient. Heart rate (traffic), blood pressure (saturation), pain level (latency), and temperature (errors). A doctor checks these before doing anything else.
Why dashboards matter: Dashboards let you see trends, not just snapshots. A p99 of 800 ms isn't alarming by itself — but a p99 that climbs from 200 ms to 800 ms over 30 minutes tells you something is slowly degrading, and you can catch it before it becomes an outage.
Common mistake: Making dashboards for yourself (developer-centric) instead of for the system (user-centric). Your dashboard should answer "Is the user experience healthy?" first.
┌──────────────────────────────────────────────────────────────┐
│ checkout-service · last 1 hour · prod                        │
├────────────────┬──────────────┬──────────────┬───────────────┤
│ TRAFFIC        │ LATENCY      │ ERRORS       │ SATURATION    │
│                │              │              │               │
│ 840 req/s      │ p50: 92ms    │ 0.3%         │ CPU: 41%      │
│ ▁▂▃▄▄▅▆▆▅▅▄    │ p95: 310ms   │ ▁▁▁▁▁▁▁▁▁▃   │ MEM: 58%      │
│                │ p99: 890ms ⚠│              │ QUEUE: 12     │
│ Normal range   │ ← trending   │ Acceptable   │ Healthy       │
│                │   upward     │              │               │
├────────────────┴──────────────┴──────────────┴───────────────┤
│ ⚠ p99 latency has increased 3× in the last 20 minutes.      │
│   Investigate: checkout-service → payment-service traces.    │
└──────────────────────────────────────────────────────────────┘
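The warning at the bottom of that dashboard is a trend check, not a threshold check. A minimal sketch of such a check follows, assuming you already have one p99 reading per time window; the series values, the function name, and the 3× warning ratio are all made up for illustration.

```javascript
// Hypothetical p99 readings, one per 5-minute window, oldest first.
const p99Series = [
  { minute: 0, p99Ms: 290 },
  { minute: 5, p99Ms: 300 },
  { minute: 10, p99Ms: 420 },
  { minute: 15, p99Ms: 610 },
  { minute: 20, p99Ms: 890 },
];

// Compare the newest reading with the oldest reading inside the lookback window.
function latencyTrend(series, lookbackMinutes, warnRatio) {
  const latest = series[series.length - 1];
  const cutoff = latest.minute - lookbackMinutes;
  const baseline = series.find((s) => s.minute >= cutoff);
  const ratio = latest.p99Ms / baseline.p99Ms;
  return { ratio, degrading: ratio >= warnRatio };
}

const trend = latencyTrend(p99Series, 20, 3);
console.log(`p99 changed ${trend.ratio.toFixed(2)}x, degrading=${trend.degrading}`);
```

A fixed threshold of, say, 1,000 ms would stay silent on this series, while the ratio check flags it well before the limit is reached.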
🔭 Distributed Tracing: Following One Request Across Everything
When your app is more than one thing — a front end calling an API calling a database calling a third-party service — logs alone cannot tell you where time was spent. A request might touch four services in 400 ms and you have no idea which one ate 350 of those milliseconds.
Distributed tracing solves this by assigning every incoming request a trace ID (a unique string, generated once, at the edge) and passing it along to every service the request touches. Each unit of work along the way (a function call, a database query, an outbound request) is recorded as a span with its start time, duration, service name, and operation. At the end, you reconstruct the full journey as a waterfall.
Anatomy of a trace:
- Trace ID — one ID for the entire request, shared across all services
- Span — one unit of work (a function call, a DB query, an HTTP request to another service)
- Parent span — a span that contains child spans; the root span is the original HTTP request
- Context propagation — passing the trace ID and span ID in HTTP headers (traceparent) so downstream services can attach their spans to the same trace
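To show what context propagation carries on the wire, here is a small sketch that parses a W3C traceparent header and builds the one a service would send downstream. It handles only the version 00 layout (version, trace ID, parent span ID, flags), and the function names and span IDs are example values chosen for this sketch. In practice the OpenTelemetry propagator does this for you.

```javascript
// W3C Trace Context header: version-traceId-parentSpanId-flags, lowercase hex.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version ff and all-zero IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  // The lowest flag bit marks the trace as sampled.
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the outgoing header: same trace ID, this service's span as the parent.
function childTraceparent(incoming, ownSpanId) {
  const flags = incoming.sampled ? '01' : '00';
  return `00-${incoming.traceId}-${ownSpanId}-${flags}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoing = childTraceparent(incoming, 'b7ad6b7169203331');
console.log(incoming.traceId, outgoing);
```

The trace ID never changes as the request moves between services; only the parent span ID is replaced at each hop, which is what lets the backend rebuild the parent-child tree.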
Mental model: Think of a trace like a Gantt chart for one user's request. The horizontal axis is time. Each row is a different service or operation. You can instantly see which row is the widest — that's your bottleneck.
Why it's essential once you have more than one moving part: Without traces, debugging a slow request in a multi-service system means adding temporary logs everywhere, redeploying, and praying the slow request happens again. With traces, you have the waterfall automatically for every request.
Common mistake: Only tracing your own code and not the database calls, external HTTP calls, and queue operations that your code triggers. The bottleneck is almost always at those boundaries.
// OpenTelemetry tracing: wrap a critical operation in a span
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { Resource } from '@opentelemetry/resources';

const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'checkout-service' }),
});
// register() installs this provider, a context manager, and the W3C propagator
// globally. It does NOT auto-instrument libraries: http, express, pg, etc. need
// their instrumentation packages registered separately (for example via
// @opentelemetry/auto-instrumentations-node), and the provider needs a span
// processor with an exporter before spans show up anywhere.
provider.register();

const tracer = trace.getTracer('checkout-service');

export async function processPayment(orderData) {
  // startActiveSpan makes this span the parent of anything started inside the
  // callback, so it becomes a row in the trace waterfall with children under it
  return tracer.startActiveSpan(
    'payment.validate',
    {
      attributes: {
        'payment.currency': orderData.currency,
        'payment.amount_cents': orderData.amountCents,
      },
    },
    async (span) => {
      try {
        // paymentSDK stands in for your payment provider's client
        const result = await paymentSDK.validate(orderData);
        // Mark the span with the outcome
        span.setStatus({ code: SpanStatusCode.OK });
        span.setAttribute('payment.provider_response_code', result.code);
        return result;
      } catch (err) {
        // Mark it failed; this lights up red in your trace viewer
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
        span.recordException(err);
        throw err;
      } finally {
        span.end(); // always end spans, even on error
      }
    },
  );
}

// With HTTP client instrumentation enabled, the trace ID travels in the W3C
// traceparent header. Any instrumented service that receives the request
// attaches its own spans to the same trace, with no manual plumbing.
TRACE: 4bf92f3577b34da6a3ce929d0e0e4736  (one user's checkout at 14:32:11)
────────────────────────────────────────────────────────────────────────────
SERVICE              OPERATION                      DURATION   STATUS
────────────────────────────────────────────────────────────────────────────
api-gateway          POST /checkout                    412ms   ✓ OK
└─ checkout-svc      checkout.handle                   398ms   ✓ OK
   ├─ auth-svc       token.verify                       11ms   ✓ OK
   ├─ checkout       db.query (cart)                    18ms   ✓ OK
   └─ payment        payment.validate         ███      357ms   ✓ OK  ← BOTTLENECK
      └─             stripe.charges.create    ███      341ms   ✓ OK
────────────────────────────────────────────────────────────────────────────
Without this trace you would know "checkout is slow."
With this trace you know "Stripe is slow today, and the delay is outside our code."
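Reading a waterfall by eye works for six spans. For larger traces, one approach is to rank spans by self time, meaning a span's own duration minus the time covered by its direct children. The sketch below applies that idea to the spans in the waterfall; the flat span shape and the IDs are assumptions for this example, and it assumes children run one after another instead of in parallel.

```javascript
// Spans from the waterfall, flattened. parentId links each span to its parent.
const spans = [
  { id: 'a', parentId: null, name: 'POST /checkout', durationMs: 412 },
  { id: 'b', parentId: 'a', name: 'checkout.handle', durationMs: 398 },
  { id: 'c', parentId: 'b', name: 'token.verify', durationMs: 11 },
  { id: 'd', parentId: 'b', name: 'db.query (cart)', durationMs: 18 },
  { id: 'e', parentId: 'b', name: 'payment.validate', durationMs: 357 },
  { id: 'f', parentId: 'e', name: 'stripe.charges.create', durationMs: 341 },
];

// Self time = duration minus the summed duration of direct children.
// Overlapping (parallel) children would need interval arithmetic instead.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const s of allSpans) {
    if (s.parentId !== null) {
      childTotal.set(s.parentId, (childTotal.get(s.parentId) ?? 0) + s.durationMs);
    }
  }
  return allSpans.map((s) => ({
    name: s.name,
    selfMs: Math.max(0, s.durationMs - (childTotal.get(s.id) ?? 0)),
  }));
}

const ranked = selfTimes(spans).sort((x, y) => y.selfMs - x.selfMs);
console.log(ranked[0]);
```

Ranking by total duration would point at the root span every time. Self time points at the span that actually spent the time, here the call to Stripe.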
🌐 OpenTelemetry: The Standard You Should Use
OpenTelemetry (OTel) is the vendor-neutral, CNCF-hosted standard for instrumenting your code. It gives you one set of APIs and SDKs that emit logs, metrics, and traces in a standard format (OTLP) that most backends can ingest: Grafana, Datadog, Honeycomb, New Relic, or your own self-hosted stack. You instrument once, and you're not locked in.
Key pieces:
- SDK — the library you add to your app (@opentelemetry/sdk-node for Node.js)
- Auto-instrumentation — zero-code patches for express, fetch, http, pg, redis, etc.
- Exporter — ships data to your chosen backend (OTLP, Jaeger, Prometheus)
- Collector — optional sidecar that buffers, filters, and routes telemetry (good for prod)
Mental model: OTel is like USB-C for observability. Your app is the laptop; the telemetry backend is the charger. One standard connector, any backend.
Common mistake: Installing OTel but only enabling traces, not metrics, or vice versa. The three pillars work best together — instrument all three from the start.
🛠️ Your Mission
Pick any deployed project you built in this track — the one from the Deployment or CI/CD lessons works perfectly.
- Install @opentelemetry/sdk-node and the relevant auto-instrumentation packages for your framework.
- Add one custom latency histogram for your most important user-facing operation (the main API call, the form submit, the data fetch; whatever matters most to a real user).
- Add one counter that increments on every error response (4xx or 5xx), with an attribute for the status code.
- Ensure every incoming request gets a trace ID (OTel auto-instrumentation handles this) and log that trace ID to the console alongside your existing logs so you can correlate them.
- Make one intentionally slow request (add a setTimeout in dev, or just load-test it) and find it in your trace output. Read the waterfall. Identify the widest span.
- Build a simple text dashboard, even just a terminal console.log that prints your four golden signals every 30 seconds, using the recorded metrics.
Your goal: after this mission, when a user reports something is slow, you should be able to open your traces, find their request by trace ID, and point at the exact span that was the bottleneck — in under two minutes.
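For the log-correlation step, a minimal sketch of a structured log line that carries the trace ID might look like this. The function name and field names are assumptions for illustration. In an instrumented app the trace ID would usually come from the active span's context, not from a hard-coded string.

```javascript
// Emit one JSON log line that carries the trace ID, so a log search and a
// trace lookup share the same key.
function formatLog(level, message, traceId, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    trace_id: traceId,
    ...fields,
  });
}

const line = formatLog(
  'error',
  'payment timeout',
  '4bf92f3577b34da6a3ce929d0e0e4736',
  { route: '/checkout', status: 504 },
);
console.log(line);
```

With the ID in every line, a user report becomes a two-step lookup: search the logs for the failing request, copy its trace_id, and open that trace.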
✅ You're Done When…
- Your app exports all three pillars (logs with trace IDs, a latency histogram, an error counter) — confirmed against the Production-Readiness Checklist
- You have a p99 latency metric, not just an average, and you can explain why the difference matters
- You can look at a trace waterfall and identify the slowest span without help
- Your four golden signals (latency, traffic, errors, saturation) are visible in one place — a dashboard, a log summary, or a terminal printout
- You've added service.name as an OTel resource attribute so every span and metric is tagged with which service it came from
➡️ Next: Error Tracking & Alerting. Build It Right, Or Don't Build It At All. 🏛️