Error Tracking & Alerting

Find out before your users do — error tracking and alerts that actually help.

Stage 3 · DevOps, Deployment & Operations · B.U.I.L.D. letter: D

You shipped. Users are clicking. And somewhere in that beautiful deployment, something just threw a TypeError: Cannot read properties of undefined (reading 'map'), and nobody told you. Except that one user who left a two-star review saying "it just breaks." Error tracking and alerting are the pair of superpowers that flip you from reactive to proactive: you know about the crash before any user reports it, you know exactly which line caused it, and your phone only rings when something truly matters.


⚠️ The vibe trap

When you're vibe-coding, you test the happy path, it works, you deploy, you celebrate. But production is adversarial — real users have bad connections, unexpected inputs, browser extensions, and edge cases your localhost never imagined. Without an error tracker, you find out about crashes from a frustrated Slack message or a one-star review three days later, armed with nothing but "it doesn't work." Worse, if you wire up naive alerting — "alert me on every 500 error" — you'll get paged 200 times a day and start ignoring all of it. Both failure modes are fixable.


🕵️ Section 1: What an Error Tracker Actually Gives You

Searching logs for errors feels productive until you realize a single bug can generate thousands of identical log lines. An error tracker like Sentry, Bugsnag, or Rollbar does something smarter: it captures the full exception, dedupes it, and gives you one grouped issue with everything you need to fix it.

What you get that logs don't provide:

  • Stack trace pointing to the exact file and line number
  • Breadcrumbs — the sequence of events (clicks, network requests, state changes) leading up to the crash
  • User context — which user was affected, how many users hit this, is it getting worse?
  • Release association — did this error first appear with your last deploy?
  • Environment tag — is this production, staging, or a specific region?

Mental model: Think of your logs as a river of water flowing past. An error tracker is a net — it catches exceptions, groups identical ones into a single bucket, counts how full the bucket is, and hands you the bucket labeled with everything you need. You deal with buckets, not with a river.

Why it matters: When an error tracker says "this issue affects 47 users and started 2 hours ago," you have a business decision to make, not a debugging expedition. That context is what turns "fix the fire eventually" into "roll back the deploy right now."

Common mistake: Installing the SDK and calling it done. You also need to set the release version at initialization and attach user identity once a session is verified; otherwise your issues are anonymous and unanchored to any deploy.

// Wiring Sentry into a Node/Express backend — do this once at app startup
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,          // never hardcode; comes from env
  environment: process.env.NODE_ENV,    // "production" vs "staging"
  release: process.env.GIT_SHA,         // tie errors to the exact commit
  tracesSampleRate: 0.2,                // capture 20% of transactions for perf
});

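// Attach user identity as soon as you know who's logged in
// (call this inside your auth middleware, after verifying the session).
// Relies on the SDK isolating scope per request so identities don't leak
// between concurrent requests (in SDK v7, register Sentry.Handlers.requestHandler() first).
export function identifyUserForErrorTracking(userId, email) {
  Sentry.setUser({ id: userId, email });
}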

// Add a breadcrumb manually for any important user action
export function trackAction(message, data = {}) {
  Sentry.addBreadcrumb({ message, data, level: "info" });
}

// Capture a caught error with extra context; don't swallow it silently
export function captureError(err, context = {}) {
  Sentry.withScope((scope) => {
    scope.setExtras(context);
    Sentry.captureException(err);
  });
}

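// Express error handler: must be registered AFTER all routes (`app` is your Express app)
// This catches unhandled errors from route handlers
// SDK v7 API shown; v8 and later use Sentry.setupExpressErrorHandler(app) instead
app.use(Sentry.Handlers.errorHandler());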

🔔 Section 2: Grouping & Deduplication — One Bug, One Alert

Without deduplication, a single bad deploy can generate 10,000 individual error events in an hour, each triggering a separate notification. That's not a bug report — that's a denial-of-service attack on your attention.

Good error trackers group errors by their fingerprint: the exception type plus the normalized stack trace (stripping out line numbers that vary between deploys). Every new occurrence of the same bug increments the count on one existing issue rather than creating a new one.
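
To make the grouping idea concrete, here is a minimal sketch of fingerprint-based bucketing in plain JavaScript. The event shape (`type`, `frames`, `userId`) and the helper names `fingerprintOf` and `groupEvents` are hypothetical, and real trackers use far more elaborate grouping rules; this only illustrates "one bug, one bucket."

```javascript
// Hypothetical event shape: { type, frames: [{ file, fn, line }], userId }
// Build a fingerprint from the exception type plus file/function names,
// deliberately ignoring line numbers, which shift between deploys.
function fingerprintOf(event) {
  const frames = event.frames.map((f) => `${f.file}:${f.fn}`);
  return [event.type, ...frames].join("|");
}

// Fold a stream of events into one issue per fingerprint,
// counting occurrences and distinct affected users.
function groupEvents(events) {
  const issues = new Map();
  for (const event of events) {
    const key = fingerprintOf(event);
    if (!issues.has(key)) {
      issues.set(key, { fingerprint: key, count: 0, users: new Set() });
    }
    const issue = issues.get(key);
    issue.count += 1;
    if (event.userId) issue.users.add(event.userId);
  }
  return [...issues.values()];
}

// Made-up sample data: the same bug at two different line numbers, plus one other bug.
const events = [
  { type: "TypeError", frames: [{ file: "cart.js", fn: "renderItems", line: 41 }], userId: "u1" },
  { type: "TypeError", frames: [{ file: "cart.js", fn: "renderItems", line: 44 }], userId: "u2" },
  { type: "TypeError", frames: [{ file: "cart.js", fn: "renderItems", line: 44 }], userId: "u2" },
  { type: "RangeError", frames: [{ file: "pager.js", fn: "slice", line: 9 }], userId: "u1" },
];

const issues = groupEvents(events);
for (const issue of issues) {
  console.log(`${issue.fingerprint}: ${issue.count} events, ${issue.users.size} users`);
}
```

Four events collapse into two issues here, and the first one carries the tally that drives triage: how many times it happened and how many distinct users it hit.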

Mental model: Imagine a bulletin board. Without grouping, every crash is a new sticky note plastered on top of others until the board is unreadable. With grouping, each unique bug is one sticky note with a tally in the corner. You see ten sticky notes, not ten thousand.

Why it matters: Grouping lets you triage by impact — "this error hit 500 users in the last hour" is immediately more actionable than scrolling through 500 individual error emails.

Common mistake: Trusting default fingerprinting blindly for errors with highly variable messages (like database errors that include a query string in the message). Those will never group correctly. Override fingerprinting for known noisy error classes.

// Override Sentry's fingerprint for a known-noisy error class
// so all DB connection errors group together regardless of query details
export function captureDbError(err) {
  Sentry.withScope((scope) => {
    // Force every error passed to this helper into one group
    scope.setFingerprint(["database-connection-error"]);
    scope.setTag("layer", "database");
    Sentry.captureException(err);
  });
}

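// On the frontend: capturing an unhandled promise rejection with context
// (Sentry's browser SDK reports unhandled rejections by default; add a manual
// listener like this only if that default is disabled, or you'll get duplicates)
window.addEventListener("unhandledrejection", (event) => {
  Sentry.withScope((scope) => {
    scope.setExtra("promiseReason", event.reason);
    scope.setTag("unhandled", "promise");
    // Keep the original Error (and its stack trace) when there is one
    const error = event.reason instanceof Error
      ? event.reason
      : new Error(`Unhandled rejection: ${String(event.reason)}`);
    Sentry.captureException(error);
  });
});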
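
For errors whose messages vary per occurrence, one option is to normalize the message before it becomes part of the fingerprint. The sketch below is only an illustration: `normalizeMessage` and `fingerprintFor` are hypothetical helpers, the sample messages are made up, and the regular expressions cover just a few common cases (quoted literals, UUIDs, bare numbers). The resulting array could be handed to `scope.setFingerprint(...)` the same way `captureDbError` does above.

```javascript
// Replace the parts of a message that change per occurrence with placeholders.
// Order matters: UUIDs are replaced before bare numbers so their digits survive intact.
function normalizeMessage(message) {
  return message
    .replace(/'[^']*'|"[^"]*"/g, "<str>") // quoted literals, e.g. query values
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, "<uuid>")
    .replace(/\b\d+\b/g, "<num>"); // ids, ports, row counts
}

// A stable fingerprint: error class name plus the normalized message.
function fingerprintFor(err) {
  return [err.name, normalizeMessage(err.message)];
}

const a = new Error('duplicate key value "user_42" violates constraint on row 1881');
const b = new Error('duplicate key value "user_97" violates constraint on row 12');

console.log(fingerprintFor(a));
console.log(fingerprintFor(a).join("|") === fingerprintFor(b).join("|")); // true
```

Both sample errors end up with the same fingerprint even though their raw messages differ, so they would land in one issue instead of two.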

📊 Section 3: Alert on Symptoms, Not on Signals

Here is the single most important alerting principle: alert on what users feel, not on what your infrastructure measures.

A user feels:

  • The page returning an error (5xx rate spike)
  • The page taking forever to load (p95 latency spike)
  • Not being able to complete an action (checkout failure rate spike)
  • The service being completely unreachable (uptime drop)

A user does NOT feel:

  • "CPU is at 72%"
  • "Memory usage increased 15%"
  • "Build step 4 of 9 took 3 seconds longer than usual"

Infrastructure metrics matter for diagnosis AFTER an alert fires. They are terrible alert triggers because they are noisy, often harmless, and poorly correlated with actual user pain.

Mental model: You are an ER doctor. You triage on symptoms — "patient is in pain, can't breathe" — not on lab values that are slightly off normal. Lab values help you diagnose after you know there's a problem.

Why it matters: Every unnecessary alert trains your team to ignore alerts. Alert fatigue is how PagerDuty becomes white noise and a real outage goes unacknowledged for forty minutes.

Common mistake: Setting alerts on every error-level log line or every HTTP 4xx. 4xx errors are mostly user mistakes (wrong password, missing resource). 5xx errors are YOUR mistakes. Alert on your mistakes.

# Example alert rules — written in pseudo-DSL that maps to any alerting tool
# (Sentry, Datadog, PagerDuty, Grafana Alertmanager, etc.)

## GOOD ALERTS — symptom-based, actionable

ALERT "High 5xx Error Rate"
  WHEN: http_errors{status=~"5.."} / http_requests > 0.02  # >2% of requests failing
  FOR: 5 minutes                                            # sustained, not a blip
  SEVERITY: critical
  NOTIFY: on-call engineer (PagerDuty)
  RUNBOOK: https://wiki.internal/runbooks/high-5xx

ALERT "Checkout Failure Spike"
  WHEN: checkout_errors / checkout_attempts > 0.05          # >5% of checkouts failing
  FOR: 3 minutes
  SEVERITY: critical
  NOTIFY: on-call engineer + engineering manager

ALERT "API Latency Degraded"
  WHEN: http_request_duration_p95 > 3000ms
  FOR: 10 minutes
  SEVERITY: warning
  NOTIFY: on-call engineer (email only, not page)

ALERT "New High-Impact Error"
  WHEN: Sentry issue is_new=true AND affected_users > 10
  FOR: immediately
  SEVERITY: warning
  NOTIFY: #eng-alerts Slack channel

## NOISY ALERTS — do NOT put these on-call

# BAD: spikes on every 404 (users mistype URLs constantly)
ALERT "Any 4xx Error"  # <-- delete this

# BAD: CPU is almost never the actual problem users feel
ALERT "CPU > 70%"      # <-- delete this

# BAD: one error could be a test, not an outage
ALERT "Any single error event"  # <-- delete this
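
The WHEN / FOR pairing in the good alerts is what separates a sustained problem from a blip. Here is a rough sketch of that evaluation in JavaScript, assuming you already have per-minute request counts from somewhere. The sample shape (`errors5xx`, `requests`) and the `shouldFire` helper are hypothetical; real alerting tools do this for you.

```javascript
// samples: per-minute buckets, oldest first, e.g. { errors5xx: 3, requests: 200 }
// Fires only if the error ratio exceeded the threshold in EVERY one of the
// last `forMinutes` buckets. A minute with no traffic does not count as a breach.
function shouldFire(samples, { threshold, forMinutes }) {
  if (samples.length < forMinutes) return false; // not enough history yet
  const recent = samples.slice(-forMinutes);
  return recent.every((s) => s.requests > 0 && s.errors5xx / s.requests > threshold);
}

// Mirrors the "High 5xx Error Rate" rule: >2% failing, sustained for 5 minutes.
const rule = { threshold: 0.02, forMinutes: 5 };

// Made-up data: one bad minute versus five bad minutes in a row.
const blip = [1, 1, 30, 1, 1].map((e) => ({ errors5xx: e, requests: 100 }));
const outage = [5, 6, 8, 7, 9].map((e) => ({ errors5xx: e, requests: 100 }));

console.log(shouldFire(blip, rule));   // false
console.log(shouldFire(outage, rule)); // true
```

The single 30% minute never pages anyone, while five consecutive minutes above 2% does. That is the trade-off the FOR clause buys you: slightly slower detection in exchange for far fewer false alarms.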

🔥 Section 4: Alert Fatigue and Threshold Tuning

Alert fatigue is when your team receives so many alerts that they stop responding to them. It is a well-documented contributor to major outages: not because nobody was watching, but because the real alert looked identical to the fifty false alarms that came before it.

Symptoms of alert fatigue in your team:

  • On-call rotation dreaded instead of respected
  • Alerts acknowledged immediately and closed without investigation
  • "We just ignore that one, it always fires"
  • Real incidents discovered by users, not by alerts

The fix is ruthless tuning: every alert that fires and requires no action is a broken alert. Delete it or raise the threshold until it only fires when action is required.

Mental model: Your alerts are a contract with your future self at 2am. If the alert says "wake up," your half-asleep self needs to be able to open a laptop, read the alert, and immediately know what to check. If that's not true, the alert is broken.

Why it matters: An organization that ignores 80% of its alerts has no effective alerting at all. Fewer, higher-quality alerts are strictly better than comprehensive noisy ones.

Common mistake: Adding more alerts to compensate for missing coverage instead of fixing the alerts that are already broken. More broken alerts compound the problem.

# Alert quality checklist — run this against every alert you own

For each alert, ask:
[ ] Does this alert ONLY fire when a human needs to take action?
[ ] Is the action required clear from the alert body alone?
[ ] Does the alert include a runbook link or triage steps?
[ ] Has this alert fired in the last 30 days?
    - If YES and it required action: keep it
    - If YES and it required NO action: raise threshold or delete
    - If NO: check if the condition is even reachable; maybe delete
[ ] Is the severity accurate?
    - CRITICAL = wake someone up at 2am, revenue/users affected NOW
    - WARNING  = investigate during business hours
    - INFO     = log it, do not page

# Good alert body template (paste into PagerDuty / Grafana / Sentry)

Alert:    Checkout Failure Rate > 5%
Value:    7.2% of checkouts failing (last 5 min)
Service:  payments-api  |  Region: us-east-1
Started:  2026-06-03T02:14Z
Runbook:  https://wiki.internal/runbooks/checkout-failures
Dashboard: https://grafana.internal/d/payments
Recent deploys: v2.4.1 deployed 2026-06-03T01:55Z  ← check this first
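
If you can export how often each alert fired over the last 30 days and how often someone actually had to act, the "has this alert fired" part of the checklist can be run as a quick script. This is a sketch under assumptions: the record shape (`name`, `fired`, `actioned`), the `auditAlert` helper, and the sample numbers are all made up for illustration. The "tune" verdict for mixed histories is this sketch's own addition, not part of the checklist.

```javascript
// fired: times the alert fired in the review window
// actioned: how many of those firings required a human to do something
function auditAlert({ name, fired, actioned }) {
  if (fired === 0) {
    return { name, verdict: "review", reason: "never fired; check the condition is reachable" };
  }
  if (actioned === 0) {
    return { name, verdict: "delete-or-raise-threshold", reason: "fired but never needed action" };
  }
  if (actioned === fired) {
    return { name, verdict: "keep", reason: "every firing needed action" };
  }
  // Mixed history is not covered by the checklist; "tune" is an assumption of this sketch.
  return { name, verdict: "tune", reason: `${fired - actioned} of ${fired} firings needed no action` };
}

// Made-up 30-day history
const history = [
  { name: "High 5xx Error Rate", fired: 2, actioned: 2 },
  { name: "CPU > 70%", fired: 41, actioned: 0 },
  { name: "Disk full", fired: 0, actioned: 0 },
  { name: "API Latency Degraded", fired: 6, actioned: 4 },
];

for (const alert of history) {
  const result = auditAlert(alert);
  console.log(`${result.name}: ${result.verdict} (${result.reason})`);
}
```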

🚀 Section 5: Release Health — Did This Deploy Break Anything?

The most actionable moment in error tracking is right after a deploy. If your error rate doubles within ten minutes of a release, the answer is almost always "roll back now, investigate later."

Release health tracking ties every error event to the git SHA or version tag that was running when it happened. Modern error trackers surface this automatically: "this error first appeared in v2.4.1, which was deployed 12 minutes ago, and it now affects 34 users."

Mental model: Your release is a hypothesis — "this code is better than what it replaced." Release health is the A/B test result arriving in real time. A spike in errors is a failed hypothesis. Reject the hypothesis fast.

Why it matters: Every minute a broken deploy stays live is more users hitting the bug and more data corruption risk. A 3-minute rollback beats a 40-minute debugging session while users are suffering.

Common mistake: Waiting for error rate to "stabilize" after a deploy before checking release health. Check immediately — the first five minutes after a deploy are the highest-signal window you have.

// Pairing release tracking with a deploy script
// This assumes you're deploying via a simple shell command or CI step
// The key: always inject GIT_SHA as an env var at build/deploy time

// In your CI pipeline (GitHub Actions example; runs after deploy succeeds):
// env:
//   SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
//   SENTRY_ORG: your-org-slug          // sentry-cli needs org and project,
//   SENTRY_PROJECT: your-project-slug  // via env vars, flags, or .sentryclirc
//   GIT_SHA: ${{ github.sha }}

// Notify Sentry that a new release is live (run this in CI after deploy):
// npx @sentry/cli releases new $GIT_SHA
// npx @sentry/cli releases set-commits $GIT_SHA --auto
// npx @sentry/cli releases finalize $GIT_SHA
// npx @sentry/cli releases deploys $GIT_SHA new -e production

// In application code — Sentry.init already picks this up if GIT_SHA is set:
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.GIT_SHA,   // set by CI at build time
  environment: "production",
});

// Now in Sentry's UI you'll see:
// "First seen: v2.4.1 — 14 minutes ago — 34 users affected"
// That sentence is your rollback trigger.

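// Simple release health check you can run manually after any deploy.
// ASSUMPTION: the response exposes a top-level `crashFreeUsers` fraction (0 to 1).
// Verify the field name and scale against Sentry's API docs before relying on
// this; release health data also requires session tracking to be enabled.
async function checkReleaseHealth(sentryOrgSlug, projectSlug, release) {
  const resp = await fetch(
    `https://sentry.io/api/0/projects/${sentryOrgSlug}/${projectSlug}/releases/${encodeURIComponent(release)}/`,
    { headers: { Authorization: `Bearer ${process.env.SENTRY_AUTH_TOKEN}` } }
  );
  if (!resp.ok) {
    throw new Error(`Sentry API request failed: HTTP ${resp.status}`);
  }
  const data = await resp.json();
  const crashFreeRate = data.crashFreeUsers;
  if (typeof crashFreeRate !== "number") {
    // Missing data means "unknown", not "healthy": don't let the check pass silently
    console.warn("No crash-free data for this release yet; cannot assess health.");
    return;
  }
  if (crashFreeRate < 0.97) {
    console.error(`RELEASE HEALTH DEGRADED: ${(crashFreeRate * 100).toFixed(1)}% crash-free. Consider rollback.`);
    process.exit(1); // fail CI health-check step → triggers rollback workflow
  }
  console.log(`Release healthy: ${(crashFreeRate * 100).toFixed(1)}% crash-free users.`);
}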
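
The "error rate doubles after a release" rule of thumb can also be written down as a small decision function, which is handy if your metrics live somewhere other than Sentry. This is a sketch, not a policy: the `rollbackDecision` helper, the `minRequests` guard, and the "investigate" outcome for a zero-error baseline are all assumptions layered on top of the doubling rule.

```javascript
// baseline / current: { errors, requests } for the window before and after the deploy.
// ratio: how much worse the error rate must get to call for a rollback (2 = doubled).
// minRequests: below this much post-deploy traffic, the rate is too noisy to judge.
function rollbackDecision(baseline, current, { ratio = 2, minRequests = 100 } = {}) {
  if (current.requests < minRequests) return "wait";
  const baseRate = baseline.requests > 0 ? baseline.errors / baseline.requests : 0;
  const currentRate = current.errors / current.requests;
  if (baseRate === 0) {
    // No baseline errors to compare against: any new errors deserve a look.
    return currentRate > 0 ? "investigate" : "healthy";
  }
  return currentRate >= baseRate * ratio ? "rollback" : "healthy";
}

// Made-up numbers: 1% errors before the deploy
const before = { errors: 10, requests: 1000 };

console.log(rollbackDecision(before, { errors: 25, requests: 1000 })); // "rollback"
console.log(rollbackDecision(before, { errors: 12, requests: 1000 })); // "healthy"
console.log(rollbackDecision(before, { errors: 5, requests: 20 }));    // "wait"
```

The traffic guard matters right after a deploy: five errors in twenty requests looks alarming, but it is too small a sample to act on by itself.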

🛠️ Your Mission

Wire an error tracker into your app and configure one meaningful alert.

  1. Install and initialize Sentry (or equivalent) in your backend and/or frontend. Use process.env.SENTRY_DSN — never commit the DSN to source control.
  2. Set the release at init time using your GIT_SHA or version tag. If you don't have CI injecting this yet, hardcode "dev-test" for now and add a TODO to fix it.
  3. Capture user identity by calling Sentry.setUser({ id, email }) inside your auth middleware immediately after a session is verified.
  4. Add one meaningful breadcrumb to the most important user action in your app (e.g., "user submitted checkout form" with cart total as context).
  5. Trigger a test error by temporarily adding throw new Error("sentry test — delete me") to a route, hit it once, confirm the issue appears in your Sentry dashboard with your user identity attached, then remove the line.
  6. Write one alert rule in Sentry (Alerts → New Alert → Issue Alert) targeting: any new issue that affects more than 5 users within 1 hour. Route it to your email. That's your minimum viable alert.

✅ You're done when…

  • Sentry (or equivalent) is initialized with release, environment, and dsn from env vars — confirmed on your Production-Readiness Checklist
  • A test error appears in your Sentry dashboard with a stack trace, the correct environment tag, and an associated user identity
  • You have deleted or disabled at least one alert that was previously "too noisy to be useful" (or explicitly confirmed you had none, and your one new alert is the first)
  • Your one meaningful alert includes a description of what action to take when it fires (runbook link or inline triage steps)
  • Release tracking is configured so that your Sentry dashboard shows which deploy introduced any given issue

➡️ Next: Performance & Load Testing. Build It Right, Or Don't Build It At All. 🏛️
