Incidents & On-Call
Stage 3 · DevOps, Deployment & Operations · B.U.I.L.D. letter: D
Production will break. Not might — will. The engineers who stay calm, communicate clearly, and stop the bleeding before they diagnose the wound are the ones companies trust with their most important systems. This lesson teaches you how to be that engineer.
⚠️ The vibe trap
You vibe-coded a beautiful product, deployed it, and now your phone is blowing up at 2 a.m. The worst instinct is to start hot-patching production live: no branch, no review, no rollback plan, just desperation edits that make things worse faster. The second-worst instinct is to spend three hours working out whose fault the outage is instead of spending three minutes reverting to the last known-good state. Panicking is normal; having a process turns panic into progress.
🚨 The Incident Response Loop
Every incident follows the same shape, no matter how complex. Drilling this loop into muscle memory is the skill.
INCIDENT RESPONSE LOOP
──────────────────────────────────────────────────────────────────
1. DETECT → Alert fires, user reports, anomaly detected
2. ASSESS → What is broken? How many users affected? How bad?
3. COMMUNICATE → Post in #incidents: "We are aware, investigating"
4. MITIGATE → Stop the bleeding NOW (rollback/flag off/scale up)
5. FIX → Root-cause and permanent repair (may take hours/days)
6. RESOLVE → Confirm system healthy, close incident
7. REVIEW → Blameless postmortem within 48–72 hours
──────────────────────────────────────────────────────────────────
KEY RULE: Steps 4 and 5 are NOT the same step.
Mitigate first. Root-cause after services are restored.
Mental model: A surgeon does not stop a bleed to write notes. They stop the bleed, stabilize the patient, and write notes after. Your users are the patient.
Why this matters: A fix merged at 2 a.m. under pressure is far more likely to introduce new bugs than one written calmly the next day, a pattern that shows up again and again in published postmortems. A rollback to a known-good state usually takes minutes at most and buys you the time to find the real problem.
Common mistake: Skipping step 3 (communicate) because you're busy fixing. Users and stakeholders sitting in silence assume the worst. A single "we know, we're on it" message buys enormous goodwill.
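To make step 3 nearly effortless, it can help to have the first status message pre-formatted. The sketch below is a hypothetical helper, not part of any tool: the function name `incident_message`, the message wording, and the default 15-minute update interval are all assumptions you should adapt.

```shell
# Hypothetical helper: builds a one-line status update for an incident channel.
# Arguments: <severity> <short summary> <current status> [minutes until next update]
# The 15-minute default is an assumption, not a standard.
incident_message() {
  printf '[%s] %s. Status: %s. Next update in %s min.\n' "$1" "$2" "$3" "${4:-15}"
}

# Example: print the message. Sending it to a chat tool is left to your setup.
incident_message "SEV-1" "Checkout payments failing" "investigating"
```

The example call prints `[SEV-1] Checkout payments failing. Status: investigating. Next update in 15 min.`, which you can paste into #incidents as-is.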
🟥 Severity Levels
Not everything is a P1. Having defined severity levels prevents every incident from feeling like the end of the world and helps you decide how fast to escalate.
| Severity | Label | Definition | Response Target | Example |
|---|---|---|---|---|
| SEV-1 | Critical | Full outage, data loss risk, revenue stopped | Respond in 5 min, all hands | Prod DB down, payments failing |
| SEV-2 | High | Major feature broken for many users | Respond in 15 min, team lead | Login broken, checkout errors |
| SEV-3 | Medium | Partial degradation, workaround exists | Respond in 1 hour, engineer | Slow dashboard, one region down |
| SEV-4 | Low | Minor bug, cosmetic, rare edge case | Next business day | Typo in error message, broken image |
Mental model: Severity is about user impact right now, not about how interesting the bug is technically. A memory leak that will crash in 6 hours is SEV-3. A payment processor that is down right now is SEV-1.
Why this matters: Without severity levels, every engineer on the team stops what they are doing every time any alert fires. That is how you burn out a team in six months.
Common mistake: Assigning severity based on how stressed you feel rather than objective impact. Write down the criteria before an incident happens so you don't decide under pressure.
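One way to write the criteria down before an incident is to encode them. The sketch below is a minimal illustration: `response_target` mirrors the response targets in the table above, while `classify_severity` and its three yes/no inputs are hypothetical criteria chosen for the example, not an industry standard.

```shell
# response_target <severity>: prints the response target in minutes, per the table above.
# SEV-4 has no paging target, so it prints "next-business-day".
response_target() {
  case "$1" in
    SEV-1) echo 5 ;;
    SEV-2) echo 15 ;;
    SEV-3) echo 60 ;;
    SEV-4) echo next-business-day ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# classify_severity <full_outage> <major_feature_broken> <partial_degradation>
# Hypothetical classifier: each argument is "yes" or "no"; the first "yes" wins.
classify_severity() {
  if [ "$1" = yes ]; then echo SEV-1
  elif [ "$2" = yes ]; then echo SEV-2
  elif [ "$3" = yes ]; then echo SEV-3
  else echo SEV-4
  fi
}

# Usage: a broken login with no full outage.
# response_target "$(classify_severity no yes no)"
```

The point of the design is that the answer depends only on observable impact, so two engineers looking at the same incident reach the same severity.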
🩹 Mitigate First — Stop the Bleeding Fast
The fastest mitigation strategies, in order of preference: roll back the deploy, flip a feature flag off, scale up the instances, redirect traffic. Each of these usually takes minutes. Diagnosis can take hours.
# Quick mitigation: roll back to the previous Vercel deployment
# (works for any Next.js / Vercel project)
# 1. List recent deployments to find the last known-good URL
npx vercel ls --prod
# 2. Promote the previous deployment to production immediately
#    (`vercel rollback <deployment-url>` is an alternative; confirm flags with `vercel promote --help`)
npx vercel promote <deployment-url> --yes
# 3. Verify the rollback landed
curl -o /dev/null -s -w "%{http_code}" https://yourdomain.com/api/health
# Expect: 200
# ── Feature flag kill switch (if using a flag like LaunchDarkly / Flagsmith) ──
# Turn off the broken feature server-side — zero deploy needed
# flag: new-checkout-flow → set to OFF for 100% of users
# This is why you wrap new features in flags BEFORE shipping them (see Lesson 8)
# ── Scale up if the issue is traffic/load ──
# Vercel auto-scales, but for a Node/Docker service hosted on Fly.io:
# fly scale count web=6 --app your-app-name
Mental model: Feature flags are a circuit breaker. You ship the code, but the flag controls whether users see it. When something breaks, you flip the breaker rather than re-deploying.
Why this matters: A rollback or flag flip gets users off the broken path in minutes. A proper fix might take two hours. Those are two hours of lost revenue, eroded trust, and support tickets you just avoided.
Common mistake: Not having a rollback plan at all because "it worked in staging." Every deploy needs a documented one-sentence rollback procedure before it ships. That is part of your Definition of Done now.
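A rollback is only done once the health check agrees. The sketch below wraps the `curl` check from the commands above in a small retry loop, since a freshly promoted deployment can take a few seconds to serve traffic. The function names `retry_until_ok` and `health_ok`, the attempt count, and the delay are assumptions for illustration; `yourdomain.com` is the same placeholder used earlier.

```shell
# retry_until_ok <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
# Note: uses plain (global) variables to stay POSIX-compatible.
retry_until_ok() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    # Wait between attempts, but not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# health_ok <url>: succeeds only when the endpoint answers HTTP 200.
health_ok() {
  [ "$(curl -o /dev/null -s -w '%{http_code}' "$1")" = "200" ]
}

# Usage (placeholder domain):
# retry_until_ok 10 3 health_ok https://yourdomain.com/api/health && echo "rollback verified"
```

Keeping the probe separate from the retry loop means the same loop can verify a flag flip or a scale-up by swapping in a different check.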
📟 On-Call Basics — Sustainable, Not Heroic
On-call means someone is reachable and responsible for incidents outside business hours. Done wrong, it destroys engineers. Done right, it is a professional skill with clear boundaries.
ON-CALL STRUCTURE (small team version)
─────────────────────────────────────────────────────────────────
ROTATION One engineer is primary on-call per week (7-day rotation).
A second engineer is secondary (escalation backup).
Nobody is on-call alone forever.
ESCALATION Primary gets alert → 5 min no response → secondary paged
Secondary → 5 min no response → team lead/founder paged
Always have a chain. Never a single point of human failure.
HANDOFF End of week: primary writes a 5-line status note
(what happened, what's wobbly, what to watch)
and hands off to the next person. No silent transfers.
LIMITS On-call does NOT mean "available for feature work."
It means: respond to SEV-1/SEV-2 alerts.
Compensate on-call hours if a team has > 2 wake-ups/week.
TOOLING PagerDuty / Opsgenie / Better Uptime / plain SMS.
Alerts route to the primary. Do not page for SEV-4.
Every alert that fires must be actionable — no noise.
─────────────────────────────────────────────────────────────────
Mental model: On-call is a relay race, not a marathon. Each runner carries the baton for one leg and hands it off clean. No one runs the whole race.
Why this matters: "Hero culture" — one engineer always online, always fixing things — looks productive until that engineer quits or burns out, and all institutional knowledge walks out with them. Sustainable rotations build resilient teams.
Common mistake: Never rotating, never documenting runbooks, and expecting the same person to answer every alert because "they know the system best." That is a bus-factor-1 timebomb. Write the runbook.
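The escalation chain in the structure above is simple enough to state as code, which is a useful check that it has no gaps. This is a minimal sketch assuming the 5-minute windows from the ESCALATION rule; the function name `page_target` and the role labels it prints are hypothetical, and real paging tools configure this through their own escalation policies.

```shell
# page_target <minutes_unacknowledged>
# Prints who should be paged, following the chain above:
#   0-4 min: primary, 5-9 min: secondary, 10+ min: team lead or founder.
page_target() {
  if [ "$1" -lt 5 ]; then echo primary
  elif [ "$1" -lt 10 ]; then echo secondary
  else echo team-lead
  fi
}

# Usage: an alert that has gone 7 minutes without acknowledgment.
# page_target 7
```

Because the last branch has no upper bound, an unacknowledged alert always lands on someone, which is the "never a single point of human failure" rule expressed in three lines.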
📋 The Blameless Postmortem
A postmortem is a structured reflection held within 48–72 hours of every SEV-1 or SEV-2. The goal is to understand the system well enough to prevent recurrence — not to find someone to blame.
BLAMELESS POSTMORTEM TEMPLATE
════════════════════════════════════════════════════════════════
INCIDENT TITLE: [Brief description — e.g., "Checkout Payments Down 47 min"]
DATE / DURATION: [2026-06-01 · 02:14 UTC → 03:01 UTC · 47 minutes]
SEVERITY: SEV-1
AUTHOR(S): [Names of engineers who led response]
STATUS: Resolved ✓
── IMPACT ────────────────────────────────────────────────────
Users affected: ~1,200 active sessions
Revenue impact: ~$840 in failed transactions (estimated)
Features down: Checkout flow, order confirmation emails
── TIMELINE ──────────────────────────────────────────────────
02:14 PagerDuty alert: payment error rate > 5% (threshold: 1%)
02:18 On-call (Alex) acknowledged, started investigation
02:22 Identified: new Stripe SDK version in deploy at 01:58
02:24 Initiated rollback to previous deployment
02:31 Rollback confirmed live, error rate dropped to 0%
02:45 Root cause confirmed in SDK changelog (breaking API change)
03:01 Incident closed after 30 min of monitoring with no recurrence
── ROOT CAUSE ────────────────────────────────────────────────
The Stripe SDK was upgraded from v12.3.1 to v13.0.0 in a
dependency bump PR. v13 changed the PaymentIntent API response
shape. Our payment handler expected the v12 shape and threw
on every checkout attempt. The PR lacked integration tests
against the real Stripe sandbox response format.
── WHAT WENT WELL ────────────────────────────────────────────
• Alert fired within 16 minutes of deploy (monitoring works)
• Rollback took 7 minutes from initiation to live (13 minutes from acknowledgment)
• Team communication in #incidents was clear throughout
── WHAT WENT WRONG ──────────────────────────────────────────
• No integration test covered the Stripe payment response shape
• Dependency bumps were not flagged for extra review
• No staging smoke test ran against the payment flow post-deploy
── ACTION ITEMS ─────────────────────────────────────────────
| # | Action | Owner | Due |
|---|------------------------------------------------|--------|------------|
| 1 | Add integration test: Stripe PaymentIntent e2e | Alex | 2026-06-08 |
| 2 | Add CI rule: flag major version dep bumps | Jordan | 2026-06-06 |
| 3 | Add staging smoke test step to deploy pipeline | Alex | 2026-06-10 |
| 4 | Document rollback procedure in runbook | Morgan | 2026-06-07 |
── LESSONS LEARNED ──────────────────────────────────────────
"We trusted the tests we had rather than testing the integration
we changed. Every third-party API contract should have at least
one test that hits a sandbox with real response shapes."
BLAME STATEMENT: No individual is blamed. The system allowed a
breaking SDK change to reach production without detection.
The action items harden the system so the same failure
mode cannot recur regardless of who makes the next dep bump.
════════════════════════════════════════════════════════════════
Mental model: Blame the process, not the person. The question is never "who pushed the bad code?" The question is always "why did our system let bad code reach production?" Those are very different questions with very different answers.
Why this matters: Engineers who fear blame hide incidents, avoid deploying, and stop taking ownership. Blameless culture creates the psychological safety that lets teams move fast and recover faster.
Common mistake: Writing postmortems that identify a human as the root cause ("Alex should have checked the changelog"). Alex is not the root cause. The absent integration test is the root cause. Fix the system.
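Timeline arithmetic is easy to get wrong when a postmortem is written from memory, so it can be worth checking the durations you quote against the timestamps. The sketch below assumes `HH:MM` times on the same day and in the same time zone; the function names `to_minutes` and `minutes_between` are hypothetical.

```shell
# to_minutes <HH:MM>: prints minutes since midnight.
to_minutes() {
  h=${1%%:*}; m=${1##*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  h=${h#0}; m=${m#0}
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# minutes_between <start HH:MM> <end HH:MM>
# Assumes both times fall on the same day.
minutes_between() {
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

# Usage with the sample timeline: alert at 02:14, incident closed at 03:01.
# minutes_between 02:14 03:01
```

Run against the sample timeline, alert to close (02:14 to 03:01) comes out at 47 minutes and deploy to alert (01:58 to 02:14) at 16 minutes, matching the figures in the template.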
🛠️ Your Mission
Write a one-page Incident Runbook and a Postmortem template for your own app. Paste both into a docs/runbooks/ folder in your repo.
Your Runbook must answer these five questions:
- Where do I look first? (links to your Vercel/Fly/Railway dashboard, your logging service, your DB console)
- How do I roll back? (the exact one-liner command or UI steps for your platform)
- Who do I tell? (Slack channel name, stakeholder contact — even if it's just you)
- What are my severity levels? (adapt the table above to your actual features)
- What is the on-call rotation? (even solo: "I am always primary; I escalate to [mentor/friend] if I am unreachable")
Your Postmortem template must include:
- Impact, Timeline, Root Cause, What Went Well, What Went Wrong, Action Items with owners and due dates, and the Blame Statement ("No individual is blamed...")
The templates in this lesson are real. Copy them, customize the service names and thresholds, and commit them. A runbook that exists is infinitely better than a perfect runbook you haven't written yet.
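If a blank page is what is stopping you, a small scaffold can create both files with the required headings. This is a sketch under assumptions: the function name `scaffold_runbooks` and the file name `postmortem-template.md` are hypothetical, and the headings are taken from the two lists above. It never overwrites a file that already exists.

```shell
# scaffold_runbooks [dir]: creates starter files, skipping any that already exist.
# "postmortem-template.md" is a hypothetical file name; rename it to suit your repo.
scaffold_runbooks() {
  dir=${1:-docs/runbooks}
  mkdir -p "$dir"

  [ -e "$dir/incident-checklist.md" ] || cat > "$dir/incident-checklist.md" <<'EOF'
# Incident Runbook
## Where do I look first?
## How do I roll back?
## Who do I tell?
## What are my severity levels?
## What is the on-call rotation?
EOF

  [ -e "$dir/postmortem-template.md" ] || cat > "$dir/postmortem-template.md" <<'EOF'
# Postmortem: [incident title]
## Impact
## Timeline
## Root Cause
## What Went Well
## What Went Wrong
## Action Items (owner, due date)
## Blame Statement
No individual is blamed.
EOF
}

# Usage, from the root of your repo:
# scaffold_runbooks docs/runbooks
```

The empty headings are the point: filling them in for your own platform is the mission.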
✅ You're Done When…
- You have completed the Production-Readiness Checklist (a written document in your repo at docs/runbooks/incident-checklist.md covering all five runbook questions, the severity table, the rotation plan, and at least one rollback procedure with the exact command)
- You have written a blameless postmortem for a real or simulated past incident in your app (even a staging breakage counts; practice the format)
- Every one of your deploys has a one-sentence rollback procedure documented before it ships
- You have at least one alert configured (uptime, error rate, or latency) that would have caught your last outage faster
- Your team (even if it's just you) knows the seven-step incident loop without looking it up
➡️ Next: Maintenance, Reliability & Disaster Recovery. Build It Right, Or Don't Build It At All. 🏛️