Maintenance, Reliability & Disaster Recovery
Stage 3 · DevOps, Deployment & Operations · B.U.I.L.D. letter: D
You shipped. Confetti fell. Then three months later a dependency has a critical CVE, your SSL cert expired at 2 a.m., and nobody knows how to restore the database because "we never actually tested that." Shipping is the start — keeping it alive is the real job.
⚠️ The vibe trap
You vibe-coded something incredible, deployed it, and it worked. So you moved on to the next thing. Six months later, your Node packages are two major versions behind, a transitive dependency has three open CVEs, your Let's Encrypt cert expired because the renewal cron silently failed, and your only "backup" is a mental note that Supabase probably saves stuff somewhere. Vibe coding got you live; maintenance discipline keeps you there. The apps that survive long-term aren't the ones written perfectly — they're the ones that have someone who checks in on them.
🔄 Dependency Hygiene — Don't Let It Rot
Dependencies are living things. They ship security patches, break APIs, and occasionally get abandoned by their maintainers. Ignoring them is like skipping oil changes: nothing bad happens for a while, then everything breaks at once.
The mental model: Think of your package.json (or requirements.txt, go.mod, etc.) as a team roster. Every dependency is a person you're counting on. You should know who's active, who's reliable, and who's phoning it in.
Why it matters: Outdated dependencies are one of the most common sources of security breaches for web apps. Most breaches don't exploit zero-days; they exploit known vulnerabilities in libraries where a patch had already been available for months.
# Audit for known vulnerabilities right now
npm audit
# Get a report with severity levels
npm audit --json | jq '.vulnerabilities | to_entries[] | {name: .key, severity: .value.severity}'
# Fix automatically where safe (minor + patch bumps only)
npm audit fix
# See all outdated packages and how far behind you are
npm outdated
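# Update a single package within the semver range in package.json
npm update express
# Jump a single package to the newest release, even across a major version (check the changelog first!)
npm install express@latest
# Update all packages to the latest allowed by semver ranges
npm update
# Rewrite package.json to the newest versions, majors included, then install (review the diff first)
npx npm-check-updates -u && npm install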
# Python equivalent
pip list --outdated
pip install --upgrade package-name
# Find unused or missing dependencies (depcheck does not detect abandoned or deprecated packages)
npx depcheck
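To turn the audit report into a pass/fail signal, a small script can read the same fields the jq filter above reads. This is a minimal sketch, assuming the report has a top-level `vulnerabilities` object keyed by package name with a `severity` field on each entry. The package names in `SAMPLE_REPORT`, the function names, and the choice to block on high and critical are illustrative assumptions, not part of npm.

```python
import json
from collections import Counter

# Severities that block a release in this sketch (an assumption; tune to your policy).
BLOCKING_SEVERITIES = {"high", "critical"}


def summarize_audit(report: dict) -> Counter:
    """Count vulnerable packages per severity from parsed `npm audit --json` output."""
    vulnerabilities = report.get("vulnerabilities", {})
    return Counter(entry.get("severity", "unknown") for entry in vulnerabilities.values())


def blocking_packages(report: dict) -> list[str]:
    """Return the sorted names of packages with a high or critical finding."""
    return sorted(
        name
        for name, entry in report.get("vulnerabilities", {}).items()
        if entry.get("severity") in BLOCKING_SEVERITIES
    )


# Hypothetical report containing only the fields this sketch relies on.
SAMPLE_REPORT = json.loads("""
{
  "vulnerabilities": {
    "example-parser": {"severity": "critical"},
    "example-http":   {"severity": "high"},
    "example-logger": {"severity": "low"}
  }
}
""")

print(dict(summarize_audit(SAMPLE_REPORT)))
print("blocking:", blocking_packages(SAMPLE_REPORT))
```

In real use you would replace `SAMPLE_REPORT` with `json.load(sys.stdin)` and pipe `npm audit --json` into the script, exiting non-zero when `blocking_packages` returns anything.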
Automate it; don't rely on memory. GitHub's Dependabot opens pull requests automatically when your dependencies have updates or known vulnerabilities. Enable it in your repo's security settings: version updates arrive on the schedule you configure, and security updates are raised when an advisory affects one of your dependencies. Renovate Bot is the alternative with more configuration options. Either way: let a bot nag you so you don't have to remember.
# .github/dependabot.yml: drop this in and forget about forgetting
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 5
    labels:
      - "dependencies"
    ignore:
      # Pin major versions manually; review those yourself
      - dependency-name: "*"
        update-types: ["version-update:semver-major"]
Common mistake: Enabling Dependabot and then letting the PRs pile up unreviewed. 47 open dependency PRs isn't maintenance — it's deferred anxiety. Set a calendar reminder: every Monday, merge or close dependency PRs. If you can't commit to reviewing them, at least enable auto-merge for patch-level security fixes.
💾 Backups, Tested Restores & Disaster Recovery
A backup you've never tested isn't a backup — it's a hope. Disaster recovery is the plan you execute when something goes catastrophically wrong: database corruption, accidental deletion, cloud provider outage, or a bad deploy that wipes production data.
The mental model: There are two numbers that define your disaster recovery posture. Write them down before you need them.
RPO and RTO — The Two Numbers That Define Your DR Plan
======================================================
RPO (Recovery Point Objective)
→ "How much data can we afford to lose?"
→ Measured in time: "We can lose up to 1 hour of data."
→ Drives your backup FREQUENCY.
→ Example: If you back up every 6 hours, your RPO is up to 6 hours.
→ If you can't lose ANY transactions, you need continuous replication.
RTO (Recovery Time Objective)
→ "How fast must we be back online after a disaster?"
→ Measured in time: "We must recover within 4 hours."
→ Drives your recovery AUTOMATION and REHEARSAL.
→ Example: If RTO is 30 minutes, your restore steps can't take 45.
→ If your restore process is undocumented, your RTO is "unknown" (bad).
Example RPO/RTO targets by app type:
┌─────────────────────────┬──────────────────┬─────────────────────┐
│ App Type │ Typical RPO │ Typical RTO │
├─────────────────────────┼──────────────────┼─────────────────────┤
│ Personal side project │ 24 hours │ 48 hours │
│ Free community tool │ 6 hours │ 8 hours │
│ Paid SaaS (small) │ 1 hour │ 2 hours │
│ E-commerce / payments │ 0 (continuous) │ 15–30 minutes │
│ Healthcare / critical │ 0 (continuous) │ < 15 minutes │
└─────────────────────────┴──────────────────┴─────────────────────┘
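The two numbers become useful once you compare them with what you actually do. The sketch below checks a backup interval against the RPO and a rehearsed restore time against the RTO. The class and function names are hypothetical, and treating "one full backup interval" as the worst-case data loss is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoveryTargets:
    rpo: timedelta  # most data, measured in time, you can afford to lose
    rto: timedelta  # longest acceptable time to get back online


def check_recovery_posture(
    targets: RecoveryTargets,
    backup_interval: timedelta,
    rehearsed_restore_time: Optional[timedelta],
) -> list[str]:
    """Return a list of gaps between the targets and current practice."""
    problems = []
    # A failure just before the next backup loses a full interval of data.
    if backup_interval > targets.rpo:
        problems.append(
            f"backups every {backup_interval} cannot meet an RPO of {targets.rpo}"
        )
    # A restore that was never timed gives no evidence about the RTO.
    if rehearsed_restore_time is None:
        problems.append("restore has never been timed, so the real RTO is unknown")
    elif rehearsed_restore_time > targets.rto:
        problems.append(
            f"restore took {rehearsed_restore_time}, over the RTO of {targets.rto}"
        )
    return problems


# The "Paid SaaS (small)" row from the table: RPO 1 hour, RTO 2 hours.
small_saas = RecoveryTargets(rpo=timedelta(hours=1), rto=timedelta(hours=2))

for problem in check_recovery_posture(small_saas, timedelta(hours=6), None):
    print("GAP:", problem)
```

An empty list means current practice fits the targets; anything else is a concrete item for the runbook.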
Why it matters: Most apps never need disaster recovery — right up until they do, at 2 a.m. on a Sunday, and you're the one who has to fix it. The people who sleep through outages are the ones who tested their restores before they needed them.
Write your DR runbook now (not after the incident):
DR Runbook Template — fill this in for your app
================================================
App name: _______________
Last tested: _______________
BACKUP LOCATIONS
Primary: Supabase automatic backups (daily, 7-day retention on the Pro plan; confirm what your plan includes)
Secondary: pg_dump to S3 bucket s3://myapp-backups/db/ (daily cron)
File storage: S3 versioning enabled on s3://myapp-uploads/
RPO TARGET: 24 hours
RTO TARGET: 4 hours
RESTORE STEPS (DATABASE)
1. Log into Supabase dashboard → project → Backups
2. Select backup from date prior to incident
3. Click "Restore". Confirm beforehand whether your restore option overwrites the current project or creates a new one
4. If the restore produced a new project, update the DATABASE_URL env var in Vercel to point to it
5. Run smoke test: curl https://myapp.com/api/health
6. Verify latest 10 user records match expectations
7. Update DNS / confirm old project is paused
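RESTORE STEPS (FILE STORAGE)
1. S3 versioning: restore via AWS console → select object → Versions tab
RESTORE STEPS (DATABASE, FROM SECONDARY pg_dump BACKUP)
1. aws s3 cp s3://myapp-backups/db/latest.sql.gz ./restore.sql.gz
2. gunzip -c restore.sql.gz | psql "$DATABASE_URL" (restore into a fresh database first)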
WHAT TO COMMUNICATE (status page / email template):
"We experienced a data incident at [TIME]. We are restoring from backup.
Expected recovery time: [RTO]. Data may be restored to [DATE] (our RPO).
We will update at [TIME+1HR]."
CONTACTS
Supabase support: https://supabase.com/support
Vercel status: https://vercel-status.com
AWS status: https://status.aws.amazon.com
Common mistake: Testing the backup, not the restore. Many engineers verify that pg_dump ran without errors. Far fewer have ever actually run psql mydb < backup.sql in a test environment and confirmed the data came back. Schedule a quarterly restore drill — it takes 30 minutes and will save you hours of panic someday.
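A restore drill has three parts: dump, load into an empty database, and compare. The sketch below runs that loop end to end using Python's built-in SQLite as a stand-in, so it is runnable anywhere. With Postgres you would swap the dump and restore steps for pg_dump and psql. The table, the sample rows, and the row-count comparison are illustrative assumptions; a real drill would also spot-check actual records.

```python
import sqlite3


def dump_database(conn: sqlite3.Connection) -> str:
    """Serialize the whole database to SQL text (the role pg_dump plays for Postgres)."""
    return "\n".join(conn.iterdump())


def restore_database(sql_dump: str) -> sqlite3.Connection:
    """Load a SQL dump into a brand-new, empty database, as a restore drill would."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(sql_dump)
    return restored


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Map each user table name to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


def restore_matches_source(source: sqlite3.Connection, restored: sqlite3.Connection) -> bool:
    """True when the restored copy has the same tables with the same row counts."""
    return table_row_counts(source) == table_row_counts(restored)


# Hypothetical "production" data for the drill.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
source.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)
source.commit()

backup_sql = dump_database(source)
restored = restore_database(backup_sql)
print("restore verified:", restore_matches_source(source, restored))
```

The comparison step is the part most drills skip. A dump that ran without errors but restores to zero rows fails this check, which is exactly the case a backup-only test would miss.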
📊 Reliability Targets — SLA, SLO & Error Budgets
You can't improve reliability without defining what "reliable" means. These three terms do that job.
The mental model: Think of reliability like a bank account. You start each month with a budget of allowed downtime. Every minute the app is down (or slow, or erroring) spends from that budget. When the budget runs out, you stop shipping new features and focus entirely on stability.
SLA / SLO / Error Budget — Plain English
=========================================
SLA (Service Level Agreement)
→ A PROMISE to your users, often in your Terms of Service.
→ "We guarantee 99.9% uptime per month."
→ Violation = potential refunds, legal consequences, lost trust.
→ You don't need a formal SLA as a side project, but you should
know what you'd promise if you had paying customers.
SLO (Service Level Objective)
→ Your INTERNAL target, stricter than your SLA.
→ "We aim for 99.95% uptime internally."
→ The buffer between SLO and SLA gives you room to fix things
before you breach your customer promise.
Error Budget
→ How much "bad" you're allowed in a time window.
→ Calculated from your SLO.
Calculating your error budget:
SLO: 99.9% uptime per month
Month = 30 days = 43,200 minutes
0.1% of 43,200 = 43.2 minutes of allowed downtime per month
SLO: 99.95% uptime per month
0.05% of 43,200 = 21.6 minutes of allowed downtime per month
SLO: 99.99% ("four nines")
0.01% of 43,200 = 4.3 minutes of allowed downtime per month
Rule of thumb for side projects and free tools:
99.5% = 3.6 hours/month downtime allowed → reasonable for free apps
99.9% = 43 min/month → appropriate once you have paying users
99.99% = 4 min/month → requires redundancy, auto-failover, on-call
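The arithmetic above is easy to script so you can plug in your own SLO. This is a minimal sketch that assumes the same 30-day month of 43,200 minutes; the function names are hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200, matching the worked examples


def error_budget_minutes(
    slo_percent: float, window_minutes: float = MINUTES_PER_30_DAY_MONTH
) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return (100 - slo_percent) / 100 * window_minutes


def remaining_budget_minutes(
    slo_percent: float,
    downtime_minutes: float,
    window_minutes: float = MINUTES_PER_30_DAY_MONTH,
) -> float:
    """Budget left after the downtime already spent; negative means the SLO is breached."""
    return error_budget_minutes(slo_percent, window_minutes) - downtime_minutes


for slo in (99.5, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.1f} minutes per month")
```

A negative result from `remaining_budget_minutes` is the signal to stop shipping features and work on stability until the window resets.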
Why it matters: Without a reliability target, every outage feels like a crisis and there's no framework for deciding whether to ship a risky feature. With a target, you can say "we have 30 minutes of error budget left this month — let's delay that deploy until next week."
Common mistake: Confusing uptime with reliability. Your app can be "up" (returning HTTP 200) but serving a blank page, returning stale data, or taking 30 seconds to respond. A complete reliability SLO covers availability (is it reachable?), latency (is it fast?), and correctness (is it returning valid responses?).
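One way to make the three dimensions concrete is to score every health probe on all of them and count it as "good" only when all three pass. The sketch below assumes a probe records a status code, a latency, and a response body. The 1-second threshold and the `"status":"ok"` marker are placeholder assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    status_code: int
    latency_seconds: float
    body: str


def evaluate_probe(
    probe: ProbeResult,
    max_latency_seconds: float = 1.0,      # hypothetical latency threshold
    required_marker: str = '"status":"ok"',  # hypothetical correctness signal
) -> dict[str, bool]:
    """Score one probe on availability, latency, and correctness."""
    return {
        "available": 200 <= probe.status_code < 300,
        "fast": probe.latency_seconds <= max_latency_seconds,
        "correct": required_marker in probe.body,
    }


def is_good_event(probe: ProbeResult) -> bool:
    """A probe counts toward the SLO only if every dimension passes."""
    return all(evaluate_probe(probe).values())


# "Up" by status code alone, yet slow and serving an empty page.
blank_page = ProbeResult(status_code=200, latency_seconds=30.0, body="")
print(evaluate_probe(blank_page))
print("good event:", is_good_event(blank_page))
```

The ratio of good events to total probes over a window is then the number you compare against your SLO.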
🔁 Redundancy & No Single Point of Failure
A single point of failure (SPOF) is any component whose failure takes down your entire app. Your goal is to eliminate them at every layer you control.
The mental model: Redundancy means "two of everything important." If you have one database server, one API server, and one region — any one of those going down means total outage. Redundancy doesn't require infinite money; it requires intentional architecture.
Common SPOFs and How to Eliminate Them
=======================================
SPOF: Single database instance
FIX: Enable read replicas + point-in-time recovery (PITR)
Supabase: Project Settings → Database → enable PITR
Cost: PITR is a paid add-on on Supabase; check current pricing before enabling
SPOF: Single deployment region
FIX: Deploy to multiple regions OR use a global CDN edge network
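Vercel: static assets are served from the global edge network automatically; serverless functions run in a single region by default, so check your function region settings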
Manual: consider Fly.io multi-region or Cloudflare Workers
SPOF: Single API key / single admin account
FIX: Use a secrets manager with rotation; never share credentials
Add a second admin to your Supabase project
Store service keys in Vercel environment variables, not code
SPOF: Single developer knows the system
FIX: Write runbooks. Document deploys. If you got hit by a bus,
could someone else restore the app from your DR runbook?
(This is also the "bus factor" — aim for bus factor ≥ 2)
SPOF: Single cloud provider for everything
FIX: Not always necessary, but know the manual failover path.
What happens if Vercel has an outage? Can you serve static
files from an S3 / Cloudflare Pages backup?
Redundancy checklist by layer:
✓ DNS: Cloudflare (not your registrar's default DNS) — faster, more reliable
✓ CDN: Vercel / Netlify edge — your static assets survive origin outages
✓ Database: daily backups + PITR enabled + tested restore procedure
✓ Auth: backup admin account exists and is documented
✓ Secrets: stored in secret manager, not .env committed to git
✓ Monitoring: uptime alert fires if the app goes down (see D8)
Common mistake: Believing that "the cloud is redundant so I don't have to think about it." Your cloud provider has redundant infrastructure — but your app running on it may still have architectural SPOFs. The 2021 Fastly outage took down Reddit, The Guardian, and the UK government website simultaneously. All of them were on "reliable" infrastructure.
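The "two of everything important" rule can be applied mechanically to a plain inventory. The sketch below flags any component with fewer than two independent instances, including people, which is how the bus factor shows up. The inventory entries and the function name are hypothetical.

```python
def find_spofs(inventory: dict[str, int], minimum: int = 2) -> list[str]:
    """Return the sorted names of components with fewer than `minimum` instances."""
    return sorted(name for name, count in inventory.items() if count < minimum)


# Hypothetical inventory: component name -> independent instances you control.
INVENTORY = {
    "database": 1,
    "api_server": 2,
    "admin_account": 1,
    "people_who_can_deploy": 1,
}

for name in find_spofs(INVENTORY):
    print("SPOF:", name)
```

Counting instances is only a first pass: two API servers that share one database are still exposed to that database, so the list is a prompt for review rather than a verdict.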
📋 The Recurring Maintenance Checklist
Maintenance isn't a one-time cleanup — it's a rhythm. These tasks need to happen on a schedule. Block calendar time for them.
Recurring Maintenance Checklist
================================
WEEKLY
[ ] Review and merge (or close) open Dependabot/Renovate PRs
[ ] Check error tracking dashboard (Sentry / D9) for new issues
[ ] Scan uptime monitor alerts for anomalies
[ ] Review any rate-limit or quota warnings from cloud providers
MONTHLY
[ ] Run: npm audit; npm outdated (use ; rather than &&, since npm audit exits non-zero when it finds issues), then review and act on results
[ ] Rotate any API keys older than 90 days (or per your policy)
[ ] Check SSL/TLS cert expiry: echo | openssl s_client -connect yourapp.com:443 2>/dev/null | openssl x509 -noout -dates
[ ] Verify domain renewal date (log in to registrar — check expiry)
[ ] Review cloud spend vs. budget — look for unexpected cost spikes
[ ] Check storage quotas: database size, file storage, log retention
[ ] Test backup restore in a staging environment (even a spot-check)
[ ] Review and prune stale feature flags, dead environment variables
[ ] Check for any deprecated API warnings in your cloud provider console
QUARTERLY
[ ] Full disaster recovery drill — restore from backup to staging
[ ] Update your DR runbook with any process changes
[ ] Audit user access — remove accounts for people who left the project
[ ] Review and update third-party OAuth app permissions
[ ] Read security advisories for your major dependencies
[ ] Review your SLO metrics for the quarter — are you hitting targets?
[ ] Prune old deployments, unused branches, stale preview environments
[ ] Update CLAUDE.md / team docs to reflect any architecture changes
ANNUALLY
[ ] Renew domain registration (set auto-renew + calendar reminder)
[ ] Review and re-sign any vendor contracts or API terms of service
[ ] Full security review — consider an external audit
[ ] Archive inactive user data per your privacy policy
[ ] Full review of infrastructure costs — is this still the right stack?
Why it matters: Certs expire on a Tuesday. Domains expire on a holiday weekend. API keys leak in a git history audit six months after the fact. Quotas fill up silently until your app starts failing at exactly the wrong moment. None of these are dramatic — they're just scheduled hygiene. The checklist turns "things I should probably do sometime" into "things I do on the second Monday of each month."
Common mistake: Doing all this manually. Most of it can be automated or alerted. Let Dependabot handle dependency PRs. Let your monitoring tool (Uptime Robot, Better Stack) alert on cert expiry. Set a billing alert in AWS/GCP/Vercel for 80% of your budget. The checklist is for the things that genuinely need human judgment — automate the rest.
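As one example of automating a checklist item, the output of the openssl cert-expiry command in the monthly list can be parsed into "days remaining" and compared with a threshold. This sketch assumes the usual `notAfter=Mon DD HH:MM:SS YYYY GMT` line format and uses the standard library's `ssl.cert_time_to_seconds`. The sample dates and function names are made up for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def parse_not_after(openssl_dates_output: str) -> float:
    """Extract the notAfter timestamp (seconds since the epoch) from `openssl x509 -dates` output."""
    for line in openssl_dates_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "notAfter":
            return float(ssl.cert_time_to_seconds(value.strip()))
    raise ValueError("no notAfter line found in openssl output")


def days_until_expiry(openssl_dates_output: str, now: Optional[float] = None) -> float:
    """Days left before the certificate expires; negative if it already has."""
    current = time.time() if now is None else now
    return (parse_not_after(openssl_dates_output) - current) / SECONDS_PER_DAY


# Made-up sample in the format the openssl command prints.
SAMPLE_OUTPUT = "notBefore=Oct  7 09:34:43 2017 GMT\nnotAfter=Jan  5 09:34:43 2018 GMT"

print(f"days remaining: {days_until_expiry(SAMPLE_OUTPUT):.0f}")
```

Run on a schedule with the real command output, a result under something like 14 days would be the trigger for an alert; the exact threshold is a judgment call.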
📐 Technical Debt as Ongoing Upkeep
Technical debt is the cost of shortcuts taken in the past. It's not inherently bad — moving fast to validate an idea is worth it. But unmanaged debt compounds, exactly like financial debt.
The mental model: Every time you write a TODO comment, defer a refactor, or ship a workaround instead of a real fix, you're taking out a loan. The "interest" is paid every time someone touches that code and has to work around the mess you left. At some point the interest is so high the codebase becomes unmaintainable.
Practical approach: Track debt explicitly. A simple markdown file or GitHub Issues label is enough.
Technical Debt Register Template
==================================
Each entry: what it is, why it was deferred, and its blast radius.
ID      Area      Debt Description                          Impact    Age
TD-001  Auth      JWT expiry not refreshed silently         Medium    3 mo
TD-002  API       /upload endpoint has no rate limit        High      1 mo
TD-003  DB        users table missing index on email        Medium    6 mo
TD-004  Frontend  3 components copy-pasted, not shared      Low       2 mo
TD-005  Infra     Prod and staging share same DB            Critical  5 mo
Priority rules:
Critical: fix in current sprint, no new features until resolved
High: fix within 2 sprints
Medium: schedule in next maintenance window
Low: batch with next related refactor
Rule of thumb: spend 20% of engineering time on debt reduction.
If your debt register has more than 10 High/Critical items,
stop shipping features and hold a "debt sprint."
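The register and its rules are simple enough to keep as data. The sketch below encodes the five example entries, sorts them by impact, and applies the "more than 10 High/Critical items" debt-sprint rule. Breaking ties by age (oldest first) is an added assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Lower rank sorts first, mirroring the priority rules above.
IMPACT_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}
DEBT_SPRINT_THRESHOLD = 10  # "more than 10 High/Critical items"


@dataclass(frozen=True)
class DebtItem:
    item_id: str
    area: str
    description: str
    impact: str
    age_months: int


REGISTER = [
    DebtItem("TD-001", "Auth", "JWT expiry not refreshed silently", "Medium", 3),
    DebtItem("TD-002", "API", "/upload endpoint has no rate limit", "High", 1),
    DebtItem("TD-003", "DB", "users table missing index on email", "Medium", 6),
    DebtItem("TD-004", "Frontend", "3 components copy-pasted, not shared", "Low", 2),
    DebtItem("TD-005", "Infra", "Prod and staging share same DB", "Critical", 5),
]


def prioritized(register: list[DebtItem]) -> list[DebtItem]:
    """Sort by impact, then oldest first within the same impact (tie-break is an assumption)."""
    return sorted(register, key=lambda item: (IMPACT_RANK[item.impact], -item.age_months))


def needs_debt_sprint(register: list[DebtItem]) -> bool:
    """Apply the rule of thumb: too many High/Critical items means pause features."""
    urgent = sum(1 for item in register if item.impact in ("High", "Critical"))
    return urgent > DEBT_SPRINT_THRESHOLD


for item in prioritized(REGISTER):
    print(item.item_id, item.impact, item.description)
print("debt sprint needed:", needs_debt_sprint(REGISTER))
```

The same structure maps directly onto GitHub Issues: the impact becomes a label and the issue creation date gives you the age.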
Why it matters: The apps that die don't die from competition — they die from their own weight. When every new feature takes three times as long because of the mess underneath, the vibe evaporates. Paying down debt is how you protect your ability to keep building the cool stuff.
Common mistake: Treating debt as optional cleanup. It's not. Schedule debt work the same way you schedule feature work. If it only happens "when we have time," it never happens.
🛠️ Your Mission
You now know how to ship. This lesson is about how to keep it alive for years, not weeks. Here's your maintenance setup mission:
Part 1 — Automate dependency hygiene:
- Add a .github/dependabot.yml file to your project repo (use the template above).
- Run npm audit on your project right now. Record the results. Fix any critical or high findings.
- Run npm outdated. Pick the two most outdated non-major dependencies and update them.
Part 2 — Write your DR plan:
- Fill in the DR Runbook Template above for your actual app.
- Define your RPO and RTO targets and write them down.
- Locate your current backup — where is it, how old is it, and have you ever actually restored from it? If not, restore it to a staging environment now.
Part 3 — Build your maintenance rhythm:
- Copy the Recurring Maintenance Checklist and paste it into a MAINTENANCE.md in your project root.
- Set a recurring calendar event: "Run maintenance checklist", monthly, 30 minutes.
- Create a GitHub Issues label called tech-debt and file at least two issues for known shortcuts in your codebase.
✅ You're done when…
- You have completed and checked off every item in the Recurring Maintenance Checklist for the current month and confirmed your app passes the Production-Readiness Checklist
- Your MAINTENANCE.md file is committed to your repo and includes your RPO and RTO targets
- npm audit shows zero critical or high vulnerabilities on your project
- A Dependabot or Renovate config file is live in your repo and has opened at least one PR
- You have an actual, tested restore procedure — not just a backup that "probably works"
- Your technical debt register has at least 3 items filed, triaged by impact
➡️ Next: the Capstone — Localhost → Monitored Production Service. Build It Right, Or Don't Build It At All. 🏛️