DevOps, Deployment & Operations
🚀 DevOps & Ops · Lesson 13 of 13

Capstone: Localhost → Monitored Production Service

Take an app from localhost to a monitored, CI/CD-deployed, observable production service.

Stage 3 · DevOps, Deployment & Operations · Capstone

Your laptop is not a data center. At some point — and that point is now — you have to take the thing you built and hand it over to the internet, 24 hours a day, with real users depending on it. This capstone is that moment.


🎯 The mission

Take a working application — something you built in a previous track, a side project, or a fresh service you assemble for this capstone — and carry it from localhost:3000 to a live, deployed, monitored, production service that you can operate with confidence. You are not just deploying once and walking away. You are wiring up the full operational loop: secrets managed safely, a pipeline that ships code automatically on green, structured logs you can actually search, a metric that tells you if the service is healthy, an alert that pages you when it isn't, a load test you've actually run, and a runbook a stranger could follow at 2 a.m. when you're asleep.

By the time you're done, you will have proven — to yourself and to anyone who reads your repo — that you ship software responsibly.


🧱 What to do

Work through each layer systematically. Every checkbox maps directly to one or more lessons in D6.

Environments

  • Separate development, staging, and production environments with distinct configs — no shared databases, no shared secrets
  • Document what is different between environments and why in a short ENVIRONMENTS.md

Deployment

  • Deploy the app to a real host (Railway, Render, Fly.io, Vercel, AWS, GCP, Azure — your choice)
  • Confirm the production URL is publicly reachable and returns a valid response

Config and Secrets

  • Every secret (API keys, DB connection strings, auth tokens) lives in the platform's secret store — zero secrets in source code or committed .env files
  • Non-secret config (feature flags, timeouts, region) lives in environment variables whose values differ per environment
  • Add a startup check that fails fast and loudly if a required env var is missing
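
The fail-fast startup check can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the variable names in `REQUIRED_VARS` and the `loadConfig` helper are hypothetical, so swap in whatever your service actually needs.

```typescript
// Hypothetical list of variables this service cannot start without.
const REQUIRED_VARS = ["DATABASE_URL", "SESSION_SECRET", "SENTRY_DSN"] as const;

type RequiredVar = (typeof REQUIRED_VARS)[number];
type AppConfig = Record<RequiredVar, string>;

// Validate an environment map and return a typed config object.
// Throws a single error listing every missing variable, so a bad deploy
// fails at boot instead of on the first request that needs the value.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
  if (missing.length > 0) {
    // Report names only; never echo secret values into logs.
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  const config = {} as AppConfig;
  for (const name of REQUIRED_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

In a Node service you would typically call `loadConfig(process.env)` as the first statement of the entry point and leave the error uncaught, so the process exits non-zero and the platform marks the deploy as failed.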

CI/CD Pipeline

  • A pipeline config (GitHub Actions, GitLab CI, CircleCI, etc.) that runs on every push to main
  • Pipeline steps: lint → test → build → deploy to staging → (on tag or manual approval) deploy to production
  • The pipeline is the only way code reaches production — no manual git push deploys

Example GitHub Actions excerpt showing the shape of your pipeline:

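# .github/workflows/deploy.yml
name: deploy

on:
  push:
    branches: [main]  # every push to main runs the pipeline and deploys to staging
    tags: ['v*']      # pushing a version tag runs it again and deploys to production

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install
        run: npm ci
      - name: Lint
        run: npm run lint
      - name: Test
        run: npm test
      - name: Build
        run: npm run build
      # The Railway CLI is published on npm as @railway/cli (its binary is named railway).
      # Plain `npx railway` only works if @railway/cli is already a devDependency.
      - name: Deploy to staging
        if: github.ref == 'refs/heads/main'
        run: npx @railway/cli up --environment staging
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
      - name: Deploy to production
        if: startsWith(github.ref, 'refs/tags/v')
        run: npx @railway/cli up --environment production
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}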

Logging

  • Structured JSON logs (not console.log("something happened")) on every significant event: request in, request out, errors, slow queries
  • Every log line includes a requestId so you can trace a single request through the system
  • Logs are queryable in your platform's log viewer — demonstrate a search that surfaces a specific request by ID
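
As one possible shape for those log lines, here is a small sketch of a request-scoped JSON logger. The field names (`timestamp`, `level`, `event`) and the helpers `formatLogLine` and `createRequestLogger` are assumptions made for illustration; your logging library or platform may expect different keys.

```typescript
type LogLevel = "info" | "warn" | "error";
type LogFields = Record<string, string | number | boolean | null>;

// Build one JSON log line. `requestId` is mandatory so that every line
// can be traced back to the request that produced it.
function formatLogLine(
  level: LogLevel,
  event: string,
  requestId: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  // Caller fields are spread first so they cannot overwrite the core keys.
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    event,
    requestId,
  });
}

// Create a logger bound to a single request. `write` is the sink; in a
// real service it would usually be console.log so the platform collects stdout.
function createRequestLogger(requestId: string, write: (line: string) => void) {
  const log = (level: LogLevel) => (event: string, fields?: LogFields) =>
    write(formatLogLine(level, event, requestId, fields));
  return { info: log("info"), warn: log("warn"), error: log("error") };
}
```

A request handler would create one logger per incoming request, for example `createRequestLogger(id, console.log)`, and call `logger.info("request.in", { method: "GET", path: "/health" })`. Searching the log viewer for that `requestId` then returns every line the request produced.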

Metrics and Error Tracking

  • At least one application metric instrumented and visible in a dashboard: request rate, error rate, or p95 latency — pick the one that matters most for your service
  • Error tracking integrated (Sentry is the obvious choice) — unhandled exceptions flow into it automatically with stack traces and context
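
To make "request rate" and "error rate" concrete, here is a sketch of how the two numbers are derived from raw request outcomes. The `RequestMetrics` class is hypothetical, and it assumes that only 5xx responses count as errors. In practice you would let a metrics library or your platform do this bookkeeping and just emit the samples.

```typescript
interface RequestSample {
  timestampMs: number; // when the request finished
  status: number; // HTTP status code returned
}

// Keeps recent request outcomes in memory and reports request rate and
// error rate over a sliding window.
class RequestMetrics {
  private samples: RequestSample[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs;
  }

  record(status: number, timestampMs: number = Date.now()): void {
    this.samples.push({ timestampMs, status });
  }

  snapshot(nowMs: number = Date.now()): { requestsPerSecond: number; errorRate: number } {
    // Drop samples that have aged out of the window.
    this.samples = this.samples.filter((s) => nowMs - s.timestampMs < this.windowMs);
    const total = this.samples.length;
    // Assumption: 5xx responses are errors; 4xx are treated as client mistakes.
    const errors = this.samples.filter((s) => s.status >= 500).length;
    return {
      requestsPerSecond: total / (this.windowMs / 1000),
      errorRate: total === 0 ? 0 : errors / total,
    };
  }
}
```

Call `record(status)` once per finished request and export `snapshot()` to your dashboard on an interval. The error rate it reports is the same quantity an alert threshold such as "error rate > 1%" would be evaluated against.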

Alerting

  • One meaningful alert configured: fires when your chosen metric crosses a threshold that indicates real user pain (e.g., error rate > 1% for 5 minutes, or p95 latency > 2 seconds)
  • The alert routes to a real destination (email, Slack, or PagerDuty) and you have seen a notification arrive there; "configured but untested" does not count
  • You have manually verified the alert fires: either trigger a real failure or use a test alert
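
The "for 5 minutes" part of a rule like "error rate > 1% for 5 minutes" is what keeps a single bad minute from paging you. The sketch below shows that logic under a simplifying assumption: one error-rate data point per minute, oldest first. The function name `shouldFireAlert` is hypothetical; real alerting tools evaluate this for you once you configure the rule.

```typescript
// errorRatePerMinute: one value per minute (0..1), oldest first.
function shouldFireAlert(
  errorRatePerMinute: readonly number[],
  threshold: number = 0.01,
  sustainedMinutes: number = 5,
): boolean {
  // Not enough data yet to say the condition has been sustained.
  if (errorRatePerMinute.length < sustainedMinutes) return false;
  const recent = errorRatePerMinute.slice(-sustainedMinutes);
  // Fire only if every one of the most recent minutes breached the threshold;
  // a single bad minute is treated as noise.
  return recent.every((rate) => rate > threshold);
}
```

With the defaults, five consecutive minutes above 1% fire the alert, while a one-minute spike to 20% followed by recovery does not. When you pick your own threshold and duration, run a few realistic series through this kind of check to see whether it would have paged you at the right moments.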

Performance

  • A load test against the critical path (the endpoint real users hit most) using k6, Locust, or Artillery
  • Results document: what RPS you tested, what latency looked like at P50/P95/P99, where the bottleneck appeared
  • You have fixed at least one thing based on what the load test revealed (or documented why nothing needed fixing)
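
If your load-testing tool gives you raw latency samples, the numbers for the results document can be derived as sketched below. This uses the nearest-rank percentile definition, which is one of several common definitions, so expect small differences from what k6, Locust, or Artillery print. The names `percentile` and `summarize` are illustrative.

```typescript
interface LoadTestSummary {
  requests: number;
  rps: number;
  p50: number;
  p95: number;
  p99: number;
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. `sorted` must be ascending.
function percentile(sorted: readonly number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[Math.max(rank, 1) - 1];
  if (value === undefined) throw new Error("percentile needs at least one sample");
  return value;
}

// Summarize latency samples (in milliseconds) from a run of known duration.
function summarize(latenciesMs: readonly number[], durationSeconds: number): LoadTestSummary {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    requests: sorted.length,
    rps: sorted.length / durationSeconds,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

The gap between P50 and P99 is usually the interesting part: a healthy median with a P99 several times larger points at a bottleneck that only some requests hit, such as a slow query or a cold cache.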

Runbook and Rollback

  • A RUNBOOK.md in the repo with at minimum: how to deploy, how to roll back, how to check health, how to read the logs, who owns the service
  • A rollback procedure you have actually tested — deploy a version, break something, roll back, confirm the old version is live
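
Choosing what to roll back to is easy to get wrong under pressure, because sorting tags as plain text puts `v1.10.0` before `v1.9.3`. The sketch below picks the newest release older than the current one by comparing version numbers. It assumes your tags look like `v1.2.3` and that the previous release was known-good; both are assumptions, and `rollbackTarget` is a hypothetical helper, not part of any CI tool.

```typescript
type Version = [number, number, number];

// Parse tags shaped like "v1.2.3"; anything else (for example "nightly") is ignored.
function parseTag(tag: string): Version | null {
  const match = /^v(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  if (match === null) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative if a is older than b, positive if newer, zero if equal.
function compareVersions(a: Version, b: Version): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Return the newest tag that is strictly older than `current`, or null if none exists.
function rollbackTarget(tags: readonly string[], current: string): string | null {
  const currentVersion = parseTag(current);
  if (currentVersion === null) return null;
  let best: { tag: string; version: Version } | null = null;
  for (const tag of tags) {
    const version = parseTag(tag);
    if (version === null || compareVersions(version, currentVersion) >= 0) continue;
    if (best === null || compareVersions(version, best.version) > 0) {
      best = { tag, version };
    }
  }
  return best === null ? null : best.tag;
}
```

Whatever method you use, write the chosen tag into the incident notes before you redeploy it, so the rollback you tested and the rollback you documented are the same one.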

Maintenance

  • Dependabot or Renovate configured to open PRs for dependency updates
  • A backup of any persistent data (database dump, object storage export) created and — critically — restored to a staging environment to verify it works
  • A brief disaster recovery note in the runbook: what breaks if the host goes down, what the recovery steps are, what the recovery time target is
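
One simple way to back up the claim that a restore worked is to compare per-table row counts between the source database and the restored staging copy. The sketch below assumes you have already collected those counts (for example with `SELECT COUNT(*)` per table); `compareRestore` and the table names in the usage note are hypothetical. Matching counts are a coarse sanity check, not proof that every row is intact.

```typescript
type TableCounts = Record<string, number>;

// Compare row counts captured at backup time with counts from the restored copy.
// Returns one human-readable line per problem; an empty array means the counts match.
function compareRestore(source: TableCounts, restored: TableCounts): string[] {
  const mismatches: string[] = [];
  for (const table of Object.keys(source)) {
    const expected = source[table];
    const actual = restored[table];
    if (expected === undefined) continue;
    if (actual === undefined) {
      mismatches.push(`${table}: missing after restore`);
    } else if (actual !== expected) {
      mismatches.push(`${table}: expected ${expected} rows, found ${actual}`);
    }
  }
  return mismatches;
}
```

For example, comparing `{ users: 120, orders: 450 }` against `{ users: 120, orders: 449 }` reports one mismatch for `orders`. Capture the source counts at the moment the backup is taken, otherwise normal writes between backup and comparison will show up as false mismatches.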

🗺️ Run it through B.U.I.L.D.

The B.U.I.L.D. framework shows up in every Stage 3 capstone, and this one leans hardest on D — Document, test, Deploy, and operate. But every letter is present:

  • B — Break it down: map the 26 checkboxes above into a personal sprint. Don't try to do everything in one sitting. Environments and CI/CD unlock everything else, so start there.
  • U — Understand the user: your user here is also the on-call engineer at 3 a.m. Write logs and runbook entries for that person. (Often, that person is future you.)
  • I — Integrate: the pieces from earlier tracks (front end from D1/D2, backend from D3/D4, data from D5) all show up here as the thing being deployed. Stage 3 is cumulative.
  • L — Launch: a real deploy to a real URL. "It works on my machine" is not a launch.
  • D — Document, test, Deploy, and operate: the runbook, the load test, the backup restore, the alert — this is the whole point of D6. The app is already built; now prove you can run it.

🧪 Deliverables

Submit or present all of the following:

  1. Live URL — a publicly accessible production URL that returns a valid response. Link it in your README.md.
  2. CI/CD config — the pipeline file committed to the repo, with a passing pipeline run visible in your CI dashboard (screenshot or link).
  3. Logging evidence — a screenshot or export of a log query in your platform's viewer, showing at least one request traced by requestId end-to-end.
  4. Metric dashboard — a screenshot of your metric (request rate, error rate, or latency) in whatever dashboard you wired up (Grafana, Datadog, Railway metrics, Render metrics, etc.).
  5. Error tracking — a screenshot of at least one captured error in Sentry (or equivalent), showing a stack trace.
  6. Alert evidence — a screenshot of the alert rule configured, plus evidence it fired (email, Slack message, or test-alert confirmation).
  7. Load test result — the k6/Locust/Artillery output summarized: RPS tested, P50/P95/P99 latencies, pass/fail verdict, and one finding you acted on.
  8. RUNBOOK.md — in the repo, covering: deploy procedure, rollback procedure, health checks, log navigation, contacts/ownership.
  9. Backup restore evidence — a brief written record (or screenshot) confirming you restored a backup to staging and the data was intact.

🏆 Stretch goals

You've already done the required work. These push further:

  • Containerize with Docker — write a Dockerfile, build the image locally, push to a registry, and update your deploy to use the container image. Confirm the same image runs in staging and production.
  • Infrastructure as Code — define your cloud resources (host, database, secrets, DNS) in Terraform or Pulumi. Check the IaC into the repo so the environment can be recreated from scratch.
  • Distributed tracing — add OpenTelemetry instrumentation so a single user request produces a trace that spans your API, any downstream services, and your database. View a trace in Jaeger or your platform's trace explorer.
  • SLO + error budget — define a formal Service Level Objective (e.g., "99.5% of requests succeed in under 1 second over a rolling 28-day window"). Set up a burn-rate alert and document how you would respond when the error budget is 50% consumed.
  • Zero-downtime deploys — configure blue-green or rolling deployments so a deploy never produces a visible error spike for users. Run a load test during a deploy to verify.
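
For the SLO stretch goal, the error-budget arithmetic is worth seeing once in code. The sketch below uses the common definition of burn rate as the observed error rate divided by the error rate the SLO allows, and it assumes traffic is roughly steady across the window. `errorBudgetStatus` is an illustrative helper. For an SLO like the example above, a "failed" request is any request that errored or took longer than 1 second.

```typescript
interface SloStatus {
  allowedErrorRate: number; // fraction of requests the SLO permits to fail
  observedErrorRate: number; // fraction that actually failed so far
  burnRate: number; // 1 means the budget lasts exactly one window
  budgetConsumed: number; // fraction of the window's budget already spent
}

// sloTarget: the objective as a fraction (0.995 for 99.5%).
// elapsedFraction: how much of the rolling window has passed (0..1).
function errorBudgetStatus(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number,
  elapsedFraction: number,
): SloStatus {
  const allowedErrorRate = 1 - sloTarget;
  const observedErrorRate = totalRequests === 0 ? 0 : failedRequests / totalRequests;
  const burnRate = observedErrorRate / allowedErrorRate;
  return {
    allowedErrorRate,
    observedErrorRate,
    burnRate,
    // Assumes steady traffic: burning at rate r for a fraction f of the
    // window spends r * f of the budget.
    budgetConsumed: burnRate * elapsedFraction,
  };
}
```

As a worked case: 7 days into a 28-day window with a 99.5% target, 10,000 failures out of 1,000,000 requests is a 1% error rate against an allowed 0.5%. That is a burn rate of about 2, which means about half the budget is gone a quarter of the way through the window, and the whole budget would be spent by day 14 if nothing changes.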

A runbook skeleton to get you started — expand every section with your actual service's specifics:

# RUNBOOK: [Service Name]

## Owner
- Primary: [your name / GitHub handle]
- Escalation: [teammate or "solo project"]

## Service overview
[One paragraph: what this service does, what breaks if it's down]

## Deploy procedure
1. Merge to `main` — pipeline runs automatically.
2. After staging passes, push a version tag: `git tag v1.2.3 && git push origin v1.2.3`
3. Pipeline deploys to production. Verify at [production URL].

## Rollback procedure
1. Identify the last known-good tag: `git tag --sort=-creatordate | head -5`
2. Redeploy that tag via CI: re-run the production deploy job with the previous tag.
3. Verify rollback: check the production URL and error tracking for 5 minutes.
4. Target rollback time: < 10 minutes from decision to live.

## Health checks
- HTTP: `GET [production URL]/health` → 200 OK within 2 seconds
- Metric: error rate < 0.5% (check [dashboard link])
- Logs: `requestId` search in [log platform link]

## Alerts
- [Alert name]: fires when [condition]. Action: [first thing to check].

## Backup / restore
- Backup schedule: [daily / weekly / manual]
- Backup location: [storage bucket / export path]
- Restore tested: [date] — [outcome]
- Restore procedure: [steps]

## Disaster recovery
- If host goes down: [steps to redeploy from IaC / platform dashboard]
- RTO target: [e.g., 2 hours]
- RPO target: [e.g., 24 hours / last backup]
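
The health-check thresholds in the skeleton above can also be written down as a small function, which makes them easy to reuse in a post-deploy smoke test. This is a sketch under stated assumptions: `evaluateHealth` is a hypothetical helper, and it expects you to supply the probe result (status code and response time from a request to `/health`, plus the current error rate from your dashboard) rather than fetching anything itself.

```typescript
interface HealthProbe {
  status: number; // HTTP status returned by GET /health
  responseTimeMs: number; // how long the request took
  errorRate: number; // current error rate as a fraction (0..1)
}

interface HealthLimits {
  maxResponseTimeMs: number;
  maxErrorRate: number;
}

interface HealthVerdict {
  healthy: boolean;
  problems: string[];
}

// Apply the runbook thresholds: 200 OK within 2 seconds, error rate below 0.5%.
function evaluateHealth(
  probe: HealthProbe,
  limits: HealthLimits = { maxResponseTimeMs: 2000, maxErrorRate: 0.005 },
): HealthVerdict {
  const problems: string[] = [];
  if (probe.status !== 200) {
    problems.push(`expected HTTP 200, got ${probe.status}`);
  }
  if (probe.responseTimeMs > limits.maxResponseTimeMs) {
    problems.push(`responded in ${probe.responseTimeMs}ms, limit is ${limits.maxResponseTimeMs}ms`);
  }
  if (probe.errorRate >= limits.maxErrorRate) {
    problems.push(`error rate ${probe.errorRate} is not below ${limits.maxErrorRate}`);
  }
  return { healthy: problems.length === 0, problems };
}
```

Returning the list of problems rather than a bare boolean gives whoever is following the runbook a first clue about where to look.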

✅ You're done when…

  • Production-Readiness Checklist complete — every item in the "What to do" section above has been checked off and you can point to evidence for each one
  • Pre-Ship Checklist passed — secrets are not in source code, the pipeline is green, the production URL is live, at least one health check returns 200
  • You can roll back in under 10 minutes — you have done this at least once, with structured logging, error tracking, and your alert all live and verified during the rollback
  • You have actually restored a backup and run a load test — not just configured them, not "it should work" — you have done the thing and you have the evidence

You made it through Stage 3.

Six tracks. Dozens of lessons. A real codebase, a real database, a real API, a real frontend, real data pipelines, and now a real production service that you operate like an engineer. You came in knowing how to vibe code a front end. You are leaving with the full stack — not just the ability to build it, but the discipline to ship it safely and keep it running.

That is not a small thing. Most people who can build software cannot operate it. You now can.


➡️ Next: You've finished all six engineering tracks. The Grand Capstone awaits — one full-stack production app that proves you're a developer. Build It Right, Or Don't Build It At All. 🏛️
