Capstone: Localhost → Monitored Production Service
Stage 3 · DevOps, Deployment & Operations · Capstone
Your laptop is not a data center. At some point — and that point is now — you have to take the thing you built and hand it over to the internet, 24 hours a day, with real users depending on it. This capstone is that moment.
🎯 The mission
Take a working application — something you built in a previous track, a side project, or a fresh service you assemble for this capstone — and carry it from localhost:3000 to a live, deployed, monitored, production service that you can operate with confidence. You are not just deploying once and walking away. You are wiring up the full operational loop: secrets managed safely, a pipeline that ships code automatically on green, structured logs you can actually search, a metric that tells you if the service is healthy, an alert that pages you when it isn't, a load test you've actually run, and a runbook a stranger could follow at 2 a.m. when you're asleep.
By the time you're done, you will have proven — to yourself and to anyone who reads your repo — that you ship software responsibly.
🧱 What to do
Work through each layer systematically. Every checkbox maps directly to one or more lessons in D6.
Environments
- Separate `development`, `staging`, and `production` environments with distinct configs: no shared databases, no shared secrets
- Document what is different between environments and why in a short `ENVIRONMENTS.md`
Deployment
- Deploy the app to a real host (Railway, Render, Fly.io, Vercel, AWS, GCP, Azure — your choice)
- Confirm the production URL is publicly reachable and returns a valid response
Config and Secrets
- Every secret (API keys, DB connection strings, auth tokens) lives in the platform's secret store, with zero secrets in source code or committed `.env` files
- Non-secret config (feature flags, timeouts, region) lives in environment variables that differ per environment
- Add a startup check that fails fast and loudly if a required env var is missing
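One way the fail-fast startup check could look is sketched below. The variable names (`DATABASE_URL`, `SESSION_SECRET`, `SENTRY_DSN`) and the `loadConfig` helper are hypothetical placeholders, not something the course prescribes; swap in whatever your service actually requires.

```typescript
// Hypothetical variable names; replace with the ones your service needs.
const REQUIRED_ENV_VARS = ["DATABASE_URL", "SESSION_SECRET", "SENTRY_DSN"] as const;

type RequiredVar = (typeof REQUIRED_ENV_VARS)[number];
type AppConfig = Record<RequiredVar, string>;

// Takes the environment as a parameter so the check can be exercised
// without touching the real process environment.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });

  if (missing.length > 0) {
    // Report every missing name at once, and never print secret values.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const config = {} as AppConfig;
  for (const name of REQUIRED_ENV_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

In a Node service you would likely call `loadConfig(process.env)` at the very top of the entry point, before opening any port, so a misconfigured deploy crashes immediately instead of failing on the first request.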
CI/CD Pipeline
- A pipeline config (GitHub Actions, GitLab CI, CircleCI, etc.) that runs on every push to `main`
- Pipeline steps: lint → test → build → deploy to staging → (on tag or manual approval) deploy to production
- The pipeline is the only way code reaches production: no manual `git push` deploys
Example GitHub Actions excerpt showing the shape of your pipeline:
# .github/workflows/deploy.yml
name: deploy

# Run on pushes to main (staging deploy) and on version tags (production deploy).
on:
  push:
    branches: [main]
    tags: ["v*"]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install
        run: npm ci
      - name: Lint
        run: npm run lint
      - name: Test
        run: npm test
      - name: Build
        run: npm run build
      - name: Deploy to staging
        if: github.ref == 'refs/heads/main'
        # The Railway CLI is published on npm as @railway/cli.
        run: npx @railway/cli up --environment staging
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
      - name: Deploy to production
        if: startsWith(github.ref, 'refs/tags/v')
        run: npx @railway/cli up --environment production
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
Logging
- Structured JSON logs (not `console.log("something happened")`) on every significant event: request in, request out, errors, slow queries
- Every log line includes a `requestId` so you can trace a single request through the system
- Logs are queryable in your platform's log viewer; demonstrate a search that surfaces a specific request by ID
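The logging requirements above can be sketched without any library. The helper names (`formatLog`, `createRequestLogger`) and field names other than `requestId` are illustrative assumptions; a production service might well use an established logger instead.

```typescript
type LogLevel = "info" | "warn" | "error";

// Builds one JSON log line. The fixed keys are written last so that
// caller-supplied fields cannot overwrite timestamp, level, or event.
function formatLog(
  level: LogLevel,
  event: string,
  fields: Record<string, unknown>,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    event,
  });
}

// Binds the requestId once per request so no call site can forget it.
// `write` defaults to stdout, which most platforms collect automatically.
function createRequestLogger(
  requestId: string,
  write: (line: string) => void = console.log,
) {
  return {
    info: (event: string, fields: Record<string, unknown> = {}) =>
      write(formatLog("info", event, { ...fields, requestId })),
    error: (event: string, fields: Record<string, unknown> = {}) =>
      write(formatLog("error", event, { ...fields, requestId })),
  };
}
```

A request handler would create one logger per incoming request (for example with an ID taken from a request header or generated on arrival) and pass it down, so every line for that request can be found with a single `requestId` search.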
Metrics and Error Tracking
- At least one application metric instrumented and visible in a dashboard: request rate, error rate, or p95 latency — pick the one that matters most for your service
- Error tracking integrated (Sentry is the obvious choice) — unhandled exceptions flow into it automatically with stack traces and context
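To make "request rate" and "error rate" concrete, here is a minimal in-memory sketch. The `ErrorRateCounter` class is hypothetical, and treating only 5xx responses as errors is an assumption; a real service would export these numbers to whatever metrics backend its dashboard reads from.

```typescript
// A minimal in-memory counter for one scrape interval.
class ErrorRateCounter {
  private total = 0;
  private errors = 0;

  // Assumption: only 5xx responses count as errors; 4xx responses are
  // treated as client mistakes rather than service failures.
  record(statusCode: number): void {
    this.total += 1;
    if (statusCode >= 500) {
      this.errors += 1;
    }
  }

  // Returns the figures for the elapsed window and resets the counters,
  // so each snapshot covers exactly one interval.
  snapshot(windowSeconds: number): { requestsPerSecond: number; errorRate: number } {
    const result = {
      requestsPerSecond: this.total / windowSeconds,
      errorRate: this.total === 0 ? 0 : this.errors / this.total,
    };
    this.total = 0;
    this.errors = 0;
    return result;
  }
}
```

Calling `record(res.statusCode)` once per finished response and `snapshot(60)` once a minute would yield one data point per minute for the dashboard.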
Alerting
- One meaningful alert configured: fires when your chosen metric crosses a threshold that indicates real user pain (e.g., error rate > 1% for 5 minutes, or p95 latency > 2 seconds)
- Alert routes to a real destination — email, Slack, PagerDuty — not just "configured but untested"
- You have manually verified the alert fires: either trigger a real failure or use a test alert
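Most platforms evaluate alert rules for you, but it helps to see what "error rate > 1% for 5 minutes" means as logic. The sketch below assumes one metric sample per minute; `AlertRule` and `shouldFire` are illustrative names, not any alerting product's API.

```typescript
interface AlertRule {
  threshold: number;  // e.g. 0.01 for an error rate of 1%
  forSamples: number; // consecutive breaching samples required, e.g. 5 one-minute samples
}

// Fires only when the most recent `forSamples` samples ALL exceed the
// threshold, so a single bad minute does not page anyone.
function shouldFire(samples: number[], rule: AlertRule): boolean {
  if (rule.forSamples < 1) {
    throw new Error("forSamples must be at least 1");
  }
  if (samples.length < rule.forSamples) {
    return false; // not enough history yet
  }
  const recent = samples.slice(-rule.forSamples);
  return recent.every((value) => value > rule.threshold);
}
```

The "for" duration is what separates real user pain from a momentary blip: a sample that dips back under the threshold resets the clock.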
Performance
- A load test against the critical path (the endpoint real users hit most) using k6, Locust, or Artillery
- Results document: what RPS you tested, what latency looked like at P50/P95/P99, where the bottleneck appeared
- You have fixed at least one thing based on what the load test revealed (or documented why nothing needed fixing)
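Load testing tools report percentiles for you, but if you export raw request durations, a results summary can be computed along these lines. This sketch uses the nearest-rank percentile definition, which may differ slightly from the interpolation your tool uses; `summarize` and the p95 budget parameter are assumptions for illustration.

```typescript
// Nearest-rank percentile: the smallest sample with at least p percent
// of all samples at or below it. Expects an ascending-sorted array.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) {
    throw new Error("no samples");
  }
  const rank = Math.ceil((p * sortedMs.length) / 100);
  return sortedMs[Math.max(rank, 1) - 1] as number;
}

interface LoadTestSummary {
  rps: number;
  p50: number;
  p95: number;
  p99: number;
  passed: boolean;
}

// Summarizes one run: throughput, latency percentiles, and a pass/fail
// verdict against a p95 latency budget chosen by you.
function summarize(durationsMs: number[], testSeconds: number, p95BudgetMs: number): LoadTestSummary {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const p95 = percentile(sorted, 95);
  return {
    rps: durationsMs.length / testSeconds,
    p50: percentile(sorted, 50),
    p95,
    p99: percentile(sorted, 99),
    passed: p95 <= p95BudgetMs,
  };
}
```

For example, `summarize(durations, 60, 2000)` would apply the same 2-second p95 limit used in the alerting example above as the pass/fail line.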
Runbook and Rollback
- A `RUNBOOK.md` in the repo with at minimum: how to deploy, how to roll back, how to check health, how to read the logs, who owns the service
- A rollback procedure you have actually tested: deploy a version, break something, roll back, confirm the old version is live
Maintenance
- Dependabot or Renovate configured to open PRs for dependency updates
- A backup of any persistent data (database dump, object storage export) created and — critically — restored to a staging environment to verify it works
- A brief disaster recovery note in the runbook: what breaks if the host goes down, what the recovery steps are, what the recovery time target is
🗺️ Run it through B.U.I.L.D.
The B.U.I.L.D. framework shows up in every Stage 3 capstone, and this one leans hardest on D (Document, test, Deploy, and operate). But every letter is present:
- B (Break it down): map the checklist above into a personal sprint. Don't try to do everything in one sitting. Environments and CI/CD unlock everything else, so start there.
- U (Understand the user): your user here is also the on-call engineer at 3 a.m. Write logs and runbook entries for that person. (Often, that person is future you.)
- I (Integrate): the pieces from earlier tracks (front end from D1/D2, backend from D3/D4, data from D5) all show up here as the thing being deployed. Stage 3 is cumulative.
- L (Launch): a real deploy to a real URL. "It works on my machine" is not a launch.
- D (Document, test, Deploy, and operate): the runbook, the load test, the backup restore, the alert. This is the whole point of D6. The app is already built; now prove you can run it.
🧪 Deliverables
Submit or present all of the following:
- Live URL: a publicly accessible production URL that returns a valid response. Link it in your `README.md`.
- CI/CD config: the pipeline file committed to the repo, with a passing pipeline run visible in your CI dashboard (screenshot or link).
- Logging evidence: a screenshot or export of a log query in your platform's viewer, showing at least one request traced by `requestId` end-to-end.
- Metric dashboard: a screenshot of your metric (request rate, error rate, or latency) in whatever dashboard you wired up (Grafana, Datadog, Railway metrics, Render metrics, etc.).
- Error tracking: a screenshot of at least one captured error in Sentry (or equivalent), showing a stack trace.
- Alert evidence: a screenshot of the alert rule configured, plus evidence it fired (email, Slack message, or test-alert confirmation).
- Load test result: the k6/Locust/Artillery output summarized: RPS tested, P50/P95/P99 latencies, pass/fail verdict, and one finding you acted on.
- RUNBOOK.md: in the repo, covering deploy procedure, rollback procedure, health checks, log navigation, contacts/ownership.
- Backup restore evidence: a brief written record (or screenshot) confirming you restored a backup to staging and the data was intact.
🏆 Stretch goals
You've already done the required work. These push further:
- Containerize with Docker: write a `Dockerfile`, build the image locally, push to a registry, and update your deploy to use the container image. Confirm the same image runs in staging and production.
- Infrastructure as Code: define your cloud resources (host, database, secrets, DNS) in Terraform or Pulumi. Check the IaC into the repo so the environment can be recreated from scratch.
- Distributed tracing: add OpenTelemetry instrumentation so a single user request produces a trace that spans your API, any downstream services, and your database. View a trace in Jaeger or your platform's trace explorer.
- SLO + error budget: define a formal Service Level Objective (e.g., "99.5% of requests succeed in under 1 second over a rolling 28-day window"). Set up a burn-rate alert and document how you would respond when the error budget is 50% consumed.
- Zero-downtime deploys: configure blue-green or rolling deployments so a deploy never produces a visible error spike for users. Run a load test during a deploy to verify.
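For the SLO stretch goal, the arithmetic behind "error budget" and "burn rate" is small enough to sketch. The function names below are illustrative, and a "bad" request is assumed to mean one that fails or exceeds the latency limit in your SLO definition.

```typescript
// Burn rate: how fast the budget is being spent relative to the pace
// that would use it up exactly at the end of the SLO window.
// 1 means on pace; 2 means the budget is gone in half the window.
function burnRate(observedBadRate: number, sloTarget: number): number {
  return observedBadRate / (1 - sloTarget);
}

// Fraction of the error budget already used, given request counts
// over the rolling SLO window (0.5 means 50% consumed).
function budgetConsumed(badRequests: number, totalRequests: number, sloTarget: number): number {
  const allowedBad = totalRequests * (1 - sloTarget);
  return allowedBad === 0 ? 0 : badRequests / allowedBad;
}
```

With a 99.5% target, 1,000,000 requests in the window allow roughly 5,000 bad ones, so 2,500 bad requests means about half the budget is spent. A sustained bad-request rate of 1% corresponds to a burn rate of about 2.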
A runbook skeleton to get you started — expand every section with your actual service's specifics:
# RUNBOOK: [Service Name]
## Owner
- Primary: [your name / GitHub handle]
- Escalation: [teammate or "solo project"]
## Service overview
[One paragraph: what this service does, what breaks if it's down]
## Deploy procedure
1. Merge to `main` — pipeline runs automatically.
2. After staging passes, push a version tag: `git tag v1.2.3 && git push --tags`
3. Pipeline deploys to production. Verify at [production URL].
## Rollback procedure
1. Identify the last known-good tag: `git tag --sort=-creatordate | head -5`
2. Redeploy that tag via CI: re-run the production deploy job with the previous tag.
3. Verify rollback: check the production URL and error tracking for 5 minutes.
4. Target rollback time: < 10 minutes from decision to live.
## Health checks
- HTTP: `GET [production URL]/health` → 200 OK within 2 seconds
- Metric: error rate < 0.5% (check [dashboard link])
- Logs: `requestId` search in [log platform link]
## Alerts
- [Alert name]: fires when [condition]. Action: [first thing to check].
## Backup / restore
- Backup schedule: [daily / weekly / manual]
- Backup location: [storage bucket / export path]
- Restore tested: [date] — [outcome]
- Restore procedure: [steps]
## Disaster recovery
- If host goes down: [steps to redeploy from IaC / platform dashboard]
- RTO target: [e.g., 2 hours]
- RPO target: [e.g., 24 hours / last backup]
✅ You're done when…
- Production-Readiness Checklist complete — every item in the "What to do" section above has been checked off and you can point to evidence for each one
- Pre-Ship Checklist passed — secrets are not in source code, the pipeline is green, the production URL is live, at least one health check returns 200
- You can roll back in under 10 minutes — you have done this at least once, with structured logging, error tracking, and your alert all live and verified during the rollback
- You have actually restored a backup and run a load test — not just configured them, not "it should work" — you have done the thing and you have the evidence
You made it through Stage 3.
Six tracks. Dozens of lessons. A real codebase, a real database, a real API, a real frontend, real data pipelines, and now a real production service that you operate like an engineer. You came in knowing how to vibe code a front end. You are leaving with the full stack — not just the ability to build it, but the discipline to ship it safely and keep it running.
That is not a small thing. Most people who can build software cannot operate it. You now can.
➡️ Next: You've finished all six engineering tracks. The Grand Capstone awaits — one full-stack production app that proves you're a developer. Build It Right, Or Don't Build It At All. 🏛️