The Owens Vibe Coding & Development Institute

Production-Readiness Checklist

The big one: is this actually ready for real users and real scale? Every gate before launch.

This is the final gate. Before you hand an application to real people — people who trust it with their data, their time, and sometimes their safety — you owe it to them to verify that it is correct, secure, observable, and operable. This checklist covers the full arc of Stage 3: from your first meaningful test to the moment you flip the switch to production. Work through every section. If you cannot check a box, that is not a bureaucratic inconvenience — it is a known risk you are choosing to carry. Know what you are carrying.

No checklist replaces judgment. Use it as a forcing function to have the right conversations before users find the problems for you.


1. Correctness & Tests

  • The critical path (the one flow that breaks the app if it fails) is covered by at least one automated test
  • Unit tests cover non-trivial business logic with meaningful assertions — not just "it returns something"
  • Integration tests exercise the real database engine (a disposable instance is fine), or an in-memory substitute only where its behavior is known to match production
  • At least one end-to-end test walks the most important user journey from the UI through to persistence
  • All tests are deterministic — no flaky tests are tolerated or suppressed
  • CI pipeline runs the full test suite on every push and blocks merges on failure
  • Test coverage is measured; known gaps are documented and accepted consciously, not accidentally
  • A manual smoke test checklist exists and was run against the latest build — by a human, not an assumption
  • Edge cases that have already caused bugs are covered by regression tests
  • Tests run in an environment that is meaningfully similar to production (same database engine, same environment-variable structure)
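
As one illustration of the items on meaningful assertions and determinism, here is a minimal sketch. The function trial_days_left and its 14-day rule are hypothetical and not taken from any real codebase. The point is that the current date is passed in rather than read from the system clock, so the test gives the same result on every run, and the assertions pin exact values, including the edge cases most likely to regress.

```python
from datetime import date, timedelta

TRIAL_LENGTH_DAYS = 14  # hypothetical business rule, for illustration only


def trial_days_left(signup: date, today: date) -> int:
    """Return the remaining trial days, never negative."""
    expires = signup + timedelta(days=TRIAL_LENGTH_DAYS)
    return max((expires - today).days, 0)


def test_trial_days_left() -> None:
    signup = date(2024, 1, 1)
    # Exact expected values, not just "it returns something".
    assert trial_days_left(signup, date(2024, 1, 1)) == 14
    assert trial_days_left(signup, date(2024, 1, 10)) == 5
    # Regression-style edge cases: the expiry day itself and long after expiry.
    assert trial_days_left(signup, date(2024, 1, 15)) == 0
    assert trial_days_left(signup, date(2024, 3, 1)) == 0


test_trial_days_left()
```

Because "today" is an argument, nothing in the test depends on when or where it runs, which is one common way to keep a suite free of flaky tests.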

2. Data

  • Every schema migration is written as an explicit, versioned migration file (not a manual ALTER TABLE)
  • Migrations are applied in order in CI and staging before they ever touch production
  • A rollback migration exists for every breaking schema change
  • Automated backups are configured and their retention period is documented
  • A restore from backup has been tested end-to-end — not just assumed to work
  • Multi-step writes that must succeed or fail together are wrapped in transactions
  • Existing data has been checked for orphaned rows and broken foreign-key relationships left behind by partial writes that ran outside a transaction
  • Indexes exist on every column used in a WHERE or JOIN in high-frequency queries
  • The query plan for the three most-executed queries has been inspected (e.g., EXPLAIN ANALYZE)
  • Soft deletes (deleted_at) or audit columns are in place wherever regulatory or product requirements demand them
  • PII fields are identified; their storage, access, and deletion paths comply with any applicable data regulations
  • Database connection pool limits are set so that pool size multiplied by the number of app instances stays within the platform's connection limit; behavior under connection exhaustion has been tested
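
The item on multi-step writes can be sketched as follows, using Python's built-in sqlite3 module so the example is self-contained. The accounts table and the transfer function are hypothetical; a production system would use its own engine and driver, but the shape is the same: both updates sit inside one transaction, so a failure in the second undoes the first.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> None:
    """Move `amount` between two accounts atomically."""
    # `with conn:` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        credited = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        if credited.rowcount != 1:
            raise LookupError(f"destination account {dst} not found")


def balance(conn: sqlite3.Connection, account_id: int) -> int:
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


# Hypothetical schema and seed data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

transfer(conn, 1, 2, 30)  # both updates succeed and are committed together

try:
    transfer(conn, 1, 999, 50)  # the debit runs, the credit matches no row
except LookupError:
    pass  # the debit was rolled back along with the failed credit
```

Without the transaction, the failed second transfer would have left account 1 debited with nothing credited anywhere, which is exactly the kind of partial write this section warns about.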

3. Security

The Security Audit Checklist is the deep dive. This section is the mandatory minimum.

  • No secrets (API keys, tokens, passwords, connection strings) exist anywhere in the source code or git history
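  • The git history has been scanned for secrets, e.g. with git log -S "sk-" --all for one known key prefix or with a dedicated scanner such as gitleaks, and the scan came back clean
  • Secrets are stored in the platform secret store (Vercel env vars, Supabase Vault, etc.), not in .env files committed to the repo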
  • Authentication is enforced on every route/endpoint that operates on user data
  • Authorization checks verify that the authenticated user owns or has permission to access the specific resource — not just that they are logged in
  • Row-Level Security (RLS) or equivalent database-level rules are enabled and tested with a user who should NOT have access
  • All user-supplied input is validated server-side before it touches the database or any downstream system
  • The application is not vulnerable to SQL injection (parameterized queries or ORM throughout — no string interpolation into queries)
  • The application is not vulnerable to XSS (output escaped; Content-Security-Policy header set)
  • HTTPS is enforced on all production traffic; HTTP redirects to HTTPS
  • HSTS header is set with an appropriate max-age
  • File uploads (if any) are validated for type and size server-side, stored outside the web root, and served through a proxy — never executed
  • Dependencies have been audited (npm audit, pip-audit, or equivalent); critical and high CVEs are resolved or accepted with documented justification
  • CORS policy is restrictive — not * in production unless explicitly justified
  • Rate limiting is applied to authentication endpoints and any endpoint that accepts user-submitted data
  • Error messages returned to clients do not leak stack traces, internal paths, or database schema details
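
Two of the items above, parameterized queries and per-resource authorization, often meet in the same line of SQL. The sketch below assumes a hypothetical notes table with an owner_id column and uses Python's built-in sqlite3 module purely for illustration. It is not a substitute for database-level rules such as RLS, only an example of the application-level check.

```python
import sqlite3
from typing import Optional


def get_note(conn: sqlite3.Connection, user_id: int, note_id: str) -> Optional[str]:
    """Return the note body only if `user_id` owns it; otherwise None."""
    row = conn.execute(
        # Placeholders keep user input out of the SQL text (no string interpolation).
        # The owner_id condition is the authorization check: being logged in
        # is not enough, the caller must own this specific row.
        "SELECT body FROM notes WHERE id = ? AND owner_id = ?",
        (note_id, user_id),
    ).fetchone()
    return row[0] if row else None


# Hypothetical schema and seed data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id TEXT PRIMARY KEY, owner_id INTEGER NOT NULL, body TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO notes (id, owner_id, body) VALUES (?, ?, ?)",
    [("n1", 10, "alice's note"), ("n2", 20, "bob's note")],
)
conn.commit()

own_note = get_note(conn, 10, "n1")                  # owner reads their own note
other_note = get_note(conn, 20, "n1")                # a different user gets nothing
injected = get_note(conn, 10, "n1' OR '1'='1")       # treated as a literal id, matches no row
```

Returning the same empty result for "does not exist" and "not yours" is a design choice in this sketch: it avoids telling a caller which ids are valid.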

4. API & Performance

  • Every API endpoint validates its input and returns a meaningful error (not a 500) for bad data
  • HTTP status codes are semantically correct (400 for malformed or invalid input, 401 for missing or invalid credentials, 403 for insufficient permissions, 404 for not found, 409 for conflicts, 429 for rate-limited requests, 500 for genuine server errors; never 200 with an error body)
  • All list endpoints are paginated — no endpoint returns an unbounded result set
  • Pagination parameters have upper bounds enforced server-side (e.g., limit capped at 100)
  • Rate limiting is configured on public and authenticated endpoints
  • The critical-path request has been load-tested at a realistic concurrent-user volume
  • Sustained load does not cause memory leaks (heap growth was monitored during the load test)
  • N+1 query problems have been identified and resolved on any endpoint that returns a list with nested data
  • Response times for the critical path are within the target SLA under expected peak load
  • Large payloads (images, files, bulk exports) are streamed or served via pre-signed URLs rather than buffered in the server process
  • Caching is applied where appropriate (CDN for static assets; query caching where staleness is acceptable) and cache invalidation is correct
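
The pagination items can be sketched as a small helper that enforces the bounds on the server regardless of what the client sends. The names clamp_page and paginate, and the default page size of 20, are assumptions made for this example; the cap of 100 follows the example in the list above.

```python
from typing import Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")

DEFAULT_LIMIT = 20   # assumed default page size
MAX_LIMIT = 100      # upper bound enforced server-side


def clamp_page(limit: Optional[int], offset: Optional[int]) -> Tuple[int, int]:
    """Turn client-supplied paging values into safe, bounded ones."""
    limit = DEFAULT_LIMIT if limit is None else limit
    offset = 0 if offset is None else offset
    return min(max(limit, 1), MAX_LIMIT), max(offset, 0)


def paginate(
    items: Sequence[T], limit: Optional[int] = None, offset: Optional[int] = None
) -> dict:
    """Return one bounded page plus the offset of the next page, if any."""
    limit, offset = clamp_page(limit, offset)
    # With a real database, these two values would go into LIMIT and OFFSET
    # (or a cursor) so that only one page is ever loaded.
    page = list(items[offset : offset + limit])
    next_offset = offset + limit if offset + limit < len(items) else None
    return {"items": page, "limit": limit, "offset": offset, "next_offset": next_offset}


rows = list(range(250))
greedy = paginate(rows, limit=10_000)           # client asks for everything
last = paginate(rows, limit=100, offset=200)    # final, shorter page
```

This sketch silently clamps out-of-range values. Rejecting them with a 400 is an equally reasonable choice; what matters is that the bound is applied on the server.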

5. Config & Deploy

  • All environment-specific config is read from environment variables — no hardcoded values for URLs, limits, or feature flags
  • The application validates required environment variables at startup and fails fast with a clear error if any are missing
  • There are separate environment configurations for development, staging, and production — staging mirrors production as closely as possible
  • Secrets in production are stored in the platform secret store; no human needs to copy-paste them to deploy
  • The deploy process is scripted or automated — a new developer can deploy without tribal knowledge
  • The deploy process is idempotent — running it twice does not break anything
  • A rollback procedure exists and has been tested: you know exactly what command or button to press to revert to the previous version
  • Zero-downtime deployments are in place (or a maintenance window is explicitly planned and communicated)
  • Database migrations run automatically as part of the deploy pipeline in the correct order
  • Feature flags or environment checks prevent staging-only features from appearing in production
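
The fail-fast item above can be sketched in a few lines. The variable names DATABASE_URL and SESSION_SECRET are hypothetical placeholders; substitute whatever the application actually requires. The error lists every missing name at once and never prints a value.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("DATABASE_URL", "SESSION_SECRET")  # hypothetical names


class ConfigError(RuntimeError):
    """Raised at startup when required configuration is absent."""


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, or raise naming every missing variable."""
    # Empty strings count as missing: an unset secret is often deployed as "".
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ConfigError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}


# Typical use: call once at process start, before serving any request.
#     config = load_config()
```

Passing the environment in as a mapping keeps the function easy to test and avoids mutating os.environ in the test suite.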

6. Observability

  • Structured logging is enabled — logs are JSON or key-value pairs, not unstructured strings
  • Every request is tagged with a unique request ID that flows through all log lines for that request
  • The application logs at the right levels: DEBUG for development noise, INFO for normal operations, WARN for recoverable anomalies, ERROR for failures that need attention
  • An error-tracking service (Sentry, Bugsnag, or equivalent) is wired up and receiving events from production
  • Unhandled exceptions and promise rejections are captured by the error tracker — not silently swallowed
  • At least one meaningful alert is configured: you are paged or notified when the error rate spikes or the service goes down
  • A key business metric (signups, conversions, transactions processed) is instrumented and visible on a dashboard
  • Logs are retained long enough to diagnose incidents (minimum 30 days; 90 days preferred)
  • The logging pipeline has been tested — you have confirmed that a known error actually shows up in the error tracker
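
The first two items, structured logs and a request ID on every line, can be sketched with the Python standard library alone. The logger name "app", the field names, and the handle_request function are assumptions for this example; most web frameworks and logging libraries offer an equivalent hook.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the current request's ID; "-" marks log lines emitted outside a request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger) -> str:
    """Hypothetical request handler: tag every log line with one request ID."""
    request_id = str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("request started")
        logger.warning("upstream slow, retrying")
    finally:
        request_id_var.reset(token)
    return request_id


stream = io.StringIO()  # stands in for stdout or a log shipper
rid = handle_request(build_logger(stream))
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Every parsed line carries the same request_id, so filtering on that one field reconstructs the full story of a single request.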

7. Operations

  • A runbook exists that describes: how to start/stop/restart the service; how to run a migration; how to roll back; and who to contact if something breaks
  • The rollback procedure is documented step-by-step and has been executed at least once in a non-production environment
  • Dependency updates are automated or regularly scheduled (Dependabot, Renovate, or a manual monthly review)
  • TLS certificates are monitored for expiry; renewal is automated or calendared with lead time
  • Custom domain DNS records are documented; their TTLs are understood
  • Platform quotas (database connections, serverless function invocations, storage) are known and monitored against current usage
  • A Disaster Recovery plan exists: the Recovery Point Objective (RPO — how much data can you lose?) and Recovery Time Objective (RTO — how long can you be down?) are defined and agreed upon
  • The DR plan has been rehearsed at least once (even on a staging environment)
  • An incident response process exists: who declares an incident, who fixes it, how are users notified, and how is the post-mortem written?
  • Capacity growth triggers are defined: you know at what usage level you need to scale up, and you have a plan for doing so before you hit it
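
The TLS expiry item lends itself to a small scheduled check. The sketch below uses only the Python standard library; the function names and the 30-day warning threshold are assumptions for illustration, and managed platforms that renew certificates automatically may make such a check redundant.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and a certificate's notAfter date."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // SECONDS_PER_DAY)


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to the host and return its certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(hostname: str, warn_days: int = 30) -> bool:
    """True when the certificate expires in fewer than `warn_days` days."""
    return days_until_expiry(fetch_not_after(hostname), time.time()) < warn_days


# Typical use from a scheduled job (performs a network call):
#     if needs_renewal("example.com"):
#         notify_on_call()  # hypothetical alerting hook
```

Separating the date arithmetic from the network call keeps the part most likely to contain an off-by-one error testable without a live connection.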

8. Docs

  • The README lets a developer who has never seen this codebase clone the repo, set up their local environment, and run the tests — without asking anyone for help
  • The README documents how to deploy to production, including all required environment variables
  • Key architectural decisions are recorded (in an ADR, a /docs folder, or inline comments) — future maintainers can understand why, not just what
  • The API is documented: either an OpenAPI spec, a Postman collection, or inline JSDoc/docstrings that describe inputs, outputs, and error cases
  • Any non-obvious operational procedures (seed the database, run a backfill, rotate a secret) are documented in the runbook
  • The security model is described somewhere: who can access what, and how access decisions are enforced
  • Known limitations and deferred work are documented — technical debt is explicit, not hidden

Final Gate

Before you ship, answer these four questions honestly:

  1. If this breaks at 2 a.m., can the on-call person diagnose and fix it without you? (Observability + Runbook)
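  2. If an attacker gets your database connection string, how much damage can they do? RLS only constrains roles that are subject to it, so the answer depends on which role that connection string logs in as. (Security + RLS)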
  3. If the primary database goes down, how long until users can work again, and how much data is lost? (DR Plan + Backups)
  4. If a new engineer joins tomorrow, can they ship a change safely within a week? (Docs + CI + Deploy)

If any answer is "I don't know," that is the next thing you build.


Build It Right, Or Don't Build It At All. 🏛️