Production-Readiness Checklist
Is this actually ready for real users and real scale?
This is the final gate. Before you hand an application to real people — people who trust it with their data, their time, and sometimes their safety — you owe it to them to verify that it is correct, secure, observable, and operable. This checklist covers the full arc of Stage 3: from your first meaningful test to the moment you flip the switch to production. Work through every section. If you cannot check a box, that is not a bureaucratic inconvenience — it is a known risk you are choosing to carry. Know what you are carrying.
No checklist replaces judgment. Use it as a forcing function to have the right conversations before users find the problems for you.
1. Correctness & Tests
- The critical path (the one flow that breaks the app if it fails) is covered by at least one automated test
- Unit tests cover non-trivial business logic with meaningful assertions — not just "it returns something"
- Integration tests exercise the real database or a faithful in-memory substitute
- At least one end-to-end test walks the most important user journey from the UI through to persistence
- All tests are deterministic — no flaky tests are tolerated or suppressed
- CI pipeline runs the full test suite on every push and blocks merges on failure
- Test coverage is measured; known gaps are documented and accepted consciously, not accidentally
- A manual smoke test checklist exists and was run against the latest build — by a human, not an assumption
- Edge cases that have already caused bugs are covered by regression tests
- Tests run in an environment that is meaningfully similar to production (same database engine, same environment-variable structure)
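To make "meaningful assertions" and "deterministic" concrete, here is a minimal sketch. It assumes a hypothetical business rule (a 14-day trial) and a hypothetical function `is_trial_expired`; neither comes from any particular codebase. The current time is passed in as an argument instead of being read from the system clock, so the result never depends on when the test runs.

```python
import unittest
from datetime import datetime, timedelta, timezone

TRIAL_LENGTH = timedelta(days=14)  # hypothetical business rule


def is_trial_expired(started_at: datetime, now: datetime) -> bool:
    """Return True once the full trial period has elapsed.

    `now` is injected so tests can pin it to a fixed value.
    """
    return now - started_at >= TRIAL_LENGTH


class TrialExpiryTest(unittest.TestCase):
    START = datetime(2024, 1, 1, tzinfo=timezone.utc)

    def test_active_one_second_before_deadline(self):
        now = self.START + TRIAL_LENGTH - timedelta(seconds=1)
        self.assertFalse(is_trial_expired(self.START, now))

    def test_expired_exactly_at_deadline(self):
        # Boundary values are where off-by-one bugs tend to hide,
        # which makes them good regression-test candidates.
        now = self.START + TRIAL_LENGTH
        self.assertTrue(is_trial_expired(self.START, now))
```

Saved as a test module, this would typically be run with `python -m unittest`. Each test asserts a specific expected value at a specific boundary, not merely that the function returns something.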
2. Data
- Every schema migration is written as an explicit, versioned migration file (not a manual ALTER TABLE)
- Migrations are applied in order in CI and staging before they ever touch production
- A rollback migration exists for every breaking schema change
- Automated backups are configured and their retention period is documented
- A restore from backup has been tested end-to-end — not just assumed to work
- Multi-step writes that must succeed or fail together are wrapped in transactions
- No orphaned rows or broken foreign-key relationships remain from partial writes made outside a transaction
- Indexes exist on every column used in a WHERE or JOIN in high-frequency queries
- The query plan for the three most-executed queries has been inspected (e.g., EXPLAIN ANALYZE)
- Soft deletes (deleted_at) or audit columns are in place wherever regulatory or product requirements demand them
- PII fields are identified; their storage, access, and deletion paths comply with any applicable data regulations
- Database connection pool limits are set and match the platform's limits — connection exhaustion has been tested
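The transaction item above can be sketched as follows. This is an illustration only: it uses the standard-library `sqlite3` module as a stand-in for the real database, and the `accounts` table and `transfer` function are hypothetical. The point is that both updates either commit together or roll back together.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, from_id: int, to_id: int, amount: int) -> None:
    """Move `amount` between two accounts atomically."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (from_id,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes BOTH updates, so no partial write survives.
            raise ValueError("insufficient funds")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

transfer(conn, 1, 2, 30)       # succeeds
try:
    transfer(conn, 1, 2, 500)  # fails and is rolled back
except ValueError:
    pass

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [(1, 70), (2, 30)]
```

The failed transfer leaves the balances exactly as the successful one left them, which is the behaviour the checklist item asks you to verify.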
3. Security
The Security Audit Checklist is the deep dive. This section is the mandatory minimum.
- No secrets (API keys, tokens, passwords, connection strings) exist anywhere in the source code or git history
- git log -S "sk-" --all (or equivalent) has been run and came back clean
- Secrets are stored in the platform secret store (Vercel env vars, Supabase vault, etc.), not in .env files committed to the repo
- Authentication is enforced on every route/endpoint that operates on user data
- Authorization checks verify that the authenticated user owns or has permission to access the specific resource — not just that they are logged in
- Row-Level Security (RLS) or equivalent database-level rules are enabled and tested with a user who should NOT have access
- All user-supplied input is validated server-side before it touches the database or any downstream system
- The application is not vulnerable to SQL injection (parameterized queries or ORM throughout — no string interpolation into queries)
- The application is not vulnerable to XSS (output escaped; Content-Security-Policy header set)
- HTTPS is enforced on all production traffic; HTTP redirects to HTTPS
- HSTS header is set with an appropriate max-age
- File uploads (if any) are validated for type and size server-side, stored outside the web root, and served through a proxy; they are never executed
- Dependencies have been audited (npm audit, pip-audit, or equivalent); critical and high CVEs are resolved or accepted with documented justification
- CORS policy is restrictive: not "*" in production unless explicitly justified
- Rate limiting is applied to authentication endpoints and any endpoint that accepts user-submitted data
- Error messages returned to clients do not leak stack traces, internal paths, or database schema details
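Two of the items above, ownership-based authorization and parameterized queries, can be shown in one short sketch. It assumes a hypothetical `documents` table with an `owner_id` column and again uses `sqlite3` as a stand-in; a real application would apply the same idea through its own driver or ORM.

```python
import sqlite3
from typing import Optional


def get_document(conn: sqlite3.Connection, user_id: int, doc_id: str) -> Optional[str]:
    """Return the document body only if `user_id` owns it, else None."""
    row = conn.execute(
        # Placeholders keep user input out of the SQL text entirely.
        "SELECT body FROM documents WHERE id = ? AND owner_id = ?",
        (doc_id, user_id),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, owner_id INTEGER, body TEXT)")
conn.execute("INSERT INTO documents VALUES ('doc-1', 1, 'alice notes')")
conn.commit()

print(get_document(conn, 1, "doc-1"))             # owner: alice notes
print(get_document(conn, 2, "doc-1"))             # logged in, not owner: None
print(get_document(conn, 1, "doc-1' OR '1'='1"))  # injection attempt: None
```

Putting the ownership condition in the query itself means a caller cannot forget it. Whether a `None` result maps to 404 or 403 is a design choice this sketch leaves open.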
4. API & Performance
- Every API endpoint validates its input and returns a meaningful error (not a 500) for bad data
- HTTP status codes are semantically correct (400 for client errors, 401/403 for auth failures, 404 for not found, 409 for conflicts, 500 for genuine server errors; never 200 with an error body)
- All list endpoints are paginated; no endpoint returns an unbounded result set
- Pagination parameters have upper bounds enforced server-side (e.g., limit capped at 100)
- Rate limiting is configured on public and authenticated endpoints
- The critical-path request has been load-tested at a realistic concurrent-user volume
- Sustained load does not cause memory leaks (heap growth was monitored during the load test)
- N+1 query problems have been identified and resolved on any endpoint that returns a list with nested data
- Response times for the critical path are within the target SLA under expected peak load
- Large payloads (images, files, bulk exports) are streamed or pre-signed — not buffered in the server process
- Caching is applied where appropriate (CDN for static assets; query caching where staleness is acceptable) and cache invalidation is correct
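As an illustration of the pagination-bounds and input-validation items, here is a minimal, framework-agnostic sketch. The names `parse_pagination` and `BadRequest`, the default of 20, and the query-parameter names are assumptions; only the cap of 100 echoes the example in the list above.

```python
from typing import Mapping, Tuple

MAX_LIMIT = 100     # upper bound enforced server-side
DEFAULT_LIMIT = 20  # hypothetical default page size


class BadRequest(Exception):
    """Hypothetical error that the web layer would map to HTTP 400."""


def parse_pagination(query: Mapping[str, str]) -> Tuple[int, int]:
    """Validate raw query parameters and return a safe (limit, offset)."""
    try:
        limit = int(query.get("limit", DEFAULT_LIMIT))
        offset = int(query.get("offset", 0))
    except (TypeError, ValueError):
        raise BadRequest("limit and offset must be integers")
    if limit < 1 or offset < 0:
        raise BadRequest("limit must be >= 1 and offset must be >= 0")
    # Clamp instead of rejecting, so oversized requests still get a bounded page.
    return min(limit, MAX_LIMIT), offset


print(parse_pagination({"limit": "5000", "offset": "40"}))  # (100, 40)
```

Clamping versus rejecting an oversized `limit` is a judgment call; either way the bound is enforced on the server, and malformed input produces a client error and not a 500.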
5. Config & Deploy
- All environment-specific config is read from environment variables — no hardcoded values for URLs, limits, or feature flags
- The application validates required environment variables at startup and fails fast with a clear error if any are missing
- There are separate environment configurations for development, staging, and production — staging mirrors production as closely as possible
- Secrets in production are stored in the platform secret store; no human needs to copy-paste them to deploy
- The deploy process is scripted or automated — a new developer can deploy without tribal knowledge
- The deploy process is idempotent — running it twice does not break anything
- A rollback procedure exists and has been tested: you know exactly what command or button to press to revert to the previous version
- Zero-downtime deployments are in place (or a maintenance window is explicitly planned and communicated)
- Database migrations run automatically as part of the deploy pipeline in the correct order
- Feature flags or environment checks prevent staging-only features from appearing in production
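The fail-fast startup check described above might look like the following sketch. The variable names `DATABASE_URL` and `SESSION_SECRET` are placeholders, not requirements of any particular platform; substitute whatever your application actually needs.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("DATABASE_URL", "SESSION_SECRET")  # hypothetical names


class ConfigError(RuntimeError):
    """Raised at startup when required configuration is absent."""


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read required settings, failing fast with one clear message."""
    # Empty strings count as missing: an unset secret and a blank one
    # are equally unusable.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ConfigError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}


try:
    load_config({"DATABASE_URL": "postgres://example"})
except ConfigError as err:
    print(err)  # Missing required environment variables: SESSION_SECRET
```

Calling `load_config()` once at process start, before serving any traffic, turns a missing variable into an immediate, readable deploy failure and not a runtime surprise.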
6. Observability
- Structured logging is enabled — logs are JSON or key-value pairs, not unstructured strings
- Every request is tagged with a unique request ID that flows through all log lines for that request
- The application logs at the right levels: DEBUG for development noise, INFO for normal operations, WARN for recoverable anomalies, ERROR for failures that need attention
- An error-tracking service (Sentry, Bugsnag, or equivalent) is wired up and receiving events from production
- Unhandled exceptions and promise rejections are captured by the error tracker — not silently swallowed
- At least one meaningful alert is configured: you are paged or notified when the error rate spikes or the service goes down
- A key business metric (signups, conversions, transactions processed) is instrumented and visible on a dashboard
- Logs are retained long enough to diagnose incidents (minimum 30 days; 90 days preferred)
- The logging pipeline has been tested — you have confirmed that a known error actually shows up in the error tracker
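Structured logging with a per-request ID can be sketched using only the standard library. The field names, the `build_logger` helper, and the `handle_request` function are illustrative assumptions; a real service would set the request ID in its middleware and might use a dedicated logging library instead.

```python
import contextvars
import json
import logging
import uuid

# Holds the current request's ID; "-" means "outside any request".
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "app", stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger) -> None:
    """Hypothetical request handler: every line shares one request ID."""
    token = request_id_var.set(str(uuid.uuid4()))
    try:
        logger.info("request started")
        logger.warning("cache miss, falling back to database")
    finally:
        request_id_var.reset(token)


handle_request(build_logger())
```

Because the ID lives in a context variable, code deep in the call stack logs it without having to pass it through every function signature.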
7. Operations
- A runbook exists that describes: how to start/stop/restart the service; how to run a migration; how to roll back; and who to contact if something breaks
- The rollback procedure is documented step-by-step and has been executed at least once in a non-production environment
- Dependency updates are automated or regularly scheduled (Dependabot, Renovate, or a manual monthly review)
- TLS certificates are monitored for expiry; renewal is automated or calendared with lead time
- Custom domain DNS records are documented; their TTLs are understood
- Platform quotas (database connections, serverless function invocations, storage) are known and monitored against current usage
- A Disaster Recovery plan exists: the Recovery Point Objective (RPO — how much data can you lose?) and Recovery Time Objective (RTO — how long can you be down?) are defined and agreed upon
- The DR plan has been rehearsed at least once (even on a staging environment)
- An incident response process exists: who declares an incident, who fixes it, how are users notified, and how is the post-mortem written?
- Capacity growth triggers are defined: you know at what usage level you need to scale up, and you have a plan for doing so before you hit it
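For the certificate-expiry item, the core of a monitoring check is a small date calculation. The sketch below assumes you already have the certificate's `notAfter` string (the format returned by Python's `ssl.SSLSocket.getpeercert()`); fetching it over the network is left out, and the 21-day lead time is an arbitrary placeholder.

```python
import ssl
import time
from typing import Optional

WARN_DAYS = 21  # hypothetical renewal lead time


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate expires.

    `not_after` uses the certificate time format,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, now: Optional[float] = None) -> bool:
    """True when the certificate is inside the warning window (or expired)."""
    return days_until_expiry(not_after, now) <= WARN_DAYS
```

A scheduled job could call `needs_renewal` for each production hostname and raise an alert when it returns True, which covers the "calendared with lead time" case even when renewal itself is automated.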
8. Docs
- The README lets a developer who has never seen this codebase clone the repo, set up their local environment, and run the tests — without asking anyone for help
- The README documents how to deploy to production, including all required environment variables
- Key architectural decisions are recorded (in an ADR, a /docs folder, or inline comments) so future maintainers can understand why, not just what
- The API is documented: either an OpenAPI spec, a Postman collection, or inline JSDoc/docstrings that describe inputs, outputs, and error cases
- Any non-obvious operational procedures (seed the database, run a backfill, rotate a secret) are documented in the runbook
- The security model is described somewhere: who can access what, and how access decisions are enforced
- Known limitations and deferred work are documented — technical debt is explicit, not hidden
Final Gate
Before you ship, answer these four questions honestly:
- If this breaks at 2 a.m., can the on-call person diagnose and fix it without you? (Observability + Runbook)
- If an attacker gets your database connection string, how much damage can they do? (Security + RLS)
- If the primary database goes down, how long until users can work again, and how much data is lost? (DR Plan + Backups)
- If a new engineer joins tomorrow, can they ship a change safely within a week? (Docs + CI + Deploy)
If any answer is "I don't know," that is the next thing you build.
Build It Right, Or Don't Build It At All. 🏛️