System Design Template
A reusable, fill-in-the-blanks framework for designing any system — on a whiteboard, in a doc, or in an interview.
Before you write a single line of code, you need a map. This template walks you through every layer of a real system, from requirements to failure modes, so you make deliberate decisions up front instead of discovering expensive surprises later. Copy it, fill it in, and use it for any project: a URL shortener, a social feed, a payment processor, or your next big idea.
1. Requirements
Functional Requirements — What must the system DO?
List the core operations a user or service needs to perform.
[ ] _______________________________________________________________
[ ] _______________________________________________________________
[ ] _______________________________________________________________
[ ] _______________________________________________________________
Guiding questions:
- What are the primary user actions? (create, read, update, delete, search, stream?)
- Who are the actors? (end user, admin, other services, third-party APIs?)
- What is the single most important thing the system must never fail to do?
Non-Functional Requirements — How well must it do it?
| Property | Target / Constraint |
|---|---|
| Availability | ______ % uptime (e.g., 99.9 % ≈ 8.8 h/yr downtime) |
| Latency (p99) | < ______ ms for ______ operation |
| Consistency | Strong / Eventual / Read-your-writes |
| Durability | Data survives ______ failure scenarios |
| Security | Auth model: ______; PII sensitivity: ___ |
| Compliance | GDPR / HIPAA / none / ______ |
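An availability target is easier to reason about as a downtime budget. The following is a minimal sketch of that conversion; the function name `downtime_hours_per_year` is hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 h, ignoring leap years (assumption)


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# Compare a few common targets side by side.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows {downtime_hours_per_year(target):.2f} h/yr of downtime")
```

Each extra "nine" cuts the budget by a factor of ten, which is usually the point where the cost of the target becomes a design discussion.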
Out of Scope (explicitly excluded)
State what you are NOT building to keep the design focused.
- _______________________________________________________________
- _______________________________________________________________
Mini-example (URL Shortener):
- IN SCOPE: shorten a URL, redirect to original, basic analytics (click count).
- OUT OF SCOPE: custom domains, link expiry, team management, A/B redirect testing.
2. Scale Estimate
Do the math before you pick any architecture. Back-of-envelope is good enough at this stage.
| Metric | Your Estimate |
|---|---|
| Daily Active Users (DAU) | ______ |
| Reads per second (RPS) | ______ (peak) |
| Writes per second (WPS) | ______ (peak) |
| Read:Write ratio | ______ : 1 |
| Avg object/record size | ______ KB |
| Storage growth per day | ______ GB |
| Storage in 5 years | ______ TB |
| Bandwidth (egress/day) | ______ GB |
How to estimate writes → storage:
WPS × avg_record_size_bytes × 86,400 sec/day = bytes/day
bytes/day × 365 × 5 = 5-year storage floor
Mini-example (URL Shortener):
- 100 M URLs shortened total; 10 M redirects/day → ~116 RPS on average (10 M ÷ 86,400 s); peak will be higher.
- Each URL record ≈ 500 B. 100 M × 500 B = 50 GB total, which easily fits in one DB.
- Write rate: ~10 new URLs/sec, which is low. The system is read-heavy (roughly 10:1 read:write).
- Implication: cache aggressively for reads; writes are cheap.
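The arithmetic above can be checked with a few lines of code. This is a sketch only: the helper names are hypothetical, and the 5-year figure assumes the ~10 writes/sec rate is sustained, which the example does not state.

```python
SECONDS_PER_DAY = 86_400


def average_rate_per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


def storage_bytes(writes_per_second: float, record_bytes: int, years: int = 5) -> float:
    """Storage floor: WPS x record size x 86,400 s/day x 365 days x years."""
    bytes_per_day = writes_per_second * record_bytes * SECONDS_PER_DAY
    return bytes_per_day * 365 * years


redirect_rps = average_rate_per_second(10_000_000)  # 10 M redirects/day
total_bytes = 100_000_000 * 500                     # 100 M records x 500 B
five_year_bytes = storage_bytes(10, 500)            # assumes 10 writes/sec is sustained

print(f"average redirect rate: {redirect_rps:.1f} RPS")
print(f"100 M records: {total_bytes / 1e9:.0f} GB")
print(f"5-year floor at a sustained 10 WPS: {five_year_bytes / 1e9:.1f} GB")
```

A sustained 10 writes/sec comes to roughly 788 GB over five years, well above the 50 GB needed for 100 M records, so the write rate in the example is presumably a short-term or peak figure. Running both calculations is a cheap way to catch this kind of mismatch between estimates.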
3. API Design
List the key endpoints or operations. Use REST, gRPC, or GraphQL notation — your choice.
Method Path / Operation Auth? Notes
------ -------------------------------- ------- ---------------------------
______ ________________________________ ______ ___________________________
______ ________________________________ ______ ___________________________
______ ________________________________ ______ ___________________________
______ ________________________________ ______ ___________________________
Guiding questions:
- What does the client send, and what does it receive back?
- Are any operations idempotent? (PUT, DELETE — should be; POST — design carefully)
- What are the rate-limit surfaces? Which endpoints are abuse targets?
- Do you need pagination? Cursors or offsets?
Mini-example (URL Shortener):
POST   /api/shorten            { long_url }   → { short_code, short_url }
GET    /:short_code            (public)       → 301 redirect to long_url
GET    /api/links/:short_code  (authed user)  → stats { clicks, created_at }
DELETE /api/links/:short_code  (authed user)  → 204 No Content
4. Data Model
Sketch the key entities and how they relate.
Entities
Table / Collection: ______________________
id : UUID / auto-increment
_____________ : _______ (type, indexed?)
_____________ : _______
_____________ : _______
created_at : timestamp
updated_at : timestamp
Table / Collection: ______________________
id : UUID / auto-increment
_____________ : _______
_____________ : _______
created_at : timestamp
Relationships
[_____________] 1 ──── N [_____________]
[_____________] N ──── N [_____________] (via join table: _______)
Guiding questions:
- What are the most common query patterns? Design indexes for those first.
- Is any data write-once / append-only? (good for event logs, audit trails)
- What can be denormalized for read speed vs. what must stay normalized for consistency?
- Does any entity have a natural partition key for sharding later?
Mini-example (URL Shortener):
Table: links
id : BIGINT (auto-increment, PK)
short_code : CHAR(7) UNIQUE, indexed
long_url : TEXT
user_id : UUID FK → users.id (nullable for anonymous)
click_count : BIGINT DEFAULT 0
created_at : TIMESTAMPTZ
Table: clicks (optional analytics fan-out)
id : BIGINT
link_id : BIGINT FK → links.id
referrer : TEXT
clicked_at : TIMESTAMPTZ
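To sanity-check a data model before committing to it, it can help to load it into a throwaway database. The sketch below uses Python's built-in `sqlite3` with an in-memory database, so the Postgres types from the example are mapped to SQLite equivalents. The `NOT NULL` and length constraints are assumptions added for illustration, and the `users` table is omitted.

```python
import sqlite3

# SQLite stand-in for the example schema (types adapted from Postgres; assumption).
SCHEMA = """
CREATE TABLE links (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    short_code  TEXT NOT NULL UNIQUE CHECK (length(short_code) = 7),
    long_url    TEXT NOT NULL,
    user_id     TEXT,                          -- NULL for anonymous links
    click_count INTEGER NOT NULL DEFAULT 0,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE clicks (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    link_id     INTEGER NOT NULL REFERENCES links(id),
    referrer    TEXT,
    clicked_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the example schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.execute(
    "INSERT INTO links (short_code, long_url) VALUES (?, ?)",
    ("abc1234", "https://example.com/a"),
)
# The hot query path: resolve a short code through the UNIQUE index.
row = conn.execute(
    "SELECT long_url, click_count FROM links WHERE short_code = ?",
    ("abc1234",),
).fetchone()
print(row)
```

The `UNIQUE` constraint on `short_code` doubles as the index that serves the redirect lookup, which is the most common query in this design.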
5. High-Level Architecture
Draw the boxes and arrows. The placeholder below is a starting point — replace or extend it.
┌──────────────┐       ┌──────────────────────────────────────────┐
│    Client    │──────▶│              Load Balancer               │
│  (browser /  │       │        (L7, TLS termination, WAF)        │
│  mobile app) │       └─────────────────────┬────────────────────┘
└──────────────┘                             │
                                             ▼
                                ┌─────────────────────────┐
                                │      API Gateway /      │
                                │     Auth Middleware     │
                                └──────┬───────────┬──────┘
                                       │           │
                         ┌─────────────▼──┐     ┌──▼─────────────┐
                         │   Service A    │     │   Service B    │
                         │ (____________) │     │ (____________) │
                         └───────┬────────┘     └────────┬───────┘
                                 │                       │
                        ┌────────▼───────────────────────▼────────┐
                        │               Data Layer                │
                        │  ┌──────────┐  ┌────────┐  ┌─────────┐  │
                        │  │ Primary  │  │ Cache  │  │ Queue   │  │
                        │  │    DB    │  │(Redis) │  │(Kafka / │  │
                        │  │(Postgres)│  │        │  │  SQS)   │  │
                        │  └──────────┘  └────────┘  └─────────┘  │
                        └────────────────────┬────────────────────┘
                                             │
                              ┌──────────────▼──────────────┐
                              │    Object Storage / CDN     │
                              │      (S3 + CloudFront)      │
                              └─────────────────────────────┘
Component notes:
| Component | Role | Your Choice / Note |
|---|---|---|
| Load Balancer | Distribute traffic, health checks | ________________________ |
| API Gateway | Auth, rate limit, routing | ________________________ |
| Service A | ____________________________ | ________________________ |
| Service B | ____________________________ | ________________________ |
| Primary DB | Source of truth | ________________________ |
| Cache | Hot reads, session data | ________________________ |
| Message Queue | Async work, decouple producers | ________________________ |
| Object Storage | Blobs, images, backups | ________________________ |
| CDN | Edge delivery, static assets | ________________________ |
Guiding questions:
- Where does a synchronous request become expensive enough to go async?
- What can be precomputed and cached vs. what must be computed fresh?
- Which services absolutely need to talk to each other vs. which can be decoupled via events?
6. Bottlenecks & Scaling
Start with one DB, one cache, one queue. Then identify the first thing to break under load.
The First Bottleneck
The first thing to break at scale is: ______________________________
Because: ___________________________________________________________
Scaling Playbook (apply in order, don't skip ahead)
| Step | Technique | When to apply | Trade-off |
|---|---|---|---|
| 1 | Add a cache (Redis) | Read RPS hammers the DB | Cache invalidation complexity |
| 2 | Read replicas | Still DB-bound after caching | Replication lag, eventual reads |
| 3 | Horizontal scale app servers | CPU/memory on API tier saturated | Stateless requirement, LB needed |
| 4 | Async via queue | Slow writes / fan-out blocking requests | Complexity, at-least-once risk |
| 5 | DB sharding / partitioning | Single DB hits IOPS ceiling | Cross-shard queries are painful |
| 6 | Dedicated read services | Read patterns diverge wildly from writes | Data duplication, sync overhead |
Your specific plan:
First scaling move: ________________________________________________
Trigger: ___________________________ (metric or event)
Expected outcome: __________________________________________________
Mini-example (URL Shortener):
- First bottleneck: DB reads on SELECT long_url FROM links WHERE short_code = ? at ~116 RPS.
- Fix: Redis cache keyed by short_code, TTL 24 h. Expected cache hit rate: ~95 %.
- Result: DB read load drops to ~6 RPS (116 × 0.05). No sharding needed until 100× growth.
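The cache fix in this mini-example is commonly implemented as a cache-aside read: check the cache, fall back to the database on a miss, then populate the cache. The sketch below uses a small in-process `TTLCache` class as a stand-in for Redis. The class, the `resolve` function, and the injected `db_lookup` callable are all hypothetical names for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def resolve(short_code: str, cache: TTLCache,
            db_lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Cache-aside read: try the cache, fall back to the DB, then fill the cache."""
    cached = cache.get(short_code)
    if cached is not None:
        return cached
    long_url = db_lookup(short_code)
    if long_url is not None:
        cache.set(short_code, long_url)
    return long_url


db_calls = []


def fake_db(code: str) -> Optional[str]:
    """Pretend database that records every query it receives."""
    db_calls.append(code)
    return {"abc1234": "https://example.com/a"}.get(code)


cache = TTLCache(ttl_seconds=24 * 3600)        # TTL 24 h, as in the example
first = resolve("abc1234", cache, fake_db)     # miss: goes to the DB
second = resolve("abc1234", cache, fake_db)    # hit: served from the cache
db_rps_after_cache = 116 * (1 - 0.95)          # residual DB load at a 95 % hit rate
print(first, second, len(db_calls), f"{db_rps_after_cache:.1f} RPS")
```

This version does not cache misses, so repeated lookups of unknown codes still reach the database. Whether to cache negative results is a separate decision with its own invalidation cost.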
7. Failure Modes
For every component, ask: what happens when this dies?
| Component | Failure Mode | Detection | Recovery / Degradation Strategy |
|---|---|---|---|
| ____________ | ______________________________________ | __________________________ | _________________________________________ |
| ____________ | ______________________________________ | __________________________ | _________________________________________ |
| ____________ | ______________________________________ | __________________________ | _________________________________________ |
| ____________ | ______________________________________ | __________________________ | _________________________________________ |
| ____________ | ______________________________________ | __________________________ | _________________________________________ |
Idempotency & Retries
Which operations are idempotent by design? ________________________
Where do you need an idempotency key? ________________________
Retry policy for async jobs: ________________________
- Max retries: ______
- Backoff: exponential / linear / fixed — ______
- Dead-letter queue: yes / no
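To make the retry policy concrete, here is a minimal sketch of bounded retries with exponential backoff and a dead-letter list. The function names, the 0.5 s base delay, and the 30 s cap are assumptions chosen for illustration, and a plain Python list stands in for a real dead-letter queue.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


def backoff_delays(max_retries: int, base_seconds: float = 0.5,
                   cap_seconds: float = 30.0) -> List[float]:
    """Exponential backoff schedule: base, 2x base, 4x base, ... up to the cap."""
    return [min(cap_seconds, base_seconds * 2 ** attempt) for attempt in range(max_retries)]


def run_with_retries(job: Callable[[Any], Any], payload: Any, max_retries: int = 3,
                     dead_letters: Optional[List[Tuple[Any, str]]] = None,
                     sleep: Callable[[float], None] = time.sleep) -> Any:
    """Run job(payload); retry on failure, then park the payload in dead_letters."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):  # one initial try plus max_retries retries
        try:
            return job(payload)
        except Exception as exc:
            if attempt == max_retries:
                if dead_letters is not None:
                    dead_letters.append((payload, repr(exc)))
                return None  # caller sees None when the job was dead-lettered
            sleep(delays[attempt])


attempts = {"count": 0}


def flaky_job(payload: str) -> str:
    """Fails twice, then succeeds (simulates a transient outage)."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return f"processed {payload}"


slept: List[float] = []
dlq: List[Tuple[Any, str]] = []
result = run_with_retries(flaky_job, "click-event-1", max_retries=3,
                          dead_letters=dlq, sleep=slept.append)
print(result, slept, dlq)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test. Real deployments usually add random jitter to each delay so that many failing clients do not retry in lockstep, and the job itself must be idempotent because a retry can repeat work that partially succeeded.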
Graceful Degradation
If the cache is down: ________________________________________
If the queue is down: ________________________________________
If a downstream service is down: ___________________________________
Minimum viable experience: ________________________________________
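Two common answers to these prompts are "fall through to the source of truth when the cache is unreachable" and "drop non-critical events when the queue is unreachable". The sketch below shows both under stated assumptions: the function names are hypothetical, dependency outages are assumed to surface as `ConnectionError`, and a `Counter` stands in for a metrics client.

```python
from collections import Counter
from typing import Any, Callable, Optional


def read_with_fallback(key: str, cache_get: Callable[[str], Optional[str]],
                       source_get: Callable[[str], Optional[str]],
                       metrics: Counter) -> Optional[str]:
    """Serve from the cache when possible; pass through to the source if it is down."""
    try:
        cached = cache_get(key)
    except ConnectionError:
        metrics["cache_errors"] += 1  # count the outage, but keep serving
        cached = None
    if cached is not None:
        return cached
    return source_get(key)


def publish_best_effort(event: Any, enqueue: Callable[[Any], None],
                        metrics: Counter) -> bool:
    """Enqueue a non-critical event; on queue failure, drop it and count the drop."""
    try:
        enqueue(event)
    except ConnectionError:
        metrics["events_dropped"] += 1
        return False
    return True


def unreachable(*_args: Any) -> None:
    """Simulates a dependency that is down."""
    raise ConnectionError("dependency unreachable")


metrics: Counter = Counter()
source = {"item-1": "value-1"}
value = read_with_fallback("item-1", unreachable, source.get, metrics)
accepted = publish_best_effort({"type": "view", "key": "item-1"}, unreachable, metrics)
print(value, accepted, dict(metrics))
```

The request still succeeds in both cases. The counters are what make the degradation visible, so they should feed an alert rather than sit unread.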
Mini-example (URL Shortener):
- DB down → 503; no silent fallback (stale redirects are a security risk).
- Cache down → pass-through to DB; latency spikes but system stays up.
- Click analytics queue down → drop click events silently, log metric; core redirect unaffected.
- Idempotency: POST /api/shorten with the same long_url by the same user → return the existing record (upsert on long_url + user_id).
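The idempotency rule for the shorten operation can be sketched as "look up by (user_id, long_url) before creating". The `LinkStore` class below is a hypothetical in-memory stand-in for the links table, and the code generator is injected so the example stays deterministic.

```python
import itertools
from typing import Callable, Dict, Optional, Tuple


class LinkStore:
    """In-memory stand-in for the links table, deduplicated on (user_id, long_url)."""

    def __init__(self, make_code: Callable[[], str]) -> None:
        self._make_code = make_code
        self._code_by_owner_url: Dict[Tuple[Optional[str], str], str] = {}
        self._url_by_code: Dict[str, str] = {}

    def shorten(self, user_id: Optional[str], long_url: str) -> Tuple[str, bool]:
        """Return (short_code, created). Repeating a request returns the same code."""
        key = (user_id, long_url)
        existing = self._code_by_owner_url.get(key)
        if existing is not None:
            return existing, False  # retry or duplicate: no new record
        code = self._make_code()
        while code in self._url_by_code:  # regenerate if the code is already taken
            code = self._make_code()
        self._code_by_owner_url[key] = code
        self._url_by_code[code] = long_url
        return code, True


counter = itertools.count(1)
store = LinkStore(make_code=lambda: f"code{next(counter):03d}")  # deterministic test codes

first = store.shorten("user-1", "https://example.com/a")
retry = store.shorten("user-1", "https://example.com/a")   # same user, same URL
other = store.shorten("user-2", "https://example.com/a")   # different user, same URL
print(first, retry, other)
```

In a real database the lookup and insert must be atomic, typically through a unique constraint on (user_id, long_url) plus an upsert. In PostgreSQL, NULL values are treated as distinct in a unique constraint by default, so anonymous links with a NULL user_id would need separate handling.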
8. Trade-offs
Every design decision has a cost. Make it explicit.
| Decision Made | What You Gained | What You Gave Up |
|---|---|---|
| __________________________________________ | ________________________________ | ___________________________________ |
| __________________________________________ | ________________________________ | ___________________________________ |
| __________________________________________ | ________________________________ | ___________________________________ |
| __________________________________________ | ________________________________ | ___________________________________ |
Common trade-off axes to address:
- Consistency vs. Availability (CAP theorem — which did you lean toward and why?)
- Latency vs. Throughput (did you optimize for speed per request or total throughput?)
- Simplicity vs. Flexibility (did you build for the current scale or the hoped-for scale?)
- Operational cost vs. Engineering cost (managed service vs. self-hosted?)
The biggest trade-off in this design: _____________________________
We accepted it because: ___________________________________________
The assumption that would invalidate this choice: __________________
Mini-example (URL Shortener):
- Chose eventual consistency for click counts (Redis INCR + async flush) → gained write speed, gave up exact real-time count.
- Chose CHAR(7) Base62 random code over sequential IDs → gained unpredictability (anti-enumeration), gave up insert locality in the index.
- Chose a single Postgres instance to start → gained simplicity, gave up horizontal write scaling. Revisit at 50 M links.
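The random Base62 trade-off can be quantified with a short calculation. The sketch below generates a 7-character code with Python's `secrets` module and estimates how often a fresh code would collide with an existing one. The function names are hypothetical, and the alphabet ordering is an arbitrary choice.

```python
import secrets
import string

BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols
CODE_LENGTH = 7


def random_short_code(length: int = CODE_LENGTH) -> str:
    """Draw an unpredictable code from a cryptographically secure source."""
    return "".join(secrets.choice(BASE62) for _ in range(length))


def keyspace(length: int = CODE_LENGTH) -> int:
    """Number of distinct codes of the given length."""
    return len(BASE62) ** length


def collision_probability(existing_codes: int, length: int = CODE_LENGTH) -> float:
    """Chance that one freshly drawn code is already taken."""
    return existing_codes / keyspace(length)


print(random_short_code())
print(f"keyspace: {keyspace():,} codes")
print(f"collision chance at 100 M links: {collision_probability(100_000_000):.6%}")
```

With 62^7 (about 3.5 trillion) possible codes, a collision at 100 M links is rare but not impossible, so the insert path still needs the UNIQUE constraint on short_code and a retry when it fires.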
Filled Sample: URL Shortener at a Glance
| Section | Answer |
|---|---|
| Core function | Take a long URL, return a 7-char short code; redirect on GET |
| Scale | 10 M redirects/day (~116 RPS average), ~10 writes/sec, 50 GB data total |
| First bottleneck | DB reads — fixed with Redis cache (TTL 24 h, ~95 % hit rate) |
| Key API | POST /api/shorten, GET /:code (redirect), GET /api/links/:code (stats) |
| Primary table | links (short_code UNIQUE, long_url, user_id, click_count) |
| Consistency choice | Eventual for analytics; strong for redirect resolution |
| Biggest trade-off | Random short codes over sequential — unpredictable but index-unfriendly at extreme scale |
| Failure strategy | Cache miss → DB fallback; click queue failure → silent drop (redirect still works) |
Template designed for TOVCDI — free course at hyvecares.org. Copy it, fill it in, and build something real.
Build It Right, Or Don't Build It At All. 🏛️