System Design Template

The whiteboard framework: requirements → estimate → design → scale → failure modes.

A reusable, fill-in-the-blanks framework for designing any system — on a whiteboard, in a doc, or in an interview.

Before you write a single line of code, you need a map. This template walks you through every layer of a real system — from requirements to failure modes — so you can make deliberate decisions instead of expensive surprises. Copy it, fill it in, and use it for any project: a URL shortener, a social feed, a payment processor, or your next big idea.


1. Requirements

Functional Requirements — What must the system DO?

List the core operations a user or service needs to perform.

[ ] _______________________________________________________________
[ ] _______________________________________________________________
[ ] _______________________________________________________________
[ ] _______________________________________________________________

Guiding questions:

  • What are the primary user actions? (create, read, update, delete, search, stream?)
  • Who are the actors? (end user, admin, other services, third-party APIs?)
  • What is the single most important thing the system must never fail to do?

Non-Functional Requirements — How well must it do it?

Property        Target / Constraint
--------------  ---------------------------------------------------
Availability    ______ % uptime (e.g., 99.9 % ≈ 8.76 h/yr downtime)
Latency (p99)   < ______ ms for ______ operation
Consistency     Strong / Eventual / Read-your-writes
Durability      Data survives ______ failure scenarios
Security        Auth model: ______; PII sensitivity: ___
Compliance      GDPR / HIPAA / none / ______
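
An availability target is easier to reason about as a downtime budget. The sketch below converts an uptime percentage into allowed hours of downtime per year. It assumes a 365-day year, and the name `allowed_downtime_hours` is illustrative rather than part of the template.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target} % uptime allows about {hours:.2f} h/yr of downtime")
```

Each extra "nine" cuts the budget by a factor of ten, which is a useful check when choosing a target.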

Out of Scope (explicitly excluded)

State what you are NOT building to keep the design focused.

- _______________________________________________________________
- _______________________________________________________________

Mini-example (URL Shortener):

  • IN SCOPE: shorten a URL, redirect to original, basic analytics (click count).
  • OUT OF SCOPE: custom domains, link expiry, team management, A/B redirect testing.

2. Scale Estimate

Do the math before you pick any architecture. Back-of-envelope is good enough at this stage.

Metric                     Your Estimate
-------------------------  ----------------
Daily Active Users (DAU)   ______
Reads per second (RPS)     ______ (peak)
Writes per second (WPS)    ______ (peak)
Read:Write ratio           ______ : 1
Avg object/record size     ______ KB
Storage growth per day     ______ GB
Storage in 5 years         ______ TB
Bandwidth (egress/day)     ______ GB

How to estimate writes → storage:

WPS  ×  avg_record_size_bytes  ×  86,400 sec/day  =  bytes/day
bytes/day  ×  365  ×  5  =  5-year storage floor
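
The two formulas above translate directly into code. This is a minimal sketch with hypothetical inputs (50 writes/sec, 1,000-byte records) chosen only to show the arithmetic. It uses decimal units (1 GB = 10^9 bytes) and ignores indexes, replication, and backups, which is why the result is a floor.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(wps: int, avg_record_size_bytes: int) -> int:
    """WPS x record size x seconds per day."""
    return wps * avg_record_size_bytes * SECONDS_PER_DAY


def five_year_storage_floor(wps: int, avg_record_size_bytes: int) -> int:
    """Daily growth x 365 days x 5 years, with no overhead included."""
    return bytes_per_day(wps, avg_record_size_bytes) * 365 * 5


# Hypothetical inputs, not taken from the template's mini-example.
daily = bytes_per_day(wps=50, avg_record_size_bytes=1_000)
floor = five_year_storage_floor(wps=50, avg_record_size_bytes=1_000)

print(f"{daily / 1e9:.2f} GB/day")      # 4.32 GB/day
print(f"{floor / 1e12:.3f} TB in 5 yr")  # 7.884 TB in 5 yr
```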

Mini-example (URL Shortener):

  • 100 M URLs shortened total; 10 M redirects/day ÷ 86,400 s ≈ 116 RPS averaged over the day (peak will be higher).
  • Each URL record ≈ 500 B. 100 M × 500 B = 50 GB total, which fits comfortably on a single DB instance.
  • Write rate: ~10 new URLs/sec, which is low. The system is read-heavy (roughly 10:1 read:write).
  • Implication: cache aggressively for reads; writes are cheap.
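
The mini-example numbers can be checked in a few lines. The sketch below only restates the arithmetic from the bullets above; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


redirect_rps = per_second(10_000_000)          # average read rate
total_storage_gb = 100_000_000 * 500 / 1e9     # 100 M records x 500 B each
read_write_ratio = redirect_rps / 10           # against ~10 writes/sec

print(f"reads: about {redirect_rps:.0f} RPS on average")
print(f"storage: {total_storage_gb:.0f} GB")
print(f"read:write is about {read_write_ratio:.1f}:1")
```

The ratio comes out near 11.6:1, which the template rounds to 10:1.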

3. API Design

List the key endpoints or operations. Use REST, gRPC, or GraphQL notation — your choice.

Method    Path / Operation                   Auth?    Notes
------    --------------------------------   -------  ---------------------------
______    ________________________________   ______   ___________________________
______    ________________________________   ______   ___________________________
______    ________________________________   ______   ___________________________
______    ________________________________   ______   ___________________________

Guiding questions:

  • What does the client send, and what does it receive back?
  • Are any operations idempotent? (PUT, DELETE — should be; POST — design carefully)
  • What are the rate-limit surfaces? Which endpoints are abuse targets?
  • Do you need pagination? Cursors or offsets?
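
On the cursor-versus-offset question, a cursor remembers the last item the client saw instead of a row count, so pages stay stable while new rows are inserted. The sketch below shows the idea over an in-memory list sorted by `id`. The function name `list_page` and the dict-shaped items are assumptions for illustration. In a SQL store the same idea is typically expressed as `WHERE id > :cursor ORDER BY id LIMIT :limit`.

```python
from typing import Dict, List, Optional, Tuple


def list_page(
    items: List[Dict[str, int]], limit: int, cursor: Optional[int] = None
) -> Tuple[List[Dict[str, int]], Optional[int]]:
    """Return one page of items plus the cursor for the next page.

    `items` must be sorted by ascending "id". `cursor` is the last id the
    client has already seen, or None for the first page.
    """
    start = 0
    if cursor is not None:
        # Skip everything up to and including the cursor id.
        start = next(
            (i for i, item in enumerate(items) if item["id"] > cursor), len(items)
        )
    page = items[start : start + limit]
    has_more = start + limit < len(items)
    next_cursor = page[-1]["id"] if page and has_more else None
    return page, next_cursor


links = [{"id": i} for i in range(1, 6)]  # ids 1..5
page, cursor = list_page(links, limit=2)
while True:
    print([item["id"] for item in page], "next cursor:", cursor)
    if cursor is None:
        break
    page, cursor = list_page(links, limit=2, cursor=cursor)
```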

Mini-example (URL Shortener):

POST   /api/shorten           { long_url }  → { short_code, short_url }
GET    /:short_code           (public)      → 301 redirect to long_url
GET    /api/links/:short_code (authed user) → stats { clicks, created_at }
DELETE /api/links/:short_code (authed user) → 204 No Content

4. Data Model

Sketch the key entities and how they relate.

Entities

Table / Collection: ______________________
  id            : UUID / auto-increment
  _____________ : _______ (type, indexed?)
  _____________ : _______
  _____________ : _______
  created_at    : timestamp
  updated_at    : timestamp

Table / Collection: ______________________
  id            : UUID / auto-increment
  _____________ : _______
  _____________ : _______
  created_at    : timestamp

Relationships

[_____________]  1 ──── N  [_____________]
[_____________]  N ──── N  [_____________]  (via join table: _______)

Guiding questions:

  • What are the most common query patterns? Design indexes for those first.
  • Is any data write-once / append-only? (good for event logs, audit trails)
  • What can be denormalized for read speed vs. what must stay normalized for consistency?
  • Does any entity have a natural partition key for sharding later?

Mini-example (URL Shortener):

Table: links
  id          : BIGINT (auto-increment, PK)
  short_code  : CHAR(7) UNIQUE, indexed
  long_url    : TEXT
  user_id     : UUID FK → users.id (nullable for anonymous)
  click_count : BIGINT DEFAULT 0
  created_at  : TIMESTAMPTZ

Table: clicks  (optional analytics fan-out)
  id          : BIGINT
  link_id     : BIGINT FK → links.id
  referrer    : TEXT
  clicked_at  : TIMESTAMPTZ
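
To try the schema above without a database server, here is a minimal sketch using Python's built-in `sqlite3` module. It is an approximation, not the Postgres DDL: SQLite has no native UUID or TIMESTAMPTZ types, so those columns are stored as TEXT, and the `users` table is omitted, so `user_id` carries no foreign key here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE links (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    short_code  TEXT NOT NULL UNIQUE CHECK (length(short_code) = 7),
    long_url    TEXT NOT NULL,
    user_id     TEXT,                      -- nullable for anonymous links
    click_count INTEGER NOT NULL DEFAULT 0,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE clicks (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    link_id     INTEGER NOT NULL REFERENCES links(id),
    referrer    TEXT,
    clicked_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_clicks_link_id ON clicks(link_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO links (short_code, long_url) VALUES (?, ?)",
    ("abc1234", "https://example.com/a/very/long/path"),
)
# The hot-path query: the UNIQUE constraint gives short_code an index.
row = conn.execute(
    "SELECT id, long_url FROM links WHERE short_code = ?", ("abc1234",)
).fetchone()
conn.execute("INSERT INTO clicks (link_id, referrer) VALUES (?, ?)", (row[0], None))
print(row[1])
```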

5. High-Level Architecture

Draw the boxes and arrows. The placeholder below is a starting point — replace or extend it.

┌─────────────┐       ┌──────────────────────────────────────────┐
│   Client     │──────▶│              Load Balancer               │
│ (browser /   │       │        (L7, TLS termination, WAF)        │
│  mobile app) │       └──────────────┬───────────────────────────┘
└─────────────┘                       │
                                      ▼
                       ┌─────────────────────────┐
                       │      API Gateway /       │
                       │   Auth Middleware         │
                       └────────┬────────┬────────┘
                                │        │
                    ┌───────────▼──┐  ┌──▼────────────┐
                    │  Service A   │  │  Service B     │
                    │ (____________)│  │ (____________) │
                    └───────┬──────┘  └──────┬─────────┘
                            │                │
              ┌─────────────▼────────────────▼──────────┐
              │               Data Layer                  │
              │  ┌──────────┐  ┌────────┐  ┌─────────┐  │
              │  │ Primary  │  │ Cache  │  │  Queue  │  │
              │  │   DB     │  │(Redis) │  │(Kafka / │  │
              │  │(Postgres)│  │        │  │ SQS)    │  │
              │  └──────────┘  └────────┘  └─────────┘  │
              └────────────────────────────────────────────┘
                                      │
                       ┌──────────────▼──────────────┐
                       │    Object Storage / CDN      │
                       │   (S3 + CloudFront)          │
                       └─────────────────────────────┘

Component notes:

Component       Role                               Your Choice / Note
--------------  ---------------------------------  ------------------------
Load Balancer   Distribute traffic, health checks  ________________________
API Gateway     Auth, rate limit, routing          ________________________
Service A       ____________________________       ________________________
Service B       ____________________________       ________________________
Primary DB      Source of truth                    ________________________
Cache           Hot reads, session data            ________________________
Message Queue   Async work, decouple producers     ________________________
Object Storage  Blobs, images, backups             ________________________
CDN             Edge delivery, static assets       ________________________

Guiding questions:

  • Where does a synchronous request become expensive enough to go async?
  • What can be precomputed and cached vs. what must be computed fresh?
  • Which services absolutely need to talk to each other vs. which can be decoupled via events?

6. Bottlenecks & Scaling

Start with one DB, one cache, one queue. Then identify the first thing to break under load.

The First Bottleneck

The first thing to break at scale is: ______________________________
Because: ___________________________________________________________

Scaling Playbook (apply in order, don't skip ahead)

Step  Technique                     When to apply                             Trade-off
----  ----------------------------  ----------------------------------------  --------------------------------
1     Add a cache (Redis)           Read RPS hammers the DB                   Cache invalidation complexity
2     Read replicas                 Still DB-bound after caching              Replication lag, eventual reads
3     Horizontal scale app servers  CPU/memory on API tier saturated          Stateless requirement, LB needed
4     Async via queue               Slow writes / fan-out blocking requests   Complexity, at-least-once risk
5     DB sharding / partitioning    Single DB hits IOPS ceiling               Cross-shard queries are painful
6     Dedicated read services       Read patterns diverge wildly from writes  Data duplication, sync overhead

Your specific plan:

First scaling move: ________________________________________________
Trigger: ___________________________ (metric or event)
Expected outcome: __________________________________________________

Mini-example (URL Shortener):

  • First bottleneck: DB reads on SELECT long_url FROM links WHERE short_code = ? at ~116 RPS.
  • Fix: Redis cache keyed by short_code, TTL 24 h. Expected cache hit rate ~95 %.
  • Result: DB read load drops to ~6 RPS (116 × 0.05). No sharding needed until 100× growth.
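
The caching fix in this mini-example follows the cache-aside pattern: check the cache, fall back to the DB on a miss, then populate the cache. The sketch below uses an in-memory dictionary as a stand-in for Redis. `TTLCache`, `resolve`, and the `db_lookup` callable are hypothetical names, and the injectable clock exists only to make expiry testable.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (value, self._clock() + self._ttl)


def resolve(
    short_code: str, cache: TTLCache, db_lookup: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Cache-aside read: try the cache, then the DB, then fill the cache."""
    long_url = cache.get(short_code)
    if long_url is not None:
        return long_url
    long_url = db_lookup(short_code)
    if long_url is not None:
        cache.set(short_code, long_url)
    return long_url


fake_db = {"abc1234": "https://example.com"}
cache = TTLCache(ttl_seconds=24 * 3600)  # TTL 24 h, as in the example
print(resolve("abc1234", cache, fake_db.get))  # miss, served from the DB
print(resolve("abc1234", cache, fake_db.get))  # hit, served from the cache
```

With a 95 % hit rate, only 116 × (1 − 0.95) ≈ 5.8 lookups per second reach the DB.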


7. Failure Modes

For every component, ask: what happens when this dies?

Component         Failure Mode          Detection             Recovery / Degradation Strategy
----------------  --------------------  --------------------  -------------------------------
________________  ____________________  ____________________  _______________________________
________________  ____________________  ____________________  _______________________________
________________  ____________________  ____________________  _______________________________
________________  ____________________  ____________________  _______________________________
________________  ____________________  ____________________  _______________________________

Idempotency & Retries

Which operations are idempotent by design?  ________________________
Where do you need an idempotency key?       ________________________
Retry policy for async jobs:                ________________________
  - Max retries:  ______
  - Backoff:      exponential / linear / fixed — ______
  - Dead-letter queue: yes / no
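
As one way to fill in the retry policy above, the sketch below combines a retry cap, exponential backoff, and a dead-letter list. All names (`backoff_delays`, `run_with_retries`, `dead_letters`) and the default delays are assumptions for illustration. Production policies usually add random jitter, which is left out here to keep the behavior deterministic.

```python
import time
from typing import Any, Callable, List, Optional


def backoff_delays(
    max_retries: int, base_seconds: float = 1.0, factor: float = 2.0, cap_seconds: float = 60.0
) -> List[float]:
    """Exponential delays (1 s, 2 s, 4 s, ...) capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * factor**attempt) for attempt in range(max_retries)]


def run_with_retries(
    job: Callable[[Any], Any],
    payload: Any,
    max_retries: int,
    dead_letters: List[Any],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Run job(payload); retry on failure, then park the payload in dead_letters."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            return job(payload)
        except Exception:  # a sketch; real code would catch narrower errors
            if attempt == max_retries:
                dead_letters.append(payload)  # give up: send to the dead-letter queue
                return None
            sleep(delays[attempt])
    return None


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Because a retried job may run more than once, the job itself should be idempotent or carry an idempotency key.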

Graceful Degradation

If the cache is down:        ________________________________________
If the queue is down:        ________________________________________
If a downstream service is down: ___________________________________
Minimum viable experience:   ________________________________________

Mini-example (URL Shortener):

  • DB down → 503; no silent fallback (stale redirects are a security risk).
  • Cache down → pass-through to DB; latency spikes but system stays up.
  • Click analytics queue down → drop click events silently, log metric; core redirect unaffected.
  • Idempotency: POST /api/shorten with the same long_url by the same user → return the existing record (upsert on long_url + user_id).
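
The first three degradation rules above can be sketched as a single redirect handler. Everything here is hypothetical: the handler name, the injected callables, the use of `ConnectionError` to signal an outage, and the 404 for unknown codes. Note that the DB is only consulted on a cache miss, so this sketch returns 503 only when a miss coincides with a DB outage. Whether cached entries should still be served during a DB outage is a policy decision to make explicitly.

```python
from typing import Callable, Dict, Optional, Tuple


def handle_redirect(
    short_code: str,
    cache_get: Callable[[str], Optional[str]],
    db_lookup: Callable[[str], Optional[str]],
    enqueue_click: Callable[[str], None],
    metrics: Dict[str, int],
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target) while degrading gracefully."""
    long_url = None
    try:
        long_url = cache_get(short_code)
    except ConnectionError:
        # Cache down: pass through to the DB and accept the latency hit.
        metrics["cache_errors"] = metrics.get("cache_errors", 0) + 1

    if long_url is None:
        try:
            long_url = db_lookup(short_code)
        except ConnectionError:
            return 503, None  # DB down: fail loudly, no silent fallback

    if long_url is None:
        return 404, None  # unknown code (an assumption; not in the template)

    try:
        enqueue_click(short_code)
    except ConnectionError:
        # Queue down: drop the click, record a metric, keep redirecting.
        metrics["dropped_clicks"] = metrics.get("dropped_clicks", 0) + 1

    return 301, long_url


def cache_down(code: str) -> Optional[str]:
    raise ConnectionError("cache unreachable")


def queue_down(code: str) -> None:
    raise ConnectionError("queue unreachable")


links = {"abc1234": "https://example.com"}
metrics: Dict[str, int] = {}
print(handle_redirect("abc1234", cache_down, links.get, queue_down, metrics))
print(metrics)
```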

8. Trade-offs

Every design decision has a cost. Make it explicit.

Decision Made                   What You Gained                 What You Gave Up
------------------------------  ------------------------------  ------------------------------
______________________________  ______________________________  ______________________________
______________________________  ______________________________  ______________________________
______________________________  ______________________________  ______________________________
______________________________  ______________________________  ______________________________

Common trade-off axes to address:

  • Consistency vs. Availability (CAP theorem — which did you lean toward and why?)
  • Latency vs. Throughput (did you optimize for speed per request or total throughput?)
  • Simplicity vs. Flexibility (did you build for the current scale or the hoped-for scale?)
  • Operational cost vs. Engineering cost (managed service vs. self-hosted?)

The biggest trade-off in this design: _____________________________
We accepted it because: ___________________________________________
The assumption that would invalidate this choice: __________________

Mini-example (URL Shortener):

  • Chose eventual consistency for click counts (Redis incr + async flush) → gained write speed, gave up exact real-time count.
  • Chose CHAR(7) Base62 random code over sequential IDs → gained unpredictability (anti-enumeration), gave up insert locality in the index.
  • Chose a single Postgres instance to start → gained simplicity, gave up horizontal write scaling. Revisit at 50 M links.
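
The random Base62 choice above can be sketched with Python's `secrets` module, which draws from a cryptographically strong source and so supports the anti-enumeration goal. The function names and the retry-on-collision loop are assumptions for illustration. In a real system the UNIQUE constraint on `short_code` would be the final arbiter, with a retry on constraint violation.

```python
import secrets
import string
from typing import Set

BASE62_ALPHABET = string.digits + string.ascii_letters  # 10 + 52 = 62 symbols
CODE_LENGTH = 7
KEYSPACE = len(BASE62_ALPHABET) ** CODE_LENGTH  # 62**7 = 3,521,614,606,208


def random_short_code(length: int = CODE_LENGTH) -> str:
    """Pick each character independently and uniformly from the alphabet."""
    return "".join(secrets.choice(BASE62_ALPHABET) for _ in range(length))


def new_unique_code(taken: Set[str], max_attempts: int = 5) -> str:
    """Generate a code not already in `taken`, retrying on rare collisions."""
    for _ in range(max_attempts):
        code = random_short_code()
        if code not in taken:
            taken.add(code)
            return code
    raise RuntimeError("could not find a free short code")


taken: Set[str] = set()
print(new_unique_code(taken))
print(f"keyspace: {KEYSPACE:,} possible codes")
```

At 100 M stored links the keyspace is about 0.003 % full, so collisions are rare and a single retry almost always succeeds.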

Filled Sample: URL Shortener at a Glance

Section             Answer
------------------  ------------------------------------------------------------------------------------
Core function       Take a long URL, return a 7-char short code; redirect on GET
Scale               10 M redirects/day (~116 RPS average), ~10 writes/sec, 50 GB data total
First bottleneck    DB reads, fixed with a Redis cache (TTL 24 h, ~95 % hit rate)
Key API             POST /api/shorten, GET /:short_code (redirect), GET /api/links/:short_code (stats)
Primary table       links (short_code UNIQUE, long_url, user_id, click_count)
Consistency choice  Eventual for analytics; strong for redirect resolution
Biggest trade-off   Random short codes over sequential: unpredictable but index-unfriendly at extreme scale
Failure strategy    Cache down → DB pass-through; click queue failure → silent drop (redirect still works)

Template designed for TOVCDI — free course at hyvecares.org. Copy it, fill it in, and build something real.

Build It Right, Or Don't Build It At All. 🏛️
