📐 Architecture & System Design · Lesson 13 of 13

Capstone: Design a Complete System

Design a complete system on paper: requirements, architecture, scaling, and failure modes.

Stage 3 · Architecture & System Design · Capstone

The best engineers don't start with code — they start with a diagram on a napkin. Every system you've admired, every app that handles ten million users without blinking, began as a conversation about boxes and arrows before a single file was created. You now know all the vocabulary. Time to use it.


🎯 The mission

Produce a complete, defensible system design document for a real product — one you've been meaning to build, or one of the suggestions below. This is a design deliverable: a thorough written spec and architecture that proves you understand what you're building, why each piece exists, and exactly where it will break under pressure.

You are not required to ship code (though the stretch goals reward it). You are required to make every decision deliberately and be able to defend each one to a sceptical teammate. If you can design it clearly, you can build it confidently — or hand it off to an AI pair-programmer and get excellent output back.

Suggested systems (pick one, or bring your own):

| Option | Why it's instructive |
|---|---|
| Link-in-bio app (think Linktree clone) | Simple CRUD at the surface; interesting at scale with redirects, analytics, caching |
| Small marketplace (buyers + sellers + listings) | Two-sided data model, payments, search, trust/safety edge cases |
| Real-time collaborative to-do list | WebSockets, conflict resolution, presence, offline-first complications |
| URL shortener with analytics | Classic interview design; deceptively deep at 100M clicks/day |
| Your own side project | Best option: you already know the domain |

🧱 What to produce

Follow the System Design Template structure below. Every section must be completed — not just filled with words, but filled with decisions.

1. Requirements

Functional requirements — what the system does:

  • List the core user actions (e.g., "A visitor clicks a short link and is redirected within 200 ms").
  • Limit yourself to the 5–8 things the product must do for the MVP.

Non-functional requirements — the quality bar:

  • Target availability (99.9%? 99.99%? justify the number)
  • Consistency model (is it okay to show a stale link for 30 seconds?)
  • Latency SLOs for critical paths
  • Compliance/privacy constraints if any
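
One way to justify an availability target is to translate it into a downtime budget. The following is a minimal sketch; the function name `downtime_budget` is illustrative and not part of any template.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float,
                    period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the maximum downtime allowed over `period` for a target such as 99.9."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget(target)} of downtime per year")
```

For a 365-day year, 99.9% allows roughly 8.8 hours of downtime and 99.99% allows roughly 53 minutes. Seeing the gap in hours makes it easier to argue which target the product actually needs.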

2. Scale estimate (back-of-envelope)

Work through rough numbers before you draw a single box. Example sketch:

URL shortener: scale estimate
-------------------------------
Writes:    100 new short links / day           →  ~0.001 writes/s avg; far below 1 write/s even at peak
Reads:     10 M redirects / day                →  ~120 reads/s avg, ~1,200 reads/s peak (assuming peak ≈ 10× avg)
Storage:   1 short link ≈ 500 bytes            →  ~18 MB / year of link rows; fits one Postgres instance
Cache:     top 10 % of links = 90 % of traffic →  hot set is a few MB; a small Redis instance covers nearly all reads
Bandwidth: avg redirect payload ≈ 1 KB         →  ~10 GB / day outbound

Write your own version for your chosen system. These numbers directly determine whether you need a monolith or services, one DB or several, a queue or nothing.
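
A back-of-envelope estimate is easy to script, which makes the arithmetic checkable and lets you vary the inputs. The sketch below uses the inputs from the URL-shortener example (100 new links/day, 10 M redirects/day, about 500 bytes per link, about 1 KB per redirect). The 10× peak factor and the function names are assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


def estimate(links_per_day: float, redirects_per_day: float,
             bytes_per_link: float, bytes_per_redirect: float,
             peak_factor: float = 10.0) -> dict:
    """Rough scale numbers; peak_factor is an assumed peak-to-average ratio."""
    reads_avg = per_second(redirects_per_day)
    return {
        "writes_per_s_avg": per_second(links_per_day),
        "reads_per_s_avg": reads_avg,
        "reads_per_s_peak": reads_avg * peak_factor,
        "link_storage_mb_per_year": links_per_day * 365 * bytes_per_link / 1e6,
        "egress_gb_per_day": redirects_per_day * bytes_per_redirect / 1e9,
    }


if __name__ == "__main__":
    for name, value in estimate(100, 10_000_000, 500, 1_000).items():
        print(f"{name:>26}: {value:,.3f}")
```

Running the numbers this way is a useful cross-check on any hand-written estimate: small slips in unit conversion are the most common source of wrong architecture decisions at this stage.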

3. API design

Define your core endpoints before you think about internals. Use REST, GraphQL, or tRPC — your choice, but justify it.

POST   /api/links            - create a short link (auth required)
GET    /:code                - redirect; 301 for stable links, 302 for A/B-tested links
GET    /api/links/:id        - fetch link metadata (owner only)
PATCH  /api/links/:id        - update destination or expiry
DELETE /api/links/:id        - soft-delete
GET    /api/links/:id/stats  - click counts by day/country (asynchronously aggregated)

For each endpoint note: who can call it, what it returns on success, and the most likely error cases.
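
To make "success and likely error cases" concrete, here is a hedged sketch of the redirect endpoint as a pure function. The `Link` and `Response` types, the `stable` flag, and the choice of 404 for unknown codes and 410 for expired links are assumptions for illustration, not requirements of the template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class Link:
    code: str
    destination: str
    expires_at: Optional[datetime] = None
    stable: bool = True  # hypothetical flag: stable links get 301, others 302


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)


def handle_redirect(code: str, links: Mapping[str, Link],
                    now: Optional[datetime] = None) -> Response:
    """Resolve a short code: 301/302 on success, 404 if unknown, 410 if expired."""
    now = now or datetime.now(timezone.utc)
    link = links.get(code)
    if link is None:
        return Response(404)
    if link.expires_at is not None and link.expires_at <= now:
        return Response(410)
    return Response(301 if link.stable else 302, {"Location": link.destination})
```

Writing the handler as a function of its inputs keeps the contract (caller, success response, error cases) visible in one place, independent of whichever web framework you eventually choose.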

4. Data model

Sketch your tables or collections. For each entity name the fields, types, and any indices you'll need:

links
  id           uuid  PK
  code         varchar(12)  UNIQUE INDEX  ← the hot lookup
  owner_id     uuid  FK → users
  destination  text
  expires_at   timestamptz  NULLABLE
  created_at   timestamptz

clicks (append-only, high volume)
  id           uuid  PK
  link_id      uuid  FK → links  INDEX
  clicked_at   timestamptz
  country_code char(2)
  referrer     text  NULLABLE

Identify which tables are read-heavy vs write-heavy — they get different treatment (indices, replicas, partitioning).
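
A quick way to sanity-check a data model is to create it in a throwaway database. The sketch below adapts the two tables above to SQLite through Python's standard `sqlite3` module. SQLite has no native `uuid` or `timestamptz` types, so both are stored as text here, and the `users` foreign key is omitted because that table is not defined above. The helper names are illustrative.

```python
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE links (
    id          TEXT PRIMARY KEY,      -- uuid stored as text in SQLite
    code        TEXT NOT NULL UNIQUE,  -- UNIQUE creates the index for the hot lookup
    owner_id    TEXT NOT NULL,
    destination TEXT NOT NULL,
    expires_at  TEXT,                  -- nullable ISO-8601 timestamp
    created_at  TEXT NOT NULL
);
CREATE TABLE clicks (                  -- append-only, high volume
    id           TEXT PRIMARY KEY,
    link_id      TEXT NOT NULL REFERENCES links(id),
    clicked_at   TEXT NOT NULL,
    country_code TEXT,
    referrer     TEXT
);
CREATE INDEX clicks_link_id_idx ON clicks(link_id);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def create_link(conn: sqlite3.Connection, code: str, owner_id: str, destination: str) -> str:
    """Insert a link row and return its generated id."""
    link_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO links (id, code, owner_id, destination, created_at) VALUES (?, ?, ?, ?, ?)",
        (link_id, code, owner_id, destination, datetime.now(timezone.utc).isoformat()),
    )
    return link_id


def lookup(conn: sqlite3.Connection, code: str) -> Optional[str]:
    """The hot read path: resolve a short code to its destination."""
    row = conn.execute("SELECT destination FROM links WHERE code = ?", (code,)).fetchone()
    return row[0] if row else None
```

The unique constraint on `code` does double duty: it provides the index for the read-heavy lookup and rejects duplicate codes at write time.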

5. High-level architecture diagram

Draw this with boxes and arrows. Text art is fine; a tool like Excalidraw or draw.io is better. Include every major component:

          ┌─────────────────────────────┐
          │         CDN / Edge          │
          │  (cache 301s at the edge)   │
          └──────────────┬──────────────┘
                         │
          ┌──────────────▼──────────────┐
          │        API Server(s)        │
          │  Node / Go / whatever fits  │
          └────┬─────────┬─────────┬────┘
               │         │         │
┌──────────────▼──────┐  │  ┌──────▼──────────────┐
│  Primary DB         │  │  │  Cache (Redis)      │
│  (Postgres)         │  │  │  link code → dest   │
│  Replica for reads  │  │  │  TTL: 5 min         │
└─────────────────────┘  │  └─────────────────────┘
                         │
              ┌──────────▼──────────┐
              │   Analytics Queue   │
              │    (Kafka / SQS)    │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │  Analytics Worker   │
              │  (batched inserts   │
              │   into clicks tbl)  │
              └─────────────────────┘

You don't need every component on day one. Annotate which parts are MVP and which are "phase 2."
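
The read path through the cache and database boxes is usually implemented as cache-aside: check the cache, fall back to the database on a miss, then populate the cache. Below is a minimal sketch with an in-memory stand-in for Redis. `TTLCache`, `resolve`, and the injected `db_lookup` callable are hypothetical names, and the 300-second default mirrors the 5-minute TTL in the diagram.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis: key -> (value, expiry time)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (value, self._clock() + self._ttl)


def resolve(code: str, cache: TTLCache,
            db_lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Cache-aside read: try the cache, fall back to the DB, then fill the cache."""
    destination = cache.get(code)
    if destination is not None:
        return destination
    destination = db_lookup(code)
    if destination is not None:
        cache.set(code, destination)
    return destination
```

Injecting the clock and the database lookup keeps the sketch testable without a running Redis or Postgres. Note that unknown codes are not cached here, so repeated lookups of a missing code always reach the database.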

6. Bottleneck analysis and scaling plan

For each component in your diagram, answer: what breaks first, and how do you fix it?

| Component | Bottleneck at scale | How you'd scale it |
|---|---|---|
| API server | CPU-bound at high RPS | Horizontal scaling behind a load balancer |
| Primary DB | Write throughput | Read replicas offload reads but not writes; scale the primary vertically, then shard by owner_id if needed |
| Cache | Memory limits | Increase node size; LRU eviction keeps hot keys |
| Click ingestion | DB write storm | Queue all writes; batch-insert every 10 s |
| CDN | Almost never the bottleneck | Add more edge PoPs if you expand regions |
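
The "queue all writes; batch-insert every 10 s" strategy for click ingestion can be sketched as a small buffer that flushes on size or age. `ClickBatcher` and its parameters are hypothetical names. In this simplified version the age check runs only when a new event arrives, so a real worker would also flush on a timer.

```python
import time
from typing import Any, Callable, List


class ClickBatcher:
    """Buffer click events and hand them to `flush_fn` in batches."""

    def __init__(self, flush_fn: Callable[[List[Any]], None],
                 max_batch: int = 500, max_age_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush_fn = flush_fn
        self._max_batch = max_batch
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._oldest = 0.0  # arrival time of the first event in the buffer

    def add(self, event: Any) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(event)
        too_big = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send buffered events as one batch; a no-op when the buffer is empty."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._flush_fn(batch)
```

In a real worker, `flush_fn` would perform one multi-row insert into the clicks table, turning many small writes into a few large ones.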

7. Failure modes and resilience

Name at least four failure scenarios and describe exactly how your design handles each:

  • Cache miss storm — what happens if Redis goes down? (Circuit breaker to DB; add back-pressure to avoid thundering herd.)
  • DB primary failover — replication lag + promotion time; can your app tolerate stale reads for 30 s?
  • Queue consumer crash — are click events at risk of being lost? (At-least-once delivery + idempotency key on each click record prevents double-counting.)
  • Duplicate write from retry — what if the client retries a link-creation POST twice? (Idempotency key in the request header; server deduplicates on (owner_id, idempotency_key).)
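
The retry scenario above can be sketched as a service that deduplicates on `(owner_id, idempotency_key)`. This is an in-memory illustration with hypothetical names (`LinkService`, `create_link`). A production version would typically enforce the same pair with a unique constraint in the database so that concurrent retries are also safe.

```python
import uuid
from typing import Dict, Tuple


class LinkService:
    """Create links at most once per (owner_id, idempotency_key)."""

    def __init__(self) -> None:
        self._links: Dict[str, dict] = {}            # link id -> record
        self._seen: Dict[Tuple[str, str], str] = {}  # dedup key -> link id

    def create_link(self, owner_id: str, idempotency_key: str,
                    destination: str) -> Tuple[dict, bool]:
        """Return (record, created). A retry returns the original record, created=False."""
        dedup_key = (owner_id, idempotency_key)
        existing_id = self._seen.get(dedup_key)
        if existing_id is not None:
            return self._links[existing_id], False
        link_id = str(uuid.uuid4())
        record = {"id": link_id, "owner_id": owner_id, "destination": destination}
        self._links[link_id] = record
        self._seen[dedup_key] = link_id
        return record, True
```

Scoping the key by owner means two different users can reuse the same key value without colliding, while a retried request from the same user gets the original link back instead of a duplicate.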

Describe your degraded-mode behaviour: if analytics fail, does the redirect still work? It should. Separate the critical path from the non-critical path.
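
Separating the critical path from the non-critical path can be as simple as refusing to let an analytics failure propagate. In this sketch, `resolve` and `enqueue_click` are injected callables standing in for the lookup and the queue producer; both names are assumptions for illustration.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def redirect_with_analytics(code: str,
                            resolve: Callable[[str], Optional[str]],
                            enqueue_click: Callable[[dict], None]) -> Optional[str]:
    """Resolve the destination (critical), then record the click (best effort)."""
    destination = resolve(code)
    if destination is None:
        return None
    try:
        enqueue_click({"code": code})
    except Exception:
        # Analytics is non-critical: log the failure and keep serving the redirect.
        logger.warning("click enqueue failed for %s; redirect continues", code)
    return destination
```

The trade-off is explicit: when the queue is down, clicks are lost rather than redirects failing. If losing clicks is unacceptable for your product, the design needs a local buffer or a retry path, and that belongs in the failure-modes section.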

8. Monolith vs services decision

State your choice and justify it with reference to your scale estimate and team size:

"At this scale (about 100 new links/day, ~1,200 reads/s at peak, team of 2) a single deployable monolith with well-separated modules is correct. The analytics pipeline is the only piece that benefits from independence, so it becomes an async worker process that reads from the queue, not a separate microservice with its own API. We will split further only when a module needs to deploy independently or scale independently."

9. Trade-off register

Every design is a set of bets. Write them down:

| Decision | Alternative considered | Why you chose this one |
|---|---|---|
| 301 vs 302 redirects | 302 (no browser caching) | 301 lets browsers and the CDN cache the redirect indefinitely; cheaper at scale; acceptable because destinations rarely change. The cost: cached redirects never reach the server, so repeat clicks are undercounted |
| Redis for caching | Memcached | Redis gives persistence, pub/sub, and richer data types at minimal extra cost |
| Postgres for analytics | ClickHouse / BigQuery | Volume doesn't justify operational cost of a separate OLAP store until 1B+ rows |
| Async click ingestion | Synchronous DB write per click | Sync write would bottleneck the hot redirect path; async + queue is a standard pattern |

🗺️ Run it through B.U.I.L.D.

You've been using B.U.I.L.D. to structure every lesson. Here's how it maps to this deliverable:

  • B — Break it down: What are the distinct sub-systems? List them before you draw anything.
  • U — Understand requirements: This step is the heart of system design. Spend more time here than anywhere else. Ambiguous requirements produce architectures that solve the wrong problem at enormous cost.
  • I — Implement: Your implementation is the design document. The diagram is the code at this level of abstraction.
  • L — Level up: Review your design against the Production-Readiness Checklist. What does it tell you is missing?
  • D — Deliver: Present your design as if you're defending it to a team. Could someone else build from it without asking you clarifying questions?

The U and I steps are load-bearing here. An unclear requirement in a system design costs ten times more to fix than an unclear requirement in a single function.


🧪 Deliverables

Submit all of these:

  1. The design document — every section of the System Design Template above, filled in for your chosen product. Prose is fine; tables and diagrams are better.

  2. One architecture diagram — the full "boxes and arrows" view of your system. Annotate what's in scope for MVP. Use any tool you like: Excalidraw, draw.io, Mermaid, or plain ASCII art.

  3. A trade-offs section — minimum four rows in the trade-off register. Name the road you didn't take and explain why.

  4. "MVP vs later" plan — a short two-column list:

    | Build first (MVP) | Build later |
    |---|---|
    | Link creation + redirect | Analytics dashboard |
    | Basic auth | SSO / OAuth |
    | Single-region Postgres | Read replica + caching |
    | Manual deployment | CI/CD pipeline |

    This is not a to-do list — it is a prioritisation argument. Explain why the MVP column is sufficient to deliver core value.


🏆 Stretch goals

You've already done the hard intellectual work. Push further:

  • Actually build the core — implement the redirect flow and the link creation endpoint. Does the real code match your design? Where does it diverge, and why?
  • Add a sequence diagram for the critical read path: from the browser clicking a short link to the redirect response, showing every component that touches the request (CDN → API → cache → DB → queue).
  • Estimate cost at 10× and 100× scale — what does your architecture cost on AWS/GCP/etc. today, and what changes at 10× traffic? At 100×? Where does the current design stop being cost-effective?
  • Pressure-test with a team member — present your design and ask them to find at least three ways to break it. Update the failure-modes section with anything you missed.

✅ You're done when…

  • The System Design Template is complete — every section (requirements, scale estimate, API, data model, architecture, bottlenecks, failure modes, monolith/services decision, trade-offs) has a substantive answer, not a placeholder
  • You've run your design against the Production-Readiness Checklist and noted which items are in-scope for MVP and which are deferred, with justification
  • Every bottleneck in your architecture is named and matched to a concrete scaling strategy — "we'd add more servers" is not sufficient; "horizontal API autoscaling behind an ALB, triggered at 70% CPU, with session-less request handling" is
  • The design handles failure deliberately: at least four failure modes are named, and each has a specific mitigation — idempotency keys, retry logic, graceful degradation, or circuit breakers, whichever applies

➡️ Next: Stage 3 continues with Testing, Quality & Craft — make the system you designed a joy to build and change.

Build It Right, Or Don't Build It At All. 🏛️
