📐 Architecture · Lesson 9 of 13

Scaling Fundamentals

Vertical, horizontal, load balancing, statelessness — how systems handle growth.

Stage 3 · Architecture & System Design · B.U.I.L.D. letter: D

You vibe-coded something real, shipped it, and people showed up. Now a little voice in the back of your head asks: "What happens if a thousand people show up at once?" That voice is your engineering instincts waking up. This lesson teaches you how systems actually grow — and which decisions you made on day one will either let you grow painlessly or force you to rewrite everything at the worst possible moment.


⚠️ The vibe trap

"We'll scale when we're big" sounds reasonable until you realise that the two hours before your product goes viral are not the right time to re-architect your session management. The traps are subtle: storing user session data in a variable on your server's memory, writing to a local file for uploads, or assuming there will always be exactly one running process. None of those things break when you have ten users. Every single one of them breaks the moment you try to run a second server — and running a second server is the first thing you'll need to do when traffic arrives. The decisions that make scaling possible or impossible are made on day one, invisibly, while you're focused on features.


📦 Vertical vs Horizontal Scaling — and Where Each Breaks

Vertical scaling means giving your existing server more power: more CPU cores, more RAM, faster disk. It's the easiest move because nothing in your code needs to change.

Horizontal scaling means running more copies of your server in parallel, each handling a share of the traffic. It's the move that lets you handle truly large load — but it requires your code to be ready for it.

VERTICAL SCALING
─────────────────────────────────────────
  Before          After
  ┌───────┐       ┌───────────────────┐
  │ 2 CPU │  →    │ 32 CPU  128 GB    │
  │ 8 GB  │       │ fastest box money │
  └───────┘       └───────────────────┘
  Pros: zero code changes, quick to do (usually a resize and a restart)
  Hard ceiling: biggest box costs $10k+/month
  Single point of failure: one box = one crash kills everyone

HORIZONTAL SCALING
─────────────────────────────────────────
  ┌──────┐   ┌──────┐   ┌──────┐
  │ app  │   │ app  │   │ app  │  ← N identical small boxes
  │  #1  │   │  #2  │   │  #3  │
  └──────┘   └──────┘   └──────┘
  Pros: near-infinite ceiling, redundancy built in
  Requires: stateless servers (explained next)

                         Vertical                    Horizontal
  ──────────────────────────────────────────────────────────────────────────────────
  Code changes needed?   No                          Yes (statelessness)
  Cost ceiling           Very high, and hits hard    Gradual, pay as you grow
  Failure mode           One crash = full outage     One crash = N-1 servers keep running
  When to use            First move, buying time     When vertical ceiling is in sight
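
The "N-1 servers keep running" row only helps if the surviving servers can carry the load on their own. A minimal sketch of that arithmetic follows; the request rates are made-up numbers for illustration, and `serversNeeded` and `capacityAfterCrash` are hypothetical helper names, not part of any library.

```javascript
// Hypothetical numbers throughout: measure your own traffic before trusting any of this.

// How many servers to run so the fleet still covers peak load after some of them crash.
function serversNeeded(peakRps, perServerRps, toleratedFailures = 1) {
  if (perServerRps <= 0) throw new RangeError('perServerRps must be positive');
  // Enough servers to carry the peak, plus spares for the failures we want to survive.
  return Math.ceil(peakRps / perServerRps) + toleratedFailures;
}

// Requests per second the fleet can still serve after `crashed` servers go down.
function capacityAfterCrash(serverCount, perServerRps, crashed = 1) {
  return Math.max(0, serverCount - crashed) * perServerRps;
}

const fleet = serversNeeded(900, 400);            // ceil(2.25) + 1 = 4
const remaining = capacityAfterCrash(fleet, 400); // 3 * 400 = 1200
const singleBox = capacityAfterCrash(1, 400);     // one box, one crash: 0

console.log({ fleet, remaining, singleBox });
```

With one big box, a single crash leaves zero capacity; with four small ones sized this way, a crash still leaves more than the assumed 900 requests per second of peak.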

Common mistake: Jumping to Kubernetes and microservices before you have a traffic problem. Vertical scaling is underrated. A $200/month box handles a surprising amount of traffic. Reach for horizontal scaling when you can measure that vertical is no longer enough — not before.


🔗 Statelessness — The Key That Unlocks Horizontal Scaling

This is the single most important concept in this lesson. A stateful server remembers things about a specific user between requests. A stateless server treats every request as if it arrived from a stranger.

Here is the problem: if server #1 stored your login session in its own memory, and the load balancer sends your next request to server #2 — you're logged out. The state lived on a machine, not in a shared place.

BROKEN (stateful servers)
───────────────────────────────────────────────────────
  User A logs in        User A makes request #2
       │                       │
       ▼                       ▼
  ┌─────────┐           ┌─────────┐
  │ Server 1│ ← session │ Server 2│ ← NO session → 401!
  │ in RAM  │           │         │
  └─────────┘           └─────────┘

FIXED (stateless servers + shared session store)
───────────────────────────────────────────────────────
  User A logs in        User A makes request #2
       │                       │
       ▼                       ▼
  ┌─────────┐           ┌─────────┐
  │ Server 1│           │ Server 2│
  └────┬────┘           └────┬────┘
       │  both read/write    │
       ▼                     ▼
  ┌─────────────────────────────┐
  │      Redis / Supabase DB    │  ← session lives here, not on any server
  └─────────────────────────────┘

The fix is to move all server-side state off the individual servers: into a shared external store such as a database or Redis, or into a signed token (a JWT) that the client carries with each request. Once every server can handle any request without knowing which server handled the last one, you can run as many servers as you want.

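// BEFORE: session lives in the server process's memory (breaks at server #2)
import express from 'express';
import session from 'express-session';

const app = express();

app.use(session({
  secret: 'keyboard cat', // placeholder; load real secrets from the environment
  resave: false,
  saveUninitialized: false,
  // store: not set → defaults to MemoryStore ← THE PROBLEM
}));

// AFTER: session lives in Redis, so any server can read it.
// This app.use(session(...)) call replaces the one above.
// The import shape of connect-redis differs between major versions;
// check the README for the version you have installed.
import RedisStore from 'connect-redis';
import { createClient } from 'redis';

const redisClient = createClient({ url: process.env.REDIS_URL });
await redisClient.connect(); // top-level await requires an ES module

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
}));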
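
The failure in the diagram can be reproduced without any infrastructure. The sketch below is a simulation only: plain objects stand in for app servers and a `Map` stands in for Redis. `makeServer` and its session-ID format are hypothetical, invented for this illustration.

```javascript
// Simulation only: objects stand in for servers, a Map stands in for the session store.
function makeServer(name, sessionStore) {
  return {
    name,
    login(userId) {
      // Illustrative ID only; real session IDs must be random and unguessable.
      const sessionId = `${name}-${userId}-${sessionStore.size + 1}`;
      sessionStore.set(sessionId, { userId });
      return sessionId;
    },
    handle(sessionId) {
      const session = sessionStore.get(sessionId);
      return session ? { status: 200, userId: session.userId } : { status: 401 };
    },
  };
}

// BROKEN: each server keeps its own private Map, like the default MemoryStore.
const stateful1 = makeServer('s1', new Map());
const stateful2 = makeServer('s2', new Map());
const sidA = stateful1.login('userA');
const brokenResult = stateful2.handle(sidA); // server 2 has never seen this session

// FIXED: both servers read and write one shared store.
const sharedStore = new Map();
const stateless1 = makeServer('s1', sharedStore);
const stateless2 = makeServer('s2', sharedStore);
const sidB = stateless1.login('userA');
const fixedResult = stateless2.handle(sidB);

console.log(brokenResult.status, fixedResult.status); // 401 200
```

The server code is identical in both halves; the only difference is where the store lives.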

Common mistake: "I'm using JWTs so I'm fine." Mostly true — JWTs are stateless by design. But watch out for anything your server stores in a local variable or file between requests: upload temp files, in-memory queues, rate-limit counters. Each of those is a hidden stateful dependency waiting to bite you.
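
A rate-limit counter shows how quietly this kind of hidden state goes wrong. In the sketch below, `makeLimiter` and `countAllowed` are hypothetical helpers, the limit of 5 is arbitrary, and there is no time window, for brevity. A `Map` stands in for the shared store; against a real shared store you would want an atomic increment (Redis has `INCR`) rather than the read-then-write shown here.

```javascript
// A counter-based limiter; `counters` is any Map-like store.
function makeLimiter(counters, limit) {
  return function allow(clientId) {
    const used = counters.get(clientId) ?? 0;
    if (used >= limit) return false;
    counters.set(clientId, used + 1); // not atomic: fine in one process, racy across many
    return true;
  };
}

// Spread `requests` calls across the servers in turn, as a load balancer might.
function countAllowed(limiters, clientId, requests) {
  let allowed = 0;
  for (let i = 0; i < requests; i++) {
    if (limiters[i % limiters.length](clientId)) allowed++;
  }
  return allowed;
}

const LIMIT = 5;

// Hidden state: three servers, each counting privately.
const perServer = [new Map(), new Map(), new Map()].map((m) => makeLimiter(m, LIMIT));
const leaky = countAllowed(perServer, 'client-1', 30);

// Shared state: three servers, one counter store.
const sharedCounters = new Map();
const sharedLimiters = [1, 2, 3].map(() => makeLimiter(sharedCounters, LIMIT));
const strict = countAllowed(sharedLimiters, 'client-1', 30);

console.log(leaky, strict); // 15 5
```

Nothing crashes in the per-server version. The limit simply becomes three times more generous than intended, which is why this class of bug tends to go unnoticed.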


⚖️ Load Balancers — The Traffic Director

Once you have multiple stateless servers, something has to decide which server gets each incoming request. That's a load balancer. Think of it as a receptionist directing arrivals: depending on the strategy you configure, it rotates through the servers in turn (round-robin), sends each new request to the server with the fewest active connections (least-connections), or maps each client IP to a fixed server (IP-hash).

                   Internet
                      │
                      ▼
            ┌─────────────────┐
            │  Load Balancer  │   ← single entry point for all traffic
            │  (nginx / ALB)  │
            └──┬────────┬─────┘
               │        │        └─── round-robin, least-connections,
               ▼        ▼             or IP-hash routing
        ┌──────────┐ ┌──────────┐
        │  App #1  │ │  App #2  │   ← identical, stateless
        │ (Node)   │ │ (Node)   │
        └────┬─────┘ └────┬─────┘
             │             │
             └──────┬──────┘
                    ▼
          ┌─────────────────┐
          │   Shared Store  │   ← Postgres + Redis
          │   (one source   │
          │    of truth)    │
          └─────────────────┘
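
The three routing strategies named in the diagram can be sketched in a few lines each. This is a toy model, not how nginx or an ALB is implemented: the `servers` array, the `active` counts, and the string hash are all assumptions made for illustration.

```javascript
// Round-robin: rotate through the servers regardless of how busy they are.
function roundRobin(servers) {
  let next = 0;
  return () => servers[next++ % servers.length];
}

// Least-connections: pick the server with the fewest in-flight requests.
function leastConnections(servers) {
  return () => servers.reduce((best, s) => (s.active < best.active ? s : best));
}

// IP-hash: the same client IP maps to the same server while the server list is unchanged.
function ipHash(servers) {
  return (ip) => {
    let hash = 0;
    // Toy string hash, kept in unsigned 32-bit range.
    for (const ch of ip) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
    return servers[hash % servers.length];
  };
}

const servers = [
  { name: 'app1', active: 7 },
  { name: 'app2', active: 2 },
];

const rr = roundRobin(servers);
const lc = leastConnections(servers);
const ih = ipHash(servers);

console.log(rr().name, rr().name, rr().name); // app1 app2 app1
console.log(lc().name);                       // app2
console.log(ih('203.0.113.9') === ih('203.0.113.9')); // true
```

Round-robin is the simplest and works well when requests cost roughly the same. Least-connections adapts when some requests are much slower than others. IP-hash pins clients to servers, which makes it behave much like sticky sessions.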

The load balancer also does health checks: it pings each server every few seconds, and if a server stops responding, it stops sending traffic there until it recovers. That's how horizontal scaling gives you redundancy for free.
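
A minimal sketch of that health-check loop, under stated assumptions: `makeBalancer` is a hypothetical helper, and `probe` stands in for whatever check a real balancer performs (typically an HTTP or TCP request on a timer). Here the checks are triggered by hand so the behaviour is easy to follow.

```javascript
// `probe(server)` returns true when the server is healthy. It is a stand-in for a real check.
function makeBalancer(servers, probe) {
  const healthy = new Set(servers);
  let next = 0;
  return {
    // A real balancer would run this on a timer, every few seconds.
    runHealthChecks() {
      for (const s of servers) {
        if (probe(s)) healthy.add(s);
        else healthy.delete(s);
      }
    },
    // Round-robin over the servers that passed their last check.
    pick() {
      const pool = servers.filter((s) => healthy.has(s));
      if (pool.length === 0) throw new Error('no healthy servers');
      return pool[next++ % pool.length];
    },
  };
}

const up = { app1: true, app2: true };
const lb = makeBalancer(['app1', 'app2'], (name) => up[name]);

up.app2 = false; // app2 crashes
lb.runHealthChecks();
const duringOutage = [lb.pick(), lb.pick(), lb.pick()];

up.app2 = true; // app2 recovers
lb.runHealthChecks();
const afterRecovery = new Set([lb.pick(), lb.pick()]);

console.log(duringOutage, [...afterRecovery].sort());
```

While app2 is down, every request lands on app1; once it passes a check again, it rejoins the rotation without anyone intervening.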

Common mistake: Putting a load balancer in front of a stateful app and reaching for sticky sessions to stop users being logged out. Sticky sessions (always sending the same user to the same server) are a band-aid: the state still lives on one machine, so losing that server loses its sessions, and load can no longer be spread evenly. Fix the statelessness instead.


🗄️ The Database Is Almost Always the First Bottleneck

Here is the uncomfortable truth: you added three stateless app servers, your CPU is happy — and your response times are still terrible. Nine times out of ten the database is the culprit. App servers are stateless and easy to clone. The database is stateful and hard to clone, so it becomes the single choke point that all your shiny parallel app servers funnel into.

The escalation ladder for a relational database looks like this:

SCALE-UP LADDER (do in order, measure between each step)
─────────────────────────────────────────────────────────

Step 1: Add an index
        Most queries that hurt are just missing an index.
        Cost: $0. Time: 5 minutes. Check this FIRST.

Step 2: Add a connection pool (e.g. PgBouncer)
        Each app server opens DB connections. 10 servers × 20 connections
        = 200 connections. Postgres allows only 100 by default
        (max_connections), and every connection is a separate backend
        process, so raising the limit costs memory.
        A pool multiplexes many app connections over fewer DB connections.

Step 3: Add a read replica
        ┌──────────────────────────────────────────┐
        │            Primary DB (writes)           │
        │  ← all INSERT / UPDATE / DELETE go here  │
        └──────────────────────┬───────────────────┘
                               │  replication stream
              ┌────────────────▼────────────────────┐
              │         Read Replica(s)              │
              │  ← SELECT queries go here            │
              └──────────────────────────────────────┘
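        Many apps read far more than they write. Offload reads to a
        replica and your primary DB has a lot more headroom. Replicas
        lag slightly behind the primary, so reads that must see a
        just-written row should still go to the primary.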

Step 4: Caching (see Lesson 11)
        Stop hitting the database at all for data that doesn't change
        often. For read-heavy data this can help more than replicas do.

Step 5: Sharding (much later, much harder)
        Split the data itself across multiple databases by some key
        (e.g. user_id % 4). Complicates every query. Don't go here
        until you've exhausted every other option. Most apps never need it.
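
Steps 2 and 3 of the ladder can be sketched as two small pure functions. Both are simplifications under stated assumptions: `connectionDemand` and `targetFor` are hypothetical names, the limit of 100 is the usual Postgres default for `max_connections`, and the routing rule looks only at the first keyword of the SQL text.

```javascript
// Step 2: how many Postgres connections would the fleet open without a pooler?
function connectionDemand(appServers, poolSizePerServer) {
  return appServers * poolSizePerServer;
}

// Step 3: naive read/write split based on the first SQL keyword.
// Caveats this sketch ignores: SELECT ... FOR UPDATE, CTEs that write, and
// reads that must see a just-written row should all go to the primary.
function targetFor(sql) {
  const verb = sql.trim().split(/\s+/)[0].toUpperCase();
  return verb === 'SELECT' ? 'replica' : 'primary';
}

const POSTGRES_DEFAULT_MAX_CONNECTIONS = 100;
const demand = connectionDemand(10, 20);
const needsPooler = demand > POSTGRES_DEFAULT_MAX_CONNECTIONS;

console.log(demand, needsPooler); // 200 true
console.log(targetFor('SELECT * FROM posts WHERE id = $1'));    // replica
console.log(targetFor('  update users set name = $1'));         // primary
console.log(targetFor('INSERT INTO posts (title) VALUES ($1)')); // primary
```

In a real app the split usually lives in the data-access layer, where the code knows whether a read follows a write in the same request.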

Common mistake: Premature sharding. Engineers read about how Instagram shards its database and plan to do the same in week two of their startup. Sharding introduces enormous complexity: cross-shard joins, rebalancing, distributed transactions. You will almost certainly never need it. Exhaust indexing, pooling, replicas, and caching first.
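
The rebalancing cost is easy to underestimate. A minimal sketch, using the `user_id % 4` scheme from Step 5 and 1,000 made-up sequential user IDs, counts how many users would have to move if you later added a fifth shard. `shardFor` and `countMoved` are hypothetical helper names.

```javascript
// The modulo scheme from Step 5: the shard is the remainder of user_id / shardCount.
function shardFor(userId, shardCount) {
  return userId % shardCount;
}

// How many users land on a different shard when the shard count changes.
function countMoved(userIds, oldCount, newCount) {
  return userIds.filter((id) => shardFor(id, oldCount) !== shardFor(id, newCount)).length;
}

// Hypothetical data: user IDs 0..999.
const userIds = Array.from({ length: 1000 }, (_, i) => i);
const moved = countMoved(userIds, 4, 5);

console.log(`${moved} of ${userIds.length} users change shard`); // 800 of 1000
```

Going from four shards to five relocates 80% of the data under this scheme, which is one reason sharded systems tend to use more elaborate key-to-shard mappings and why the ladder treats this step as a last resort.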


📏 Scale the Bottleneck, Not Everything — and Measure First

Every dollar and every hour you spend scaling something that isn't the bottleneck is wasted. The bottleneck is the one resource whose limit is being hit while everything else has headroom. Scaling anything else is just rearranging deck chairs.

Before you change anything, measure:

  • Response time percentiles (p50, p95, p99) — median is misleading; your slowest 1% of users are real people
  • CPU and memory on app servers — are they actually saturated?
  • DB query time — your ORM or Supabase dashboard will show the slow queries
  • Error rate under load — a spike in 5xx errors tells you where something is falling over

If CPU on your app servers is at 10% but your DB queries are averaging 800ms, adding a second app server will not help. Fix the slow query.
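
To see why the median hides problems, here is a minimal sketch of the percentile bullet using the nearest-rank method. The latency samples are invented for illustration, and `percentile` is a hypothetical helper; monitoring tools may interpolate slightly differently.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical latencies in ms: 90 fast requests, 8 slow ones, 2 very slow ones.
const latenciesMs = [
  ...Array(90).fill(40),
  ...Array(8).fill(300),
  ...Array(2).fill(2000),
];

const p50 = percentile(latenciesMs, 50);
const p95 = percentile(latenciesMs, 95);
const p99 = percentile(latenciesMs, 99);

console.log({ p50, p95, p99 }); // { p50: 40, p95: 300, p99: 2000 }
```

A dashboard showing only the 40 ms median would look healthy here, while one request in fifty takes two full seconds.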

Common mistake: "Let's add a cache for everything." Caching adds complexity, cache-invalidation bugs, and stale-data risk. Cache only the data you have measured is slow to fetch, changes infrequently, and is safe to serve slightly stale. (More on this in Lesson 11.)


🛠️ Your Mission

Open your current project and do this investigation:

  1. Find your stateful dependencies. Search your code for session, MemoryStore, fs.writeFile, any in-memory array or Map that grows with usage. List everything that is stored on the server process rather than in a database or shared cache.

  2. Pick one and make it stateless. If you have an in-memory session store, swap it to a database-backed store. If you have a local-file upload path, note what it would take to point it at object storage instead.

  3. Sketch a load-balanced architecture. Draw (even on paper) the three-tier diagram: load balancer → two app servers → shared DB/Redis. Annotate each arrow with what data flows across it. Can you trace any request end-to-end without it touching server-local state?

  4. Find the likely first bottleneck. Look at your slowest route. Open your database dashboard and find the query behind it. Is there an index on the column you're filtering by? If not, add one and measure the difference.
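
For step 1, a rough text scan can produce a first list of suspects. The sketch below is a heuristic under stated assumptions: the patterns are illustrative guesses that will miss things and flag false positives, and `findStatefulHints` is a hypothetical helper that takes source text as a string (reading files from disk is left out to keep it self-contained).

```javascript
// Illustrative patterns only; tune them for your own codebase.
const STATEFUL_HINTS = [
  { label: 'session middleware (check its store)', pattern: /MemoryStore|session\(/ },
  { label: 'local file write', pattern: /fs\.(writeFile|appendFile|createWriteStream)/ },
  { label: 'module-level Map/Set/array', pattern: /^(const|let|var)\s+\w+\s*=\s*(new (Map|Set)\(|\[\])/ },
];

// Returns one entry per matching line, with a 1-based line number.
function findStatefulHints(source) {
  const hits = [];
  source.split('\n').forEach((text, index) => {
    for (const { label, pattern } of STATEFUL_HINTS) {
      if (pattern.test(text)) hits.push({ line: index + 1, label });
    }
  });
  return hits;
}

// Hypothetical source file contents.
const sample = [
  'const uploads = new Map();',
  'app.use(session({ secret: process.env.SESSION_SECRET }));',
  "await fs.writeFile('/tmp/upload.bin', data);",
  'const total = a + b;',
].join('\n');

const hits = findStatefulHints(sample);
console.log(hits);
```

Treat every hit as a question rather than a verdict: a `session(` call backed by Redis is fine, and a module-level array that never grows with usage is not state in the sense that matters here.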


✅ You're done when…

  • You can name every place your current app stores state on the server process (RAM, local disk, in-process cache) and describe how you would move each one to a shared external store
  • You can draw the load balancer + stateless app servers + shared DB diagram from memory and explain why the load balancer's health-check makes the system more reliable than a single server
  • You have checked your Production-Readiness Checklist (from the Institute's System Design Template) against your app and identified at least one scaling concern it surfaces that you hadn't considered
  • You can explain the difference between a read replica and sharding, and state which one you would reach for first (and why sharding is a last resort)
  • You have run at least one database query through your DB dashboard and confirmed whether the filtered columns are indexed

➡️ Next: Queues & Event-Driven Architecture. Build It Right, Or Don't Build It At All. 🏛️
