Skip to main content
Data & Databases
🗄️ Data & DatabasesLesson 4 of 13

Normalization & Schema Design

Design schemas that don't rot: avoid duplication, anomalies, and regret.

Normalization & Schema Design

Stage 3 · Data & Databases · B.U.I.L.D. letter: U

You built something real with vibes and instinct. Now one customer changes their email and suddenly half your order history shows the wrong address. That is not a bug in your code — it is a bug in your design. Normalization is the discipline of arranging data so that every fact lives in exactly one place, and changing it once changes it everywhere.


⚠️ The vibe trap

The first schema most people build looks like a spreadsheet: one big table with every piece of information jammed into every row. Customer name, customer email, product name, product price — all duplicated on every order. This works right up until it doesn't: when the customer moves, you update one row and miss twenty others, and now your data is lying to you. Duplication is not just waste — it is a slow leak of trust in your own database.


🔴 What duplication actually costs you — update anomalies

An update anomaly happens when one real-world fact is stored in multiple rows, so a single change requires touching every one of them or accepting inconsistency.

The broken schema — everything in one table:

-- orders_flat: the vibe-coded first draft
CREATE TABLE orders_flat (
  order_id      SERIAL PRIMARY KEY,
  customer_name VARCHAR(120),
  customer_email VARCHAR(200),
  customer_city  VARCHAR(100),
  product_name   VARCHAR(200),
  product_price  NUMERIC(10, 2),
  quantity       INT,
  ordered_at     TIMESTAMPTZ DEFAULT now()
);

-- What your data actually looks like:
-- order_id | customer_name | customer_email       | customer_city | product_name   | product_price | quantity
-- 1        | Alice Chen    | alice@example.com    | Austin        | Laptop Stand   | 29.99         | 1
-- 2        | Alice Chen    | alice@example.com    | Austin        | USB Hub        | 14.99         | 2
-- 3        | Alice Chen    | alice@example.com    | Austin        | Webcam         | 49.99         | 1

Alice's email appears three times. She moves to Denver — now you must update three rows. Miss one and your data contradicts itself.

Mental model: Think of each row as a sentence. In the flat table, every sentence about an order has to re-tell Alice's whole life story. Good design means each fact gets said once.

Why this matters: Inconsistent data causes real harm — wrong shipping addresses, duplicate marketing emails, revenue reports that don't add up. Teams spend enormous time writing "data cleaning" code that exists only because the schema was never normalized.

Common mistake: Thinking this is fine because "I'll just be careful." You won't. Your future co-worker won't. A bug in a loop won't. Let the schema enforce correctness.


🟡 First Normal Form (1NF) — one value per cell, no repeating groups

1NF rule: Every column holds one atomic (indivisible) value. No lists, no comma-separated strings, no repeating column groups like phone1, phone2, phone3.

Before 1NF — repeating columns:

-- This schema stores up to three phone numbers per customer
-- in separate columns. Adding a fourth means altering the table.
CREATE TABLE customers_bad (
  customer_id SERIAL PRIMARY KEY,
  name        VARCHAR(120),
  phone1      VARCHAR(20),
  phone2      VARCHAR(20),
  phone3      VARCHAR(20)
);

After 1NF — separate table for the repeating group:

CREATE TABLE customers (
  customer_id SERIAL PRIMARY KEY,
  name        VARCHAR(120),
  email       VARCHAR(200) UNIQUE NOT NULL
);

-- Now each phone is its own row. You can have zero or twenty.
CREATE TABLE customer_phones (
  phone_id    SERIAL PRIMARY KEY,
  customer_id INT NOT NULL REFERENCES customers(customer_id),
  phone       VARCHAR(20) NOT NULL,
  label       VARCHAR(40)  -- 'mobile', 'work', 'home'
);

Mental model: Each cell should hold one thing that cannot be split further without losing meaning. A phone number is atomic. A list of phone numbers is not.

Why this matters: Storing "555-1234, 555-5678" in one column means you cannot query by individual phone number, cannot index it, and cannot enforce format with a constraint. You have hidden a table inside a string.

Common mistake: Using JSON columns to dodge 1NF. JSON columns are powerful for genuinely flexible data, but when you store ["555-1234","555-5678"] in JSON just to avoid adding a table, you lose queryability. Use a child table instead.


🟠 Second Normal Form (2NF) — no partial dependencies

2NF rule: Every non-key column must depend on the whole primary key — not just part of it. This only applies when you have a composite primary key (a key made of two or more columns).

Before 2NF — order line items with a partial dependency:

-- Composite PK: (order_id, product_id)
-- But product_name depends only on product_id, not on the full pair.
CREATE TABLE order_lines_bad (
  order_id     INT,
  product_id   INT,
  product_name VARCHAR(200),   -- depends only on product_id ← partial dependency
  product_price NUMERIC(10,2), -- depends only on product_id ← partial dependency
  quantity     INT,
  PRIMARY KEY (order_id, product_id)
);

After 2NF — product data lives in its own table:

CREATE TABLE products (
  product_id   SERIAL PRIMARY KEY,
  product_name VARCHAR(200)    NOT NULL,
  product_price NUMERIC(10,2)  NOT NULL CHECK (product_price >= 0)
);

CREATE TABLE order_lines (
  order_id   INT NOT NULL REFERENCES orders(order_id),
  product_id INT NOT NULL REFERENCES products(product_id),
  quantity   INT NOT NULL CHECK (quantity > 0),
  -- Store the price at time of purchase, not the current price
  unit_price NUMERIC(10,2) NOT NULL,
  PRIMARY KEY (order_id, product_id)
);

Mental model: Ask yourself, "If I knew only part of the key, could I determine this column's value?" If yes — it belongs in a separate table keyed by that part alone.

Why this matters: Without 2NF, updating a product's name means finding every order line that mentions it. With 2NF, you update one row in products and every query automatically reflects the change.

Common mistake: Storing unit_price (the price charged at purchase time) in products instead of order_lines. The price in products should be the current selling price. The price the customer actually paid must live on the order line — those are two different facts.


🟢 Third Normal Form (3NF) — no transitive dependencies

3NF rule: Every non-key column must depend on the primary key directly, not through another non-key column.

Before 3NF — a transitive chain:

-- In orders_bad, zip_code determines city and state.
-- city and state depend on zip_code, not on order_id.
-- That is a transitive dependency: order_id → zip_code → city/state
CREATE TABLE orders_bad (
  order_id   SERIAL PRIMARY KEY,
  customer_id INT,
  zip_code   CHAR(5),
  city       VARCHAR(100),   -- determined by zip_code, not order_id
  state      CHAR(2),        -- determined by zip_code, not order_id
  ordered_at TIMESTAMPTZ
);

After 3NF — extract the dependent fact:

-- Zip codes live in their own lookup table.
CREATE TABLE zip_codes (
  zip_code CHAR(5) PRIMARY KEY,
  city     VARCHAR(100) NOT NULL,
  state    CHAR(2)      NOT NULL
);

CREATE TABLE orders (
  order_id    SERIAL PRIMARY KEY,
  customer_id INT         NOT NULL REFERENCES customers(customer_id),
  zip_code    CHAR(5)     REFERENCES zip_codes(zip_code),
  ordered_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

Mental model: Draw arrows. If you can say "column A → column B → column C" where A is the PK and B is not the PK, then B → C is a transitive dependency that belongs in a new table.

Why this matters: The USPS changes a zip code's city assignment (it happens during boundary redrawings). With the transitive dependency in orders, you have thousands of rows to update. With 3NF, you update one row in zip_codes.

Common mistake: Over-normalizing lookup data you will never update. If your app stores order_status as a VARCHAR and the possible values are truly fixed ('pending', 'shipped', 'delivered'), a separate order_statuses table adds complexity with no benefit. Use a PostgreSQL ENUM type instead:

CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered', 'cancelled');

ALTER TABLE orders ADD COLUMN status order_status NOT NULL DEFAULT 'pending';

🔵 The full before/after — flat table → normalized schema

Before: one denormalized table (everything in one place):

CREATE TABLE orders_flat (
  order_id       SERIAL PRIMARY KEY,
  customer_name  VARCHAR(120),
  customer_email VARCHAR(200),
  customer_city  VARCHAR(100),
  product_name   VARCHAR(200),
  product_price  NUMERIC(10,2),
  quantity       INT,
  ordered_at     TIMESTAMPTZ DEFAULT now()
);

After: normalized into three tables with foreign keys:

-- 1. Customers: each customer stored once
CREATE TABLE customers (
  customer_id SERIAL PRIMARY KEY,
  full_name   VARCHAR(120)  NOT NULL,
  email       VARCHAR(200)  NOT NULL UNIQUE,
  city        VARCHAR(100)
);

-- 2. Products: each product stored once
CREATE TABLE products (
  product_id   SERIAL PRIMARY KEY,
  product_name VARCHAR(200)   NOT NULL,
  base_price   NUMERIC(10,2)  NOT NULL CHECK (base_price >= 0)
);

-- 3. Orders: one row per order, referencing the customer
CREATE TABLE orders (
  order_id    SERIAL PRIMARY KEY,
  customer_id INT         NOT NULL REFERENCES customers(customer_id) ON DELETE RESTRICT,
  ordered_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- 4. Order lines: one row per product per order, capturing price at purchase time
CREATE TABLE order_lines (
  order_line_id SERIAL PRIMARY KEY,
  order_id      INT          NOT NULL REFERENCES orders(order_id) ON DELETE CASCADE,
  product_id    INT          NOT NULL REFERENCES products(product_id) ON DELETE RESTRICT,
  quantity      INT          NOT NULL CHECK (quantity > 0),
  unit_price    NUMERIC(10,2) NOT NULL CHECK (unit_price >= 0),
  UNIQUE (order_id, product_id)
);

Querying the normalized schema — it reads just as clearly:

-- All orders for Alice, with product names and totals
SELECT
  o.order_id,
  o.ordered_at,
  p.product_name,
  ol.quantity,
  ol.unit_price,
  (ol.quantity * ol.unit_price) AS line_total
FROM orders o
JOIN customers   c  ON c.customer_id = o.customer_id
JOIN order_lines ol ON ol.order_id   = o.order_id
JOIN products    p  ON p.product_id  = ol.product_id
WHERE c.email = 'alice@example.com'
ORDER BY o.ordered_at DESC;

🟣 When to denormalize on purpose

Normalization is the default. Denormalization is a deliberate trade: you accept controlled redundancy to gain read performance or simplicity — and you accept the cost of keeping the redundancy correct.

Legitimate reasons to denormalize:

  • A reporting table that is rebuilt nightly from normalized data (data warehouses do this constantly).
  • A total_amount column cached on orders so a dashboard query does not need to sum all line items on every page load.
  • A full_address text column for shipping labels where the address must be frozen at the time of purchase even if the customer moves.
-- Caching an order total: fast to read, must be kept in sync
ALTER TABLE orders ADD COLUMN cached_total NUMERIC(10,2);

-- Update it via a trigger or application logic after every order_lines change
UPDATE orders o
SET cached_total = (
  SELECT COALESCE(SUM(quantity * unit_price), 0)
  FROM order_lines
  WHERE order_id = o.order_id
)
WHERE o.order_id = 42;

Mental model: Denormalization is a documented exception, not a starting point. Write a comment in your migration explaining why the redundancy exists and what is responsible for keeping it consistent.

Why this matters: Every denormalized column is a promise you make to every developer who touches that table. If you cannot describe exactly how and when the redundant value gets updated, the column will eventually lie.

Common mistake: Premature denormalization. Engineers denormalize a column "for performance" before measuring whether the normalized query is actually slow. Profile first. A single index often provides the same speedup with no redundancy.


🛠️ Your mission

Open your own project's schema (or the orders_flat table above if you do not have one yet). Find a duplication problem — a column that repeats the same fact across multiple rows — and normalize it.

  1. Identify which columns have data that does not depend on the primary key alone.
  2. Extract those columns into a new table with its own primary key.
  3. Replace the original columns with a foreign key column pointing to the new table.
  4. Write the JOIN query that reconstructs the original view of the data.

If you want a challenge: add an ON DELETE clause to your foreign keys (RESTRICT, CASCADE, or SET NULL) and be ready to explain why you chose each one.


✅ You're done when…

  • You can name all three normal forms and describe each one in one plain sentence, without looking at the Data Modeling Cheat Sheet from Lesson 1.
  • Your schema has no column whose value can be determined from a non-key column (no transitive dependencies).
  • Every table-to-table relationship is enforced by a foreign key constraint, not just application code.
  • You have written at least one JOIN query that reads from your normalized tables and produces the same result as the original flat query.
  • If you denormalized anything, you have a comment in the migration explaining what keeps it in sync.

➡️ Next: Indexes & Query Performance. Build It Right, Or Don't Build It At All. 🏛️

Always-on rigor toolkit

🏛️ Build It Right, Or Don't Build It At All.