Skip to main content
Data & Databases
🗄️ Data & DatabasesLesson 1 of 13

Data Modeling: Think Before You Store

Think before you store: model entities and relationships so your data makes sense.

Data Modeling: Think Before You Store

Stage 3 · Data & Databases · B.U.I.L.D. letter: U

You built the front end by vibing — you asked AI to scaffold components and it worked. Databases are different. Once real data lives in a bad schema, every feature you add will fight it. This lesson teaches you to think first, so you never have to rewrite a live production database at 2 a.m.


⚠️ The vibe trap

When you say "set up a database for my blog," the AI will pick a schema without knowing your business rules. It might give you a single posts table and call it done — which works for a toy, but breaks the moment you add authors, tags, or drafts. The schema you accept in the first five minutes will shape every query you write for years. A bad model does not just cause bugs; it causes structural debt that compounds with every new feature.


🗺️ Section 1 — Entities, Attributes, and Relationships

Mental model: A database is a filing system for facts about the real world. An entity is a thing you need to remember (a user, a product, an order). An attribute is a fact about that thing (a user's email, a product's price). A relationship is a connection between things (a user places an order).

How to find your entities: Read your feature list and circle every noun. Those nouns are candidate entities. Verbs between nouns are candidate relationships.

Feature: "Users can write blog posts. Posts can have multiple tags.
          Readers can leave comments on posts."

Nouns  → User, Post, Tag, Comment
Verbs  → writes (User→Post), has (Post→Tag), leaves (Reader→Comment on Post)

Entities:
  User
  Post
  Tag
  Comment

Relationships:
  User  ──(writes)──►  Post        (one user, many posts)
  Post  ──(has)──►     Tag         (many posts, many tags — junction needed)
  Post  ──(receives)──► Comment    (one post, many comments)
  User  ──(writes)──►  Comment     (one user, many comments)

Why this matters: If you skip the noun exercise and just start typing CREATE TABLE, you will forget the Tag concept entirely, bolt it on later as a comma-separated string inside posts.tags, and spend three months trying to query it.

Common mistake: Treating verbs as attributes. "A post has an author name" sounds like an attribute — posts.author_name VARCHAR — but "author" is really an entity (User) you want to query, filter, and link to a profile. Store the relationship (a foreign key), not a copy of the string.


🔑 Section 2 — Primary Keys: Natural vs. Surrogate

Mental model: Every row in every table needs a unique, stable identifier — a primary key (PK). You have two flavors:

TypeWhat it isExample
Natural keyA real-world value that is inherently uniqueemail, isbn, username
Surrogate keyA made-up ID the database generates for youuuid, bigint auto-increment
-- Natural key approach (email as PK)
CREATE TABLE users (
  email       TEXT        PRIMARY KEY,
  display_name TEXT       NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Surrogate key approach (UUID as PK)
CREATE TABLE users (
  id          UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
  email       TEXT        NOT NULL UNIQUE,
  display_name TEXT       NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

Why surrogate keys almost always win:

  • Emails change. Usernames change. ISBNs get re-issued. A surrogate key never changes, so every foreign key that points at a user row stays valid forever.
  • Exposing a sequential integer PK in a URL (/users/42) leaks your user count to competitors. UUIDs are opaque.
  • Natural keys that seem unique often are not (two people can share a phone number on a family plan; two products can share an EAN if a supplier reuses codes).

When natural keys are fine: Look-up / reference tables where the code is the identity. A countries table keyed on iso_code CHAR(2) is perfectly stable. A currencies table keyed on code CHAR(3) will never change.

Common mistake: Using email as a PK in users and then linking it as a foreign key in five other tables. When a user changes their email, you need a CASCADE UPDATE across every one of those tables — and most ORMs do not handle that gracefully.


📐 Section 3 — ER Thinking Without a Fancy Tool

You do not need Lucidchart or draw.io. A simple list captures the same information:

ENTITY: User
  id          UUID   PK
  email       TEXT   UNIQUE NOT NULL
  display_name TEXT  NOT NULL
  created_at  TIMESTAMPTZ

ENTITY: Post
  id          UUID   PK
  author_id   UUID   FK → User.id
  title       TEXT   NOT NULL
  body        TEXT   NOT NULL
  status      TEXT   CHECK IN ('draft','published','archived')
  published_at TIMESTAMPTZ
  created_at  TIMESTAMPTZ

ENTITY: Tag
  id          UUID   PK
  name        TEXT   UNIQUE NOT NULL

JUNCTION: post_tags   (Post ↔ Tag, many-to-many)
  post_id     UUID   FK → Post.id
  tag_id      UUID   FK → Tag.id
  PRIMARY KEY (post_id, tag_id)

ENTITY: Comment
  id          UUID   PK
  post_id     UUID   FK → Post.id
  author_id   UUID   FK → User.id
  body        TEXT   NOT NULL
  created_at  TIMESTAMPTZ

This notation — entity blocks with PK/FK/UNIQUE annotations and a short description of the cardinality — is everything you need before writing a single line of SQL. You can paste it into a PR description, a Notion doc, or a comment at the top of your migration file.

Why this matters: Writing the ER sketch first forces you to notice the many-to-many between Post and Tag. If you had jumped straight to SQL, you would have written posts.tags TEXT[] and regretted it the moment a stakeholder asked "show me all posts with the tag performance."

Common mistake: Skipping the junction table. A TEXT[] column for tags looks clever until you need to rename a tag across 10,000 posts, or count how many posts use each tag, or let users follow a tag. A proper junction table makes all of those queries trivial joins.


🏗️ Section 4 — The Worked Example: From ER Sketch to CREATE TABLE

Here is the full blog domain translated into Postgres DDL. Read it as a story, not just syntax.

-- Users come first because everything references them.
CREATE TABLE users (
  id           UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
  email        TEXT        NOT NULL UNIQUE,
  display_name TEXT        NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Posts belong to a user (author_id).
-- status is constrained to three known values — the DB enforces your business rule.
CREATE TABLE posts (
  id           UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
  author_id    UUID        NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  title        TEXT        NOT NULL,
  body         TEXT        NOT NULL DEFAULT '',
  status       TEXT        NOT NULL DEFAULT 'draft'
                           CHECK (status IN ('draft', 'published', 'archived')),
  published_at TIMESTAMPTZ,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Tags are independent — they exist whether or not any post uses them.
CREATE TABLE tags (
  id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL UNIQUE
);

-- Junction table resolves the many-to-many.
-- The composite PK prevents duplicate (post, tag) pairs at the DB level.
CREATE TABLE post_tags (
  post_id UUID NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
  tag_id  UUID NOT NULL REFERENCES tags(id)  ON DELETE CASCADE,
  PRIMARY KEY (post_id, tag_id)
);

-- Comments belong to both a post and a user.
CREATE TABLE comments (
  id         UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
  post_id    UUID        NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
  author_id  UUID        NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  body       TEXT        NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Notice what the schema does for you automatically:

  • ON DELETE CASCADE on both posts.author_id and comments.author_id means if you delete a user, their posts and comments disappear too. That is a business rule baked into the DB — you do not need application code to enforce it.
  • The CHECK constraint on status means no bug in your app can ever insert status = 'unpublished' by accident.
  • The composite PK on post_tags means you cannot tag the same post with the same tag twice — no DISTINCT queries needed.

Common mistake: Forgetting ON DELETE behavior. If you leave it as the default (RESTRICT), deleting a user will throw a foreign-key violation error because their posts still exist. Decide the cascade rule when you design the table, not when the first delete fails in production.


💥 Section 5 — How a Bad Model Causes Pain Forever

Imagine you shipped this instead:

-- The "quick" schema — six months of pain compressed into eight lines.
CREATE TABLE posts (
  id         SERIAL PRIMARY KEY,
  author_email TEXT,       -- no FK, just a string copy
  title      TEXT,
  body       TEXT,
  tags       TEXT,         -- "javascript,react,performance" comma-separated
  status     TEXT,         -- no CHECK constraint; typos welcome
  created_at TIMESTAMP
);

Pain #1 — Author email changes. You stored author_email as a plain string. When the user changes their email in users, none of the posts update. Now posts show a dead email address. Fix: a full UPDATE posts SET author_email = $new WHERE author_email = $old migration that has to run while users are online.

Pain #2 — Tag queries are impossible without parsing. "Give me all posts tagged performance" becomes:

SELECT * FROM posts WHERE tags LIKE '%performance%';

This is a full table scan every time, it matches "high-performance" accidentally, and it cannot use an index. With a proper junction table it is a two-table join that can be indexed perfectly.

Pain #3 — Status typos reach production.

INSERT INTO posts (status) VALUES ('pubished'); -- silent success

No constraint caught it. Every dashboard that counts published posts is now wrong.

Pain #4 — SERIAL leaks row counts. /posts/1, /posts/2, /posts/3 — your competitors know you have three posts.

These are not hypothetical. They are the exact bugs that appear in every real codebase that skipped the modeling step. The time cost of fixing a live schema with millions of rows is measured in days, not hours.


🛠️ Your Mission

Pick your own app — the one you have been building, or one you want to build next.

  1. Write a plain-English feature list. Three to five features is enough.
  2. Circle every noun. List them as candidate entities.
  3. Write an ER sketch in the notation from Section 3 — entity blocks with PK/FK/UNIQUE, and cardinality labels on relationships.
  4. Translate it to CREATE TABLE statements — use surrogate UUID PKs, NOT NULL where appropriate, CHECK constraints for status-style columns, and explicit ON DELETE rules on every FK.
  5. Sanity-check your many-to-many relationships. Every one needs a junction table. If you have none, look harder — you probably have at least one hiding.

✅ You're done when…

  • Your ER sketch lists every entity, its attributes, and the cardinality of each relationship (matching the format in the Data Modeling Cheat Sheet from Section 3)
  • Every table has a surrogate UUID primary key and a created_at timestamp
  • Every foreign key has an explicit ON DELETE rule and you can explain why you chose CASCADE vs. SET NULL vs. RESTRICT for each one
  • Any many-to-many relationship has its own junction table with a composite primary key — no TEXT[] or comma-separated strings
  • You have at least one CHECK constraint enforcing a business rule that would otherwise rely on application code alone

➡️ Next: Relational Databases & SQL. Build It Right, Or Don't Build It At All. 🏛️

Always-on rigor toolkit

🏛️ Build It Right, Or Don't Build It At All.