Data Modeling: Think Before You Store
Stage 3 · Data & Databases · B.U.I.L.D. letter: U
You built the front end by vibing — you asked AI to scaffold components and it worked. Databases are different. Once real data lives in a bad schema, every feature you add will fight it. This lesson teaches you to think first, so you never have to rewrite a live production database at 2 a.m.
⚠️ The vibe trap
When you say "set up a database for my blog," the AI will pick a schema without knowing your business rules. It might give you a single posts table and call it done — which works for a toy, but breaks the moment you add authors, tags, or drafts. The schema you accept in the first five minutes will shape every query you write for years. A bad model does not just cause bugs; it causes structural debt that compounds with every new feature.
🗺️ Section 1 — Entities, Attributes, and Relationships
Mental model: A database is a filing system for facts about the real world. An entity is a thing you need to remember (a user, a product, an order). An attribute is a fact about that thing (a user's email, a product's price). A relationship is a connection between things (a user places an order).
How to find your entities: Read your feature list and circle every noun. Those nouns are candidate entities. Verbs between nouns are candidate relationships.
Feature: "Users can write blog posts. Posts can have multiple tags.
Readers can leave comments on posts."
Nouns → User, Post, Tag, Comment
Verbs → writes (User→Post), has (Post→Tag), leaves (Reader→Comment on Post)
Entities:
User
Post
Tag
Comment
Relationships:
User ──(writes)──► Post (one user, many posts)
Post ──(has)──► Tag (many posts, many tags — junction needed)
Post ──(receives)──► Comment (one post, many comments)
User ──(writes)──► Comment (one user, many comments)
Why this matters: If you skip the noun exercise and just start typing CREATE TABLE, you will forget the Tag concept entirely, bolt it on later as a comma-separated string inside posts.tags, and spend three months trying to query it.
Common mistake: Treating verbs as attributes. "A post has an author name" sounds like an attribute — posts.author_name VARCHAR — but "author" is really an entity (User) you want to query, filter, and link to a profile. Store the relationship (a foreign key), not a copy of the string.
🔑 Section 2 — Primary Keys: Natural vs. Surrogate
Mental model: Every row in every table needs a unique, stable identifier — a primary key (PK). You have two flavors:
| Type | What it is | Example |
|---|---|---|
| Natural key | A real-world value that is inherently unique | email, isbn, username |
| Surrogate key | A made-up ID the database generates for you | uuid, bigint auto-increment |
-- Natural key approach (email as PK)
CREATE TABLE users (
email TEXT PRIMARY KEY,
display_name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Surrogate key approach (UUID as PK)
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT NOT NULL UNIQUE,
display_name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Why surrogate keys almost always win:
- Emails change. Usernames change. ISBNs get re-issued. A surrogate key never changes, so every foreign key that points at a user row stays valid forever.
- Exposing a sequential integer PK in a URL (
/users/42) leaks your user count to competitors. UUIDs are opaque. - Natural keys that seem unique often are not (two people can share a phone number on a family plan; two products can share an EAN if a supplier reuses codes).
When natural keys are fine: Look-up / reference tables where the code is the identity. A countries table keyed on iso_code CHAR(2) is perfectly stable. A currencies table keyed on code CHAR(3) will never change.
Common mistake: Using email as a PK in users and then linking it as a foreign key in five other tables. When a user changes their email, you need a CASCADE UPDATE across every one of those tables — and most ORMs do not handle that gracefully.
📐 Section 3 — ER Thinking Without a Fancy Tool
You do not need Lucidchart or draw.io. A simple list captures the same information:
ENTITY: User
id UUID PK
email TEXT UNIQUE NOT NULL
display_name TEXT NOT NULL
created_at TIMESTAMPTZ
ENTITY: Post
id UUID PK
author_id UUID FK → User.id
title TEXT NOT NULL
body TEXT NOT NULL
status TEXT CHECK IN ('draft','published','archived')
published_at TIMESTAMPTZ
created_at TIMESTAMPTZ
ENTITY: Tag
id UUID PK
name TEXT UNIQUE NOT NULL
JUNCTION: post_tags (Post ↔ Tag, many-to-many)
post_id UUID FK → Post.id
tag_id UUID FK → Tag.id
PRIMARY KEY (post_id, tag_id)
ENTITY: Comment
id UUID PK
post_id UUID FK → Post.id
author_id UUID FK → User.id
body TEXT NOT NULL
created_at TIMESTAMPTZ
This notation — entity blocks with PK/FK/UNIQUE annotations and a short description of the cardinality — is everything you need before writing a single line of SQL. You can paste it into a PR description, a Notion doc, or a comment at the top of your migration file.
Why this matters: Writing the ER sketch first forces you to notice the many-to-many between Post and Tag. If you had jumped straight to SQL, you would have written posts.tags TEXT[] and regretted it the moment a stakeholder asked "show me all posts with the tag performance."
Common mistake: Skipping the junction table. A TEXT[] column for tags looks clever until you need to rename a tag across 10,000 posts, or count how many posts use each tag, or let users follow a tag. A proper junction table makes all of those queries trivial joins.
🏗️ Section 4 — The Worked Example: From ER Sketch to CREATE TABLE
Here is the full blog domain translated into Postgres DDL. Read it as a story, not just syntax.
-- Users come first because everything references them.
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT NOT NULL UNIQUE,
display_name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Posts belong to a user (author_id).
-- status is constrained to three known values — the DB enforces your business rule.
CREATE TABLE posts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
author_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
title TEXT NOT NULL,
body TEXT NOT NULL DEFAULT '',
status TEXT NOT NULL DEFAULT 'draft'
CHECK (status IN ('draft', 'published', 'archived')),
published_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Tags are independent — they exist whether or not any post uses them.
CREATE TABLE tags (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL UNIQUE
);
-- Junction table resolves the many-to-many.
-- The composite PK prevents duplicate (post, tag) pairs at the DB level.
CREATE TABLE post_tags (
post_id UUID NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
tag_id UUID NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
PRIMARY KEY (post_id, tag_id)
);
-- Comments belong to both a post and a user.
CREATE TABLE comments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
post_id UUID NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
author_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
body TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Notice what the schema does for you automatically:
ON DELETE CASCADEon bothposts.author_idandcomments.author_idmeans if you delete a user, their posts and comments disappear too. That is a business rule baked into the DB — you do not need application code to enforce it.- The
CHECKconstraint onstatusmeans no bug in your app can ever insertstatus = 'unpublished'by accident. - The composite PK on
post_tagsmeans you cannot tag the same post with the same tag twice — noDISTINCTqueries needed.
Common mistake: Forgetting ON DELETE behavior. If you leave it as the default (RESTRICT), deleting a user will throw a foreign-key violation error because their posts still exist. Decide the cascade rule when you design the table, not when the first delete fails in production.
💥 Section 5 — How a Bad Model Causes Pain Forever
Imagine you shipped this instead:
-- The "quick" schema — six months of pain compressed into eight lines.
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
author_email TEXT, -- no FK, just a string copy
title TEXT,
body TEXT,
tags TEXT, -- "javascript,react,performance" comma-separated
status TEXT, -- no CHECK constraint; typos welcome
created_at TIMESTAMP
);
Pain #1 — Author email changes. You stored author_email as a plain string. When the user changes their email in users, none of the posts update. Now posts show a dead email address. Fix: a full UPDATE posts SET author_email = $new WHERE author_email = $old migration that has to run while users are online.
Pain #2 — Tag queries are impossible without parsing. "Give me all posts tagged performance" becomes:
SELECT * FROM posts WHERE tags LIKE '%performance%';
This is a full table scan every time, it matches "high-performance" accidentally, and it cannot use an index. With a proper junction table it is a two-table join that can be indexed perfectly.
Pain #3 — Status typos reach production.
INSERT INTO posts (status) VALUES ('pubished'); -- silent success
No constraint caught it. Every dashboard that counts published posts is now wrong.
Pain #4 — SERIAL leaks row counts. /posts/1, /posts/2, /posts/3 — your competitors know you have three posts.
These are not hypothetical. They are the exact bugs that appear in every real codebase that skipped the modeling step. The time cost of fixing a live schema with millions of rows is measured in days, not hours.
🛠️ Your Mission
Pick your own app — the one you have been building, or one you want to build next.
- Write a plain-English feature list. Three to five features is enough.
- Circle every noun. List them as candidate entities.
- Write an ER sketch in the notation from Section 3 — entity blocks with PK/FK/UNIQUE, and cardinality labels on relationships.
- Translate it to CREATE TABLE statements — use surrogate UUID PKs, NOT NULL where appropriate, CHECK constraints for status-style columns, and explicit ON DELETE rules on every FK.
- Sanity-check your many-to-many relationships. Every one needs a junction table. If you have none, look harder — you probably have at least one hiding.
✅ You're done when…
- Your ER sketch lists every entity, its attributes, and the cardinality of each relationship (matching the format in the Data Modeling Cheat Sheet from Section 3)
- Every table has a surrogate UUID primary key and a
created_attimestamp - Every foreign key has an explicit
ON DELETErule and you can explain why you chose CASCADE vs. SET NULL vs. RESTRICT for each one - Any many-to-many relationship has its own junction table with a composite primary key — no
TEXT[]or comma-separated strings - You have at least one
CHECKconstraint enforcing a business rule that would otherwise rely on application code alone
➡️ Next: Relational Databases & SQL. Build It Right, Or Don't Build It At All. 🏛️