SQL, Deeper
Stage 3 · Data & Databases · B.U.I.L.D. letter: U
You can vibe-code a frontend that talks to a database. Now learn to ask the database questions so precise that the answer comes back ready to use — no JavaScript loops, no extra fetches, no glue code holding it together with hope.
⚠️ The vibe trap
When you don't know SQL well, you reach for the comfortable move: fetch users in one query, fetch orders in another, then smash them together in JavaScript with a .find() or a for loop. It works — until your users table has 50,000 rows, your orders table has 400,000, and your server is now hauling 450,000 rows across the network just to show a 20-row report. The database was built to do this work in milliseconds, at the storage layer, before a single byte hits the wire. Let it.
🗂️ The Sample Schema
Every example in this lesson uses these two tables. Keep them in mind — they're your playground.
-- users: one row per registered account
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- orders: one row per purchase; every order belongs to a user
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(id),
amount NUMERIC(10,2) NOT NULL,
status TEXT NOT NULL, -- 'pending' | 'paid' | 'refunded'
created_at TIMESTAMPTZ DEFAULT NOW()
);
orders.user_id is the foreign key — the thread that sews the two tables together.
🔗 Section 1 — INNER JOIN: Only the Matching Rows
Mental model
Picture two circles of a Venn diagram. An INNER JOIN gives you only the overlap — rows that exist in both tables and share the same key value. If a user has no orders, they don't appear. If an order somehow has no matching user, it doesn't appear either.
Worked example
"Show me every order alongside the buyer's name and email."
SELECT
u.id AS user_id,
u.name,
u.email,
o.id AS order_id,
o.amount,
o.status,
o.created_at
FROM orders o
INNER JOIN users u ON u.id = o.user_id
ORDER BY o.created_at DESC;
What's happening line by line:
| Clause | Purpose |
|---|---|
FROM orders o | Start from the orders table, alias it o |
INNER JOIN users u | Pull in users, alias u |
ON u.id = o.user_id | The join condition — match where keys are equal |
ORDER BY o.created_at DESC | Newest orders first |
Why this beats two queries
One round-trip to the DB, one result set. The DB uses indexes on both users.id and orders.user_id to do the match at storage speed.
Common mistake
Writing SELECT * across a join gives you every column from every table — including two columns both named id, which silently overwrites one in some drivers. Always alias ambiguous columns: u.id AS user_id, o.id AS order_id.
↔️ Section 2 — LEFT JOIN: Keep Everyone, Even the Lonely Ones
Mental model
A LEFT JOIN says: "Give me every row from the left table (the one in FROM), and for each one, attach the matching right-table row if it exists. If it doesn't exist, fill those columns with NULL."
This is how you answer questions like "which users have never placed an order?" or "show all users with their order count, including zero-order users."
Worked example
"List every user and how many orders they've placed — including users with zero orders."
SELECT
u.id,
u.name,
COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name
ORDER BY order_count DESC;
Key insight on COUNT(o.id) vs COUNT(*):
COUNT(*)counts every row, including NULL-padded ones.COUNT(o.id)counts only rows whereo.idis NOT NULL — i.e., users who actually have orders.- For users with no orders,
o.idis NULL after the LEFT JOIN, soCOUNT(o.id)correctly returns0.
Why you need LEFT (not INNER) here
If you used INNER JOIN, users with no orders would vanish from the results. That's technically correct for "show me buyers" but wrong for "show me all users." Pick your JOIN based on what the question is, not habit.
Common mistake
Forgetting to GROUP BY all non-aggregated SELECT columns. If you select u.id, u.name and use COUNT(), both u.id and u.name must appear in the GROUP BY. Modern Postgres will tell you — other databases will silently return garbage.
📊 Section 3 — Aggregations: COUNT, SUM, AVG + GROUP BY + HAVING
Mental model
Aggregations collapse multiple rows into one number. GROUP BY decides what the "buckets" are — each unique combination of GROUP BY columns becomes one output row, with the aggregate computed over all raw rows in that bucket.
HAVING is like WHERE, but it filters after aggregation. Use WHERE to filter raw rows before grouping; use HAVING to filter the aggregated results.
Worked example
"Revenue report: total paid revenue and average order value per user, only for users whose total paid revenue exceeds $100."
SELECT
u.id,
u.name,
COUNT(o.id) AS paid_order_count,
SUM(o.amount) AS total_revenue,
ROUND(AVG(o.amount), 2) AS avg_order_value
FROM users u
INNER JOIN orders o ON o.user_id = u.id
WHERE o.status = 'paid' -- filter rows BEFORE grouping
GROUP BY u.id, u.name
HAVING SUM(o.amount) > 100 -- filter groups AFTER aggregation
ORDER BY total_revenue DESC;
The execution order the DB actually uses (not the order you write it):
FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
This is why you cannot use a SELECT alias (total_revenue) inside a HAVING clause in strict SQL — HAVING runs before SELECT aliases are resolved. Repeat the expression (SUM(o.amount) > 100) in HAVING, or use a subquery/CTE (next section).
Common mistake
Using HAVING when you mean WHERE. HAVING status = 'paid' forces the DB to aggregate every status first, then throw most rows away. WHERE status = 'paid' filters before aggregation — orders of magnitude faster on large tables.
🧵 Section 4 — Subqueries & CTEs: Breaking Complex Queries Apart
Mental model
A subquery is a SELECT nested inside another SELECT. A CTE (Common Table Expression, written with WITH) is a named subquery you can reference multiple times. CTEs make queries readable — they let you build up a complex result step by step, like naming intermediate variables in code.
Worked example: find users who placed an order in the last 30 days
Subquery style (harder to read at scale):
SELECT id, name, email
FROM users
WHERE id IN (
SELECT DISTINCT user_id
FROM orders
WHERE created_at >= NOW() - INTERVAL '30 days'
);
CTE style (same result, much more readable):
WITH recent_buyers AS (
SELECT DISTINCT user_id
FROM orders
WHERE created_at >= NOW() - INTERVAL '30 days'
)
SELECT u.id, u.name, u.email
FROM users u
INNER JOIN recent_buyers rb ON rb.user_id = u.id
ORDER BY u.name;
Stacking CTEs for a full pipeline
WITH paid_orders AS (
-- Step 1: isolate only paid orders
SELECT user_id, amount
FROM orders
WHERE status = 'paid'
),
user_revenue AS (
-- Step 2: sum revenue per user
SELECT user_id, SUM(amount) AS total
FROM paid_orders
GROUP BY user_id
),
top_spenders AS (
-- Step 3: keep only the top 10
SELECT user_id, total
FROM user_revenue
ORDER BY total DESC
LIMIT 10
)
-- Final: join names back in
SELECT u.name, u.email, ts.total AS lifetime_value
FROM top_spenders ts
INNER JOIN users u ON u.id = ts.user_id
ORDER BY ts.total DESC;
Each CTE is like a named variable. You write, read, and debug one step at a time.
Common mistake
Reaching for CTEs to improve performance — they don't automatically do that in most databases (Postgres inlines them unless you add MATERIALIZED). CTEs are a readability tool. For performance, you still need proper indexes (that's Lesson 5).
📖 Section 5 — Reading a Multi-Table Query Cold
When you encounter a query someone else wrote, or one your AI assistant generated, here is a reliable reading order:
- FROM + JOIN — figure out what tables are in play and how they connect.
- WHERE — understand what rows are being excluded before any math happens.
- GROUP BY — identify the granularity of the output (one row per user? per day? per status?).
- SELECT — now read what columns and aggregates are actually being returned.
- HAVING — see if any groups are filtered out after aggregation.
- ORDER BY / LIMIT — see how results are sorted and capped.
Worked example: decode this query
SELECT
DATE_TRUNC('month', o.created_at) AS month,
COUNT(DISTINCT o.user_id) AS unique_buyers,
SUM(o.amount) AS gmv,
COUNT(o.id) AS order_count
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE o.status = 'paid'
AND u.created_at >= '2025-01-01'
GROUP BY DATE_TRUNC('month', o.created_at)
HAVING COUNT(o.id) >= 5
ORDER BY month ASC;
Walk-through:
- FROM orders + LEFT JOIN users — we're centering on orders; users are joined to let us filter on
u.created_at. - WHERE — only paid orders from accounts created in 2025 or later.
- GROUP BY — one row per calendar month (truncated to the first of the month).
- SELECT — each row tells us: the month, how many distinct buyers ordered, gross merchandise value (GMV), and total order count.
- HAVING — only show months with at least 5 paid orders.
- ORDER BY — oldest month first, so it reads like a timeline.
Result: a month-by-month revenue dashboard, filtered to 2025-cohort users, for months with meaningful volume.
🛠️ Your Mission
You've got a real project. It has users and some kind of content, transaction, or event table connected by a foreign key. Your job is to write two queries against it.
Query 1 — The JOIN Report
Write an INNER JOIN (or LEFT JOIN if you want "all users including inactive ones") that pulls together at least two tables and returns a result you'd actually show in a UI — an order history list, a user-activity feed, a content dashboard. Include at least four columns, with at least one from each table. Add ORDER BY and LIMIT 20.
Query 2 — The GROUP BY Report
Write an aggregation query that answers a business question: "which users are most active?", "what's the daily revenue for the last 30 days?", "how many items per category?". Use at least one aggregate function (COUNT, SUM, or AVG), GROUP BY, and HAVING to filter out noise (e.g., HAVING COUNT(...) > 0).
If you don't have a real project handy, write both queries against the users / orders schema at the top of this lesson. Change the numbers, add a new status, make it yours.
✅ You're done when…
- You can explain the difference between INNER JOIN and LEFT JOIN out loud without reading notes — including what happens to a user with no orders in each case.
- You've written a GROUP BY query that uses both WHERE (to pre-filter rows) and HAVING (to post-filter groups), and you understand why the order matters.
- You've refactored at least one multi-step subquery into a CTE and can confirm the result is identical.
- You can read a 15-line multi-table query cold and describe what it returns before running it.
- Your Mission queries run without errors and return data that makes sense (verify against the Pre-Ship Checklist before marking this done).
➡️ Next: Normalization & Schema Design. Build It Right, Or Don't Build It At All. 🏛️