Skip to main content
Data & Databases
🗄️ Data & DatabasesLesson 11 of 13

Search: When LIKE Isn't Enough

When LIKE isn't enough: real full-text search that finds what users actually mean.

Search: When LIKE Isn't Enough

Stage 3 · Data & Databases · B.U.I.L.D. letter: I

You shipped a search box in an afternoon. Users type "recipe," and the app finds it. You feel good. Then someone types "recipes" and gets nothing. Someone else types "choclate cake" and gets nothing. Your product manager types something into your search box on demo day and the spinner just… spins. Every vibe coder hits this wall. Here is how to smash through it.


⚠️ The vibe trap

You wired up WHERE title LIKE '%search term%' and it worked in development with 40 rows. In production with 400,000 rows, Postgres reads every single row on every single search — no index can help it because the wildcard is on the left side. Worse, LIKE has no concept of relevance (a post titled "chocolate" and one that mentions chocolate in the 500th word look identical to it), no stemming (so "run" and "running" are different strings), and zero tolerance for typos. Your users will assume your app is broken, because functionally, for search, it is.


🔬 Why LIKE Fails at Scale

Understanding the failure mode makes the fix obvious.

-- The classic vibe-coder search query
SELECT id, title, body
FROM posts
WHERE title LIKE '%chocolate%'
   OR body  LIKE '%chocolate%'
ORDER BY created_at DESC
LIMIT 20;

What Postgres actually does: It opens the posts table and reads every row, checking whether the string chocolate appears anywhere inside title or body. There is no index shortcut. The leading % wildcard tells Postgres "the match could start anywhere," so even a B-tree index on title is useless.

Mental model: Imagine a library where books are shelved in arrival order and have no catalog. Finding every book that mentions chocolate means opening every book and scanning every page. That is exactly what LIKE '%term%' does to your database.

Why it matters beyond speed:

  • No ranking — a post where "chocolate" appears 40 times looks the same as one where it appears once.
  • No stemming — LIKE '%run%' matches "runs" and "running" accidentally, but LIKE '%running%' misses "run" entirely.
  • Case sensitivity depends on collation and is inconsistent across databases.
  • No synonym support, no language awareness, no typo tolerance.

Common mistake: Adding a B-tree index on title and expecting it to speed up LIKE '%term%'. It will not. A B-tree index can only accelerate LIKE 'term%' (starts-with), never contains-style searches with a leading wildcard.


🏗️ Postgres Full-Text Search: The Built-In Upgrade

Postgres ships with a complete full-text search engine. You do not need to install anything.

The two core types

TypeWhat it is
tsvectorA preprocessed, normalized list of lexemes (roots of words) stored for a document
tsqueryA search query in the same normalized form, with operators like & (AND), | (OR), ! (NOT)
-- See what tsvector does to a sentence
SELECT to_tsvector('english', 'The quick brown foxes were running fast');
-- Result: 'brown':3 'fast':7 'fox':4 'quick':2 'run':5,6
-- Notice: "foxes" → "fox", "running" → "run", stop words removed

Mental model: tsvector is a pre-built index card for a document. It lists every significant word in its root form and records where that word appears. tsquery is your search query turned into the same root form so they can be compared fairly.

Why this matters: When a user searches "running," Postgres normalizes it to the lexeme "run" and finds documents containing "run," "runs," "runner," and "running" — because they all reduce to the same root. LIKE can never do this.

Common mistake: Forgetting to specify a language. to_tsvector('The fox') without a language argument uses your database default, which may not be what you expect. Always pass 'english' (or the correct locale) explicitly.


⚡ The GIN Index: Making Full-Text Search Fast

The tsvector type alone does not help with speed unless you pair it with a GIN (Generalized Inverted Index). A GIN index is the correct index type for full-text search — it maps every lexeme to the rows that contain it, so Postgres can jump directly to relevant rows instead of scanning everything.

Option A — Index a computed column (simplest)

-- Add a generated tsvector column and index it
ALTER TABLE posts
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

CREATE INDEX idx_posts_search ON posts USING GIN (search_vector);

Now search:

SELECT id, title
FROM posts
WHERE search_vector @@ to_tsquery('english', 'chocolate')
LIMIT 20;

@@ is the match operator — it returns true when a tsvector contains the lexemes in a tsquery.

Mental model: The GIN index is the library catalog you never had. Postgres builds and maintains it automatically. When you search "chocolate," Postgres looks up "chocol" in the catalog and gets back exactly the row IDs that contain it — no full table scan.

Common mistake: Building the index before populating the table with real data is fine, but if you add the column to an existing large table without CONCURRENTLY, it will lock the table. In production, use CREATE INDEX CONCURRENTLY to build the index without downtime.

-- Safe way on a live table
CREATE INDEX CONCURRENTLY idx_posts_search ON posts USING GIN (search_vector);

Multi-word queries

-- AND: both words must appear
SELECT id, title FROM posts
WHERE search_vector @@ to_tsquery('english', 'chocolate & cake');

-- OR: either word
SELECT id, title FROM posts
WHERE search_vector @@ to_tsquery('english', 'chocolate | cocoa');

-- Phrase search with websearch_to_tsquery (friendlier syntax)
SELECT id, title FROM posts
WHERE search_vector @@ websearch_to_tsquery('english', 'chocolate cake recipe');

websearch_to_tsquery accepts natural user input (like a Google search box) and converts it intelligently. Use it when you are passing raw user input.


🎯 Ranking Results

Finding matching rows is only half the job. You also want the most relevant rows first.

SELECT
  id,
  title,
  ts_rank(search_vector, query)       AS rank
FROM posts,
  to_tsquery('english', 'chocolate & cake') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 10;

ts_rank scores each matching document based on how often and how prominently the search terms appear. A higher score means a more relevant result.

Boost the title over the body — because a match in the title is more relevant than one buried in a long article:

SELECT
  id,
  title,
  ts_rank(
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body,  '')), 'C'),
    to_tsquery('english', 'chocolate')
  ) AS rank
FROM posts
WHERE search_vector @@ to_tsquery('english', 'chocolate')
ORDER BY rank DESC
LIMIT 10;

setweight assigns importance labels (A through D). ts_rank then factors those weights into the score. Title matches (A) count more than body matches (C).

Mental model: Think of it as a weighted vote. Every occurrence of the search term in the document casts a vote, but title votes count four times as much as body votes.

Common mistake: Ordering by rank DESC but forgetting LIMIT. Full-text search with ranking can still be slow if you ask Postgres to score and sort millions of matching rows without a row limit.


🚀 When Postgres Is Not Enough: Dedicated Search Engines

Postgres full-text search handles most apps well — up to millions of rows, with solid relevance. But at some point you may need more. Here is an honest map of that decision.

NeedPostgres FTSDedicated engine
Basic keyword search with rankingYesYes
Typo tolerance / fuzzy matchingPartial (via pg_trgm)First-class feature
Faceted search (filter by category + search)WorkableExcellent
Autocomplete / prefix searchAwkwardNative
Synonyms, custom dictionariesPossibleEasier
Billions of documentsStarts to struggleBuilt for this
Real-time indexing at write scaleFineCan be faster
Separate infrastructure to manageNo (it's your DB)Yes — another system

The three main contenders:

  • Elasticsearch — Battle-hardened, runs everywhere, huge feature set. Complex to operate. Good when you need the full power and have ops capacity.
  • Meilisearch — Beautifully simple, typo-tolerant by default, excellent developer experience. Ideal for apps where search is a core UX feature and you want fast setup.
  • Typesense — Open-source, fast, low operational overhead. Strong contender alongside Meilisearch for new projects.

The real trade-off is sync, not features. The moment you add a dedicated search engine, you now have two systems that must agree on your data. Every write to your database must also update the search index — either synchronously (slows writes), asynchronously (eventual consistency lag), or via a background job (complexity). Postgres FTS avoids this entirely because the index lives in the same transaction as your data.

Rule of thumb: Start with Postgres FTS. Add a dedicated engine only when you have a concrete, measured reason — not because you think you might need it someday.

Common mistake: Reaching for Elasticsearch on day one because you've heard it's "the real" search. Running Elasticsearch in production requires memory tuning, cluster management, and index lifecycle policies. For most apps, Postgres FTS with a GIN index is already the real search.


🔤 Fuzzy Matching: Handling Typos Without a Dedicated Engine

Postgres ships with the pg_trgm extension, which enables fuzzy (similarity-based) matching. It works by breaking strings into trigrams (three-character substrings) and measuring overlap.

-- Enable once per database
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Find posts where title is similar to a misspelled word
SELECT id, title, similarity(title, 'choclate') AS sim
FROM posts
WHERE similarity(title, 'choclate') > 0.3
ORDER BY sim DESC
LIMIT 10;

You can also create a GIN index on trigrams to make this fast:

CREATE INDEX idx_posts_title_trgm ON posts USING GIN (title gin_trgm_ops);

Mental model: Trigrams break "chocolate" into cho, hoc, oco, col, ola, lat, ate. "Choclate" shares most of those substrings. A similarity score above 0.3 is a reasonable fuzzy match threshold — adjust to taste.

When to use trigrams vs. full-text search: Use full-text search (tsvector/tsquery) when users search for concepts and words. Use trigrams when users might misspell proper nouns, names, or specific terms. Many production apps combine both — FTS for relevance, trigrams for typo tolerance.

Common mistake: Using LIKE '%term%' and pg_trgm together thinking they complement each other. They don't — they are two separate approaches. Pick one per query, or combine them intentionally with a union or weighted score.


🛠️ Your Mission

You have a search box in your app backed by LIKE '%query%'. Upgrade it to real full-text search.

  1. Open your database and identify the table powering your search box.
  2. Add a tsvector generated column that combines the fields users would expect to search across (title + description, name + bio, subject + body — whatever fits your data).
  3. Create a GIN index on that column.
  4. Rewrite your query to use @@ and websearch_to_tsquery.
  5. Add ts_rank and sort by relevance.
  6. Search for a word, then the plural, then a related word. Confirm all three return sensible results.
  7. Compare EXPLAIN ANALYZE output on the old LIKE query and the new FTS query. See the difference in execution time and rows scanned.

✅ You're done when…

  • The EXPLAIN ANALYZE output shows an Index Scan on your GIN index instead of a Seq Scan (Production-Readiness Checklist: every query touching more than ~1,000 rows should hit an index)
  • Searching the singular form of a word (e.g., "recipe") returns results containing the plural form ("recipes") — confirming stemming is working
  • Results are ordered by ts_rank so the most relevant row appears first, not just the most recent
  • Your query uses websearch_to_tsquery (not to_tsquery) so that raw user input with spaces and punctuation does not throw a syntax error

➡️ Next: Backups, Restores & Data Safety. Build It Right, Or Don't Build It At All. 🏛️

Always-on rigor toolkit

🏛️ Build It Right, Or Don't Build It At All.