Retrieval-Augmented Generation

A language model's knowledge is frozen at its training cutoff. The weights of any given GPT-4 release encode patterns from text collected before that release's cutoff date, and Claude's weights encode patterns from text collected before its own cutoff. No matter how capable the model is, it cannot tell you what happened last week, what your company's internal policy document says, or what the current price of a stock is, because none of that was in its training data. Retrieval-augmented generation, commonly abbreviated RAG, is the architectural pattern that fills this gap.

The idea is elegantly simple: instead of expecting the model to remember relevant information from training, fetch that information from an external source at query time and inject it directly into the prompt. The model then generates its response by reasoning over the retrieved content, not just its frozen weights. This makes the model's effective knowledge updatable, domain-specific, and far more accurate on specialized or current topics — all without retraining a single parameter.

RAG in One Sentence

RAG retrieves relevant external text at query time and inserts it into the prompt, allowing the model to reason over up-to-date or domain-specific knowledge it could not have learned during training.

The RAG Pipeline Step by Step

A RAG pipeline has two major phases: ingestion and retrieval.

During ingestion, a corpus of documents (company wikis, research papers, product manuals, historical records) is processed offline. Each document is split into smaller chunks, typically 200-500 tokens each, so that retrieved content is focused rather than sprawling. Each chunk is then converted into a dense vector embedding using an embedding model. An embedding is a list of hundreds or thousands of floating-point numbers that captures the semantic meaning of the chunk. Semantically similar chunks cluster near each other in this high-dimensional space. The embeddings are stored in a vector database alongside the original chunk text.

During retrieval, when a user sends a query, the same embedding model converts the query into a vector. The vector database performs a nearest-neighbor search, finding the stored chunks whose embeddings are closest to the query's embedding in vector space. The top-k most similar chunks are retrieved and inserted into the prompt before the model generates a response. The model now has both the user's question and the most relevant context passages, and it synthesizes an answer from both.
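The two phases above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical stand-in that counts words, so it only captures word overlap rather than meaning, and a plain Python list plays the role of the vector database. The sample documents, function names, and prompt wording are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def chunk(text: str, max_words: int = 12) -> list[str]:
    """Split a document into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Ingestion (offline): chunk, embed, and store vector plus original text.
documents = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Standard shipping takes five business days within the country.",
    "Support is open on weekdays from nine to five.",
]
index = [(toy_embed(c), c) for doc in documents for c in chunk(doc)]

# Retrieval (query time): embed the query, rank stored chunks, keep the top k.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Context injection: place the retrieved chunks in the prompt ahead of the question.
def build_prompt(query: str, k: int = 2) -> str:
    context = "\n".join(f"- {c}" for c in retrieve(query, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("How many days does shipping take?"))
```

In a real system, `toy_embed` would be replaced by a call to an embedding model and the list by a vector database; the sequence of steps stays the same.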

Place each RAG pipeline step in order by matching it to its stage description.

Terms

Split source documents into 200-500 token chunks
Convert each chunk to a dense vector using an embedding model
Store vectors and chunk text in a vector database
Embed the user's query and search for the nearest stored vectors
Insert retrieved chunks into the prompt before model generation

Definitions

Ingestion: indexing
Retrieval: similarity search
Ingestion: embedding
Ingestion: chunking
Retrieval: context injection


Embeddings: Meaning as a Point in Space

The concept of an embedding deserves a closer look. An embedding model, such as OpenAI's text-embedding-3-large or Cohere's embed-english-v3.0, is a neural network trained to produce vectors such that semantically similar texts end up near one another in the vector space. The sentence 'the dog chased the ball' will be embedded near 'a puppy ran after a toy' even though the two share no words. 'The stock market fell sharply' will be far from both. This property, semantic proximity reflected in geometric proximity, is what makes RAG work. Keyword search requires the query and the document to share exact words. Embedding-based search finds conceptually related content even when the vocabulary differs entirely. For knowledge-base question answering, this is decisive: users ask questions in their own words, while documents are written by authors using different vocabulary.
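The contrast between keyword overlap and geometric proximity can be made concrete with the three example sentences. In the sketch below, the keyword score is computed for real, but the three-dimensional vectors are hand-picked purely for illustration; an actual embedding model would produce its own vectors with far more dimensions. The function names are also just assumptions for this example.

```python
import math

def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

dog = "the dog chased the ball"
puppy = "a puppy ran after a toy"
stocks = "the stock market fell sharply"

# Illustrative vectors only (made up for this example, not model output).
vectors = {
    dog: [0.9, 0.8, 0.1],
    puppy: [0.8, 0.9, 0.2],
    stocks: [0.1, 0.2, 0.9],
}

# Keyword search sees no link between the dog and puppy sentences,
# and a slight link between the dog and stock sentences (the word "the").
print(keyword_overlap(dog, puppy))   # 0.0
print(keyword_overlap(dog, stocks))  # 0.125

# With vectors placed by meaning, the ranking flips.
print(cosine(vectors[dog], vectors[puppy]) > cosine(vectors[dog], vectors[stocks]))  # True
```

Note that pure keyword overlap ranks the stock-market sentence closer to the dog sentence than the puppy sentence is, simply because both contain "the".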

Chunk Size Tradeoffs

Smaller chunks (100-200 tokens) are more precise — a retrieved chunk is more likely to be entirely relevant to the query. Larger chunks (500-1000 tokens) preserve more context — the retrieved passage makes sense on its own without surrounding text. Most production RAG systems experiment with chunk size empirically for their specific corpus and query distribution.
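Experimenting with chunk size can start as simply as re-running a splitter with different settings and comparing the results. The sketch below assumes whitespace-separated words as a rough proxy for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and `chunk_by_size` is a hypothetical helper name.

```python
def chunk_by_size(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# A synthetic 1,000-word document, used only to compare chunk counts.
document = " ".join(f"word{i}" for i in range(1000))

for size in (200, 300, 500):
    chunks = chunk_by_size(document, size)
    last_len = len(chunks[-1].split())
    print(f"size={size}: {len(chunks)} chunks, last chunk has {last_len} words")
```

Smaller sizes yield more, narrower chunks to search over; larger sizes yield fewer chunks that each carry more surrounding context. Which setting retrieves better passages has to be measured against real queries.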

When RAG Is and Is Not the Right Tool

RAG is exceptionally well suited to question answering over a known corpus: customer support over a product manual, legal research over a case database, medical information retrieval over clinical guidelines. It is the right tool when the answer exists verbatim or near-verbatim in stored documents, and when freshness matters — the corpus can be updated without retraining the model. RAG is a poor fit when the required knowledge is genuinely absent from the corpus — then the model may confidently synthesize an answer from irrelevant retrieved chunks, a failure called hallucination from bad retrieval. RAG also struggles when questions require complex multi-hop reasoning across many documents simultaneously, since the retrieved chunks may not include all the necessary links. And RAG adds meaningful latency: embedding the query, searching the vector store, and injecting results all take time before the model even begins generating.
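The absent-knowledge failure follows from how top-k search works: it returns the k closest chunks even when none of them is actually close. The sketch below illustrates this with a toy word-count similarity standing in for a real embedding model. The `min_score` cutoff is one possible guard, offered as an assumption rather than a standard recipe, and the value 0.1 is arbitrary.

```python
import math
import re
from collections import Counter

CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "Standard shipping takes five business days.",
    "Support is open on weekdays from nine to five.",
]

def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, k: int = 2, min_score: float = 0.0) -> list[tuple[float, str]]:
    """Return up to k (score, chunk) pairs, dropping any below min_score."""
    q = bag_of_words(query)
    scored = sorted(((cosine(q, bag_of_words(c)), c) for c in CHUNKS), reverse=True)
    return [(s, c) for s, c in scored[:k] if s >= min_score]

# Nothing in the corpus is about planets, yet plain top-k still returns two chunks.
print(top_k("Which planets have rings?"))

# With a cutoff, the off-topic query retrieves nothing, so the application
# can say "not found" instead of passing irrelevant context to the model.
print(top_k("Which planets have rings?", min_score=0.1))
print(top_k("How long does shipping take?", min_score=0.1))
```

With a real embedding model, irrelevant chunks rarely score exactly zero, so any cutoff would need to be tuned on real queries.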

A legal tech company builds a RAG system over 50,000 court decisions. A user asks about a ruling from last month. The model confidently gives an answer, but the ruling from last month was issued after the corpus was last updated. What is the most likely failure mode?

Why does RAG produce better results than relying purely on a model's training weights for company-specific internal knowledge?

In a RAG pipeline, documents are first split into ____, then converted into dense ____ using an embedding model, and stored in a ____ database. At query time, the user's query is also embedded and the most ____ chunks are retrieved and injected into the prompt.

Design a RAG System

  1. You are building a RAG-powered study assistant for high school AP Biology students. The corpus is a 1,200-page AP Biology textbook.
  2. Step 1: Describe your chunking strategy. What is your target chunk size and why? Will you split by sentence, paragraph, section heading, or page? What are the tradeoffs of your choice?
  3. Step 2: A student asks: 'How does the sodium-potassium pump create a membrane potential?' Write the ideal retrieved chunk that should appear in the prompt for this query — describe its content in 3-4 sentences.
  4. Step 3: Now consider a failure case. A student asks: 'Will this be on the AP exam?' How does your RAG system respond, and what does this reveal about RAG's limitations?
  5. Step 4: Propose one additional data source you would add to the corpus to make the study assistant more useful. Justify why it improves retrieval quality for AP Biology students.