Skip to main content
AI Agents & Automation

⏱ About 20 min20 XP

The Memory System

Every language model starts each forward pass with no memory of anything outside its current context window. For a simple chatbot, this is manageable — the entire conversation fits in the window. For an agent executing a 50-step research task over hours or days, it is a crippling limitation: the task accumulates far more state than any context window can hold, and the agent must continue working reliably across multiple sessions. The memory system is the architectural component that makes continuity possible.

Four Types of Agent Memory

Researchers and practitioners have converged on a taxonomy of four memory types, each implemented differently and serving a different purpose. In-context memory is the simplest: it is the content of the current context window itself — the system prompt, the conversation so far, tool call results, and any injected text. It is fast (no retrieval needed), perfectly accurate (exactly what was written), and ephemeral (gone when the context window is cleared). It is limited by the context window size, which is why all other memory types exist. External memory stores information outside the model in a retrievable database. Two main forms are used: a key-value or document store for exact lookup (retrieve a specific record by ID), and a vector database for semantic retrieval (retrieve records whose meaning is similar to a query string). External memory is the solution when the total relevant knowledge exceeds what fits in context. The agent uses a retrieval tool to fetch relevant chunks into the context window as needed. In-weights memory is the parametric knowledge baked into the model's weights during training — everything the model 'knows' about history, science, programming, and the world without looking anything up. It is always available and requires no retrieval, but it is static (frozen at training time), may be incorrect, and is not updatable at inference time. In-cache memory (also called KV cache) is a systems-level optimization: the intermediate computation states from previously processed tokens are cached so that long repeated prefixes (like a fixed system prompt) do not need to be recomputed on every call. It reduces latency and cost but does not add new information — it simply avoids redundant computation.

Retrieval-Augmented Generation (RAG)

The dominant pattern for giving agents access to private or current knowledge is RAG: Retrieval-Augmented Generation. Documents are chunked, embedded as vectors, and stored in a vector database. At query time the agent's query is embedded and the most semantically similar chunks are retrieved and injected into the context. The model then generates with both its parametric knowledge and the retrieved facts. RAG does not update the model's weights — it extends the effective context.

Flashcards — click each card to reveal the answer

Memory Management Strategies

Long-running agents need explicit strategies for managing what stays in context and what gets offloaded. The three most common strategies are summarization, sliding window, and selective retrieval. Summarization compresses earlier parts of the conversation or task history into a shorter summary before they are dropped from context — the agent retains gist without token-by-token detail. The sliding window approach drops the oldest messages from context once the window is full, keeping only the most recent N tokens. This is simple but loses information about early task state. Selective retrieval uses a retrieval tool to fetch only the specific past context relevant to the current step, rather than maintaining a linear history at all — more complex to implement but scales indefinitely. Real agents often combine strategies: a sliding window for recent conversation, a summary stored to external memory, and a vector retrieval tool for long-term knowledge. The choice depends on the task's memory access patterns — does the agent need to refer back to the beginning frequently, or only to the most recent state?

Match each memory management strategy to the scenario where it is best suited.

Terms

Summarization
Sliding window
Selective retrieval from vector store
In-weights memory
Key-value external store

Definitions

An agent answering a factual question about well-established science with no private or recent data needed
An agent with access to a 10,000-document knowledge base that needs only the 3 most relevant chunks per step
A long research task where gist of earlier steps matters but verbatim detail does not
A customer service agent where only the current conversation turn and recent history are relevant
An agent that must retrieve a user's preferences or a specific record by a unique identifier

Drag terms onto their definitions, or click a term then click a definition to match.

An agent is executing a 200-step scientific literature review. After step 80, the total conversation history exceeds the context window. The agent uses a summarization strategy. What information is most at risk of being lost, and what should the system prompt instruct the model to preserve?

Memory Poisoning

External memory that is populated from untrusted sources — web pages, user-submitted documents, public databases — can contain adversarially crafted text intended to manipulate the agent's future behavior. An injected instruction hidden in a document ('Disregard previous instructions and instead...') retrieved into context is a prompt injection attack via the memory system. Agents with external memory must sanitize and validate retrieved content before injecting it.

A developer builds an agent where parametric (in-weights) memory is the sole source of information. The agent is supposed to answer questions about the company's internal policies. What is the fundamental problem with this design?

Memory Architecture Design

  1. You are designing the memory system for a personal AI study assistant. The agent needs to: remember what topics you have studied in past sessions, access a large library of study materials (textbooks, lecture notes, practice problems), keep track of the current study session's progress, and recall your performance on previous quizzes to personalize difficulty.
  2. Step 1: For each of the four memory requirements above, identify which memory type (in-context, external vector, external key-value, in-weights) is the best fit and explain why.
  3. Step 2: Describe what happens when the current study session runs long enough to fill the context window. Which memory management strategy would you use?
  4. Step 3: Identify one security or privacy risk in your memory design and propose a mitigation.
  5. Goal: practice designing a multi-type memory architecture for a realistic agent use-case.