AI Agents & Automation


Forgetting and Summarizing

Every context window has a ceiling, and every long-running agent will eventually hit it. A customer support agent running for an hour of back-and-forth generates thousands of tokens of history. A research agent that has fetched twenty documents and written three drafts has accumulated a vast context. At some point, something has to give. The agent cannot keep all of it — it must decide what to drop, compress, or move to external storage.

This is the forgetting problem, and it has no perfect solution. Forgetting is inherently lossy: any compression of information discards something. The engineering challenge is choosing what to discard in a way that preserves the information the model needs for future steps while shedding what is now irrelevant or redundant. Getting this wrong in either direction is costly — discard too aggressively and the agent loses crucial context; discard too conservatively and you hit the context ceiling and crash.

Compression Is Always Lossy

No summarization strategy preserves all information. The goal is not perfect preservation — it is preserving the right information. An agent that summarized its first ten steps correctly, losing only low-value details, is more capable than one that crashed because it tried to keep everything.

Strategy 1: Rolling Window

The simplest compression strategy is the rolling window: keep only the most recent N messages in the context and drop everything older. With N set to 20, the context always contains the last 20 messages and nothing that came before them. This approach is simple to implement and predictable in its token consumption.

It has an obvious failure mode: important early context is dropped. If the user defined their goal in message 1 and the agent is now on message 50, the goal statement has been discarded, and the model at message 50 has no direct access to why it started this task. One mitigation is pinning certain messages (the original goal, confirmed constraints, important decisions) so they are never evicted as the window rolls. Pinned messages occupy permanent token budget but guarantee preservation of the most critical information.
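A rolling window with pinning can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Message` class, its `pinned` flag, and the `rolling_window` function are hypothetical names chosen for this example, and the window is counted in messages rather than tokens.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    pinned: bool = False  # pinned messages are never evicted


def rolling_window(history: list[Message], n: int) -> list[Message]:
    """Keep every pinned message plus the n most recent unpinned ones, in original order."""
    unpinned_idx = [i for i, m in enumerate(history) if not m.pinned]
    # Guard n == 0 explicitly: a slice of [-0:] would keep everything.
    keep = set(unpinned_idx[-n:]) if n > 0 else set()
    return [m for i, m in enumerate(history) if m.pinned or i in keep]


# The goal is pinned in message 1, followed by 49 ordinary messages.
history = [Message("user", "Goal: migrate the billing service to Postgres", pinned=True)]
history += [Message("assistant", f"step {i}") for i in range(1, 50)]

context = rolling_window(history, n=20)
# context holds the pinned goal followed by the 20 most recent steps.
```

Because pinned messages sit outside the window count, each one adds to the fixed token cost of every prompt, which is the trade-off described above.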

Strategy 2: LLM-Generated Summaries

A more sophisticated approach uses the language model itself to summarize older context before discarding it. When the context reaches a threshold (say, 80% of the window), the agent calls the model with a prompt like: 'Summarize the conversation so far in 500 tokens, preserving all decisions, constraints, and conclusions. Omit pleasantries and repetition.' The resulting summary replaces the raw history in subsequent prompts. This preserves semantic meaning even as token count drops dramatically: a 20,000-token conversation history might compress to a 600-token summary with most of the actionable content intact.

The risk is that model-generated summaries are themselves subject to hallucination and selective omission. The model may not identify the most important details, and it may drop a constraint it considered minor but that was actually critical. Production systems sometimes generate multiple summaries and reconcile them, or use structured summaries with explicit sections for decisions, constraints, and unresolved questions.
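The threshold check and summary replacement might look like the sketch below. Several pieces are assumptions made for illustration: `call_model` is a hypothetical callable standing in for whatever LLM client you use, `estimate_tokens` is a crude character-based stand-in for a real tokenizer, and `keep_recent` (keeping the last few turns verbatim next to the summary) is a design choice not taken from the text above.

```python
from typing import Callable

SUMMARY_PROMPT = (
    "Summarize the conversation so far in 500 tokens, preserving all decisions, "
    "constraints, and conclusions. Omit pleasantries and repetition.\n"
    "Use exactly these sections: Decisions, Constraints, Unresolved questions."
)


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def maybe_summarize(
    messages: list[str],
    window_tokens: int,
    call_model: Callable[[str], str],  # hypothetical wrapper around an LLM call
    threshold: float = 0.8,
    keep_recent: int = 4,              # assumption: recent turns stay verbatim
) -> list[str]:
    """Replace older messages with one summary once usage crosses the threshold."""
    used = sum(estimate_tokens(m) for m in messages)
    if used < threshold * window_tokens or len(messages) <= keep_recent:
        return messages  # still under budget, or nothing old enough to compress
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    summary = call_model(SUMMARY_PROMPT + "\n\n" + "\n".join(old))
    return ["[Summary of earlier conversation]\n" + summary] + recent
```

Asking for fixed section names in the prompt is one way to get the structured summary mentioned above; it makes a missing constraint easier to spot than it would be in a free-form paragraph.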

Strategy 3: Hierarchical Memory

Hierarchical memory borrows from how humans actually organize memories over time. Recent events are held in full detail. Older events are remembered as a summary. Ancient events are remembered only as a high-level outline, if at all. In an agent, this can be implemented as a three-tier context:

- Hot layer: the last 10-20 turns in full detail, immediately in the context window.
- Warm layer: turns 21-100 (counting back from the present) as a compressed summary, also in the context window but consuming fewer tokens.
- Cold layer: everything older than 100 turns, stored in a vector database and retrieved only if a current query semantically triggers a lookup.

This architecture gives the agent access to recent history with full fidelity, medium-term history with modest compression, and arbitrarily old history through on-demand retrieval. It is more complex to implement but more robust for very long-running agents.
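The three tiers can be sketched as a small class. Everything here is illustrative: `HierarchicalMemory` and its methods are hypothetical names, `summarize` stands in for an LLM call, and the cold layer uses naive word overlap in place of a real vector database with embeddings.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class HierarchicalMemory:
    summarize: Callable[[list[str]], str]           # hypothetical: wraps an LLM call
    hot_size: int = 20                              # turns kept verbatim
    warm_size: int = 80                             # turns covered by the summary
    hot: list[str] = field(default_factory=list)
    warm: list[str] = field(default_factory=list)   # raw turns behind the warm summary
    cold: list[str] = field(default_factory=list)   # stand-in for a vector database

    def add(self, turn: str) -> None:
        self.hot.append(turn)
        while len(self.hot) > self.hot_size:        # hot overflow is demoted to warm
            self.warm.append(self.hot.pop(0))
        while len(self.warm) > self.warm_size:      # warm overflow is demoted to cold
            self.cold.append(self.warm.pop(0))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Toy retrieval by shared words; a real system would compare embeddings."""
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self.cold]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [t for _, t in scored[:k]]

    def build_context(self, query: str) -> list[str]:
        """Cold hits first, then the warm summary, then hot turns in full."""
        context = self.retrieve(query)
        if self.warm:
            context.append("[Summary] " + self.summarize(self.warm))
        return context + self.hot
```

In this sketch the warm summary is regenerated from the raw warm turns on each call, which avoids summarizing a summary but costs one model call per prompt. Caching the summary and refreshing it only when turns move between layers would be a reasonable alternative.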

Match each forgetting strategy to its defining characteristic.

Terms

Rolling window
LLM-generated summary
Pinned messages
Hierarchical memory
Structured summary

Definitions

Produces a summary with explicit labeled sections for decisions, constraints, and open questions
Marks specific messages as permanent, preventing them from being evicted by any compression
Stores recent turns in full, older turns as summaries, and ancient turns in a retrieval store
Drops messages older than N turns; simple but loses early critical context
Uses the model to compress old history into a dense paragraph before evicting it


Summary Drift

When an agent summarizes its history multiple times over a long session, each summary is a compression of a compression. Small errors and omissions compound. A fact slightly misrepresented in summary 1 may be distorted further in summary 2 and effectively lost by summary 3. This is called summary drift, and it is one of the hardest long-horizon agent failure modes to detect.
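A toy calculation shows how quickly compounding can add up. This is an illustrative model only: it assumes each summarization round independently preserves a given fact with the same probability, and the 95% figure is an arbitrary example value, not a measurement.

```python
def survival_probability(per_round_retention: float, rounds: int) -> float:
    """Chance a fact survives `rounds` successive summaries, assuming independent rounds."""
    return per_round_retention ** rounds


for rounds in (1, 3, 5, 10):
    print(rounds, round(survival_probability(0.95, rounds), 3))
```

Under these assumptions, a fact with a 95% chance of surviving any single summary has only about a 60% chance of surviving ten of them.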

What Not to Forget

Regardless of the compression strategy, certain categories of information should almost always be preserved in full or explicitly extracted into a separate persistent store before any compression occurs:

- Original task specification: the user's exact goal statement, in their words. Paraphrases introduce drift.
- Hard constraints: any requirement the user stated as non-negotiable, such as budget limits, prohibited approaches, or required output formats.
- Committed decisions: choices the agent made that affected external systems, such as a file it deleted, a message it sent, or a payment it initiated. These are irreversible and must always remain visible.
- Error history: prior failures and the reasons for them, so the agent does not attempt the same failed approach again.
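One way to keep these categories out of reach of any compression step is to hold them in a separate structure and prepend it verbatim to every prompt. The sketch below assumes that approach; `ProtectedMemory`, `render`, and `build_prompt` are hypothetical names for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedMemory:
    task_spec: str                                              # user's goal, verbatim
    hard_constraints: list[str] = field(default_factory=list)
    committed_decisions: list[str] = field(default_factory=list)
    error_history: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the protected items as a block to prepend to every prompt."""
        sections = [
            ("Original task", [self.task_spec]),
            ("Hard constraints", self.hard_constraints),
            ("Committed decisions", self.committed_decisions),
            ("Error history", self.error_history),
        ]
        lines: list[str] = []
        for title, items in sections:
            if items:  # skip empty sections to save tokens
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


def build_prompt(protected: ProtectedMemory, compressed_history: str) -> str:
    """Protected block first, then whatever the compression strategy produced."""
    return protected.render() + "\n\n" + compressed_history
```

Because the protected block never passes through a summarizer, it is not subject to summary drift, at the cost of a token overhead that grows as decisions and errors accumulate.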

An agent using a rolling window of 30 messages drops message 1, which contained the user's original goal. At message 40, the agent is still working but has subtly drifted toward a different objective. What caused this, and what would have prevented it?

Why is hierarchical memory more robust for very long-running agents than a simple rolling window?


Compression Strategy Comparison

An AI tutoring agent has been working with a student for two hours. The conversation is now 80,000 tokens long, approaching the 100,000-token limit. You must decide how to compress it.

The conversation contains:
- The student's stated goal: prepare for AP Chemistry in 3 weeks
- A list of topics the agent and student have already covered (20 topics)
- Several explanations the student found confusing and asked to re-explain
- A set of practice problems and the student's answers, with feedback
- General chat and pleasantries between topics

Step 1: Apply each of the three strategies (rolling window, LLM-generated summary, hierarchical memory) to this scenario. For each strategy, describe what would be kept, what would be dropped, and what the resulting token count might be.

Step 2: For the LLM-generated summary strategy, write the actual prompt you would give the model to generate the summary. Be specific about what sections you want in the output.

Step 3: Which strategy gives the tutoring agent the best chance of continuing to help the student effectively? Justify your answer.

Step 4: What one piece of information from this conversation is most dangerous to lose, and how would you protect it regardless of which strategy you choose?