Why LLM Context Windows Fill Up Faster Than You Think

The token arithmetic behind every LLM conversation — what consumes space before the user types a word, why performance collapses before you hit the limit, and how to manage it.

You build something with GPT-4o. The model supports 128,000 tokens. You think: that's enough for a full novel. Then, four or five conversation turns in, the model starts forgetting things that were said earlier. Eight turns in, you hit an error. You check the token count — you've used over 100,000 tokens, and you've typed maybe 400 words.

This isn't a bug. It's the predictable consequence of not accounting for where those tokens actually go. A context window isn't blank space waiting to be filled with your words. By the time the first user message arrives, it is already partially consumed: by system instructions, by tool definitions, by retrieved documents, by the tokens the model itself generated in earlier turns. In a tool-heavy production agent with generous retrieval, a large share of the window (by some estimates 30–60%) can be gone before the user types their next message.

This post is about understanding that consumption precisely: where tokens go, why the effective limit is lower than the advertised limit, what happens to quality as the window fills, and how to manage all of it in practice.

Part 01
The Problem
01

The Illusion of Abundance

The advertised context window sizes of modern models are genuinely impressive. GPT-4o supports 128K tokens. Claude 3.5 supports 200K. Gemini 1.5 Pro has been demonstrated at 1 million tokens. A million tokens is roughly 750,000 words — about ten average novels. How could you possibly run out?

The answer begins with a calibration exercise. What is 128,000 tokens, really?

In English prose, one token is approximately four characters, or about three-quarters of a word. A paragraph of 100 words is roughly 133 tokens. A thousand-word article is about 1,300 tokens. At that rate, 128,000 tokens can hold roughly 96,000 words of clean English text — and yes, that sounds like a lot.
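The rule of thumb can be turned into a quick back-of-the-envelope estimator. The sketch below assumes the ratios quoted above (about four characters, or three-quarters of a word, per token). The function names are illustrative, and real counts should always come from the tokenizer itself.

```python
# Rough prose-only estimates. The 0.75 words-per-token ratio is the
# rule-of-thumb value from the text, not a property of any tokenizer.
WORDS_PER_TOKEN = 0.75


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens for clean English prose from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def words_that_fit(token_budget: int) -> int:
    """Estimate how many words of clean prose fit in a token budget."""
    return int(token_budget * WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(tokens_from_words(100))    # a 100-word paragraph
    print(tokens_from_words(1_000))  # a 1,000-word article
    print(words_that_fit(128_000))   # a full 128K window
```

These estimates only hold for clean prose; structured content needs a real tokenizer count.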

But text in an LLM application is rarely clean English prose. It is JSON payloads from tool calls. It is API responses full of structured data. It is code. It is URLs. It is conversation history with speaker labels, timestamps, and formatting. All of these consume tokens faster than the one-token-per-four-characters rule predicts, for reasons we will examine in the next section.

Then there is the question of performance. The advertised number represents a technical limit — the longest sequence the model can physically process. It does not represent the length at which the model operates at peak accuracy. Research has repeatedly found a gap between the two. In a 2025 evaluation study, 11 out of 13 major LLMs dropped below 50% of their baseline accuracy scores at just 32K tokens on complex tasks. GPT-4o, one of the most capable models, fell from 99.3% to 69.7% at that same threshold. The technical limit says 128K. The accuracy cliff arrives much earlier.

The Effective Limit Is Not the Advertised Limit

Models claiming 200K context windows show measurable quality degradation around 130K tokens in practice. Treating the advertised number as your operating budget is how production systems quietly degrade without triggering any explicit error.

There is also a third factor: cost. Every token you put in the context window is a token you pay for. Commercial APIs bill input at a per-token rate, usually quoted per thousand or per million tokens. At GPT-4o's list pricing of a few dollars per million input tokens, a full 128K-token prompt costs tens of cents per call. An agent running dozens of calls per session, with full context each time, reaches several dollars per session and surprising monthly bills quickly. Careless context management is not just a quality problem; it is a cost problem.

Understanding why this happens requires going back to basics: the mechanics of tokenization, and what actually occupies the context window in a real application.

02

How Tokens Are Counted — and Why the Count Surprises You

An LLM does not read text. It reads a sequence of integers. Before any word reaches the model, it passes through a tokenizer that converts characters into integer IDs from a vocabulary of roughly 50,000–200,000 entries. The tokenizer used by GPT-4 is called cl100k_base; it has about 100,000 vocabulary entries. GPT-4o and OpenAI's newer models use o200k_base, with about 200,000.

The vocabulary is built using BPE — Byte Pair Encoding. The algorithm is trained on a large corpus of text and learns which character sequences appear together so often that they deserve their own slot in the vocabulary. Common English words like "the," "is," "running" become single tokens. Rare words are split into subword pieces: "tokenization" might become ["token", "ization"]. Individual characters and bytes are the fallback for anything truly novel. Because BPE is trained on text, it learns efficient representations for the kinds of sequences that appear frequently in training data — which is overwhelmingly natural English prose. Anything that isn't natural English prose — JSON brackets, URL slashes, code indentation — was rarer in training, so BPE never learned to merge those sequences aggressively. That is why these content types tokenize more expensively.
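The merge-learning loop is easier to see in miniature. Below is a toy BPE trainer, a sketch rather than the implementation behind any production tokenizer: it splits on whitespace, works on characters instead of bytes, and the function names are illustrative.

```python
from __future__ import annotations

from collections import Counter


def most_frequent_pair(words: dict) -> tuple | None:
    """Return the adjacent symbol pair with the highest corpus frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words: dict, pair: tuple) -> dict:
    """Rewrite every word so that each occurrence of `pair` is one symbol."""
    merged: dict = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus: list, num_merges: int):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = dict(Counter(tuple(w) for text in corpus for w in text.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


if __name__ == "__main__":
    merges, words = train_bpe(["the the the the then", '{"a":1}'], num_merges=2)
    print(merges)
    print(words)
```

With this tiny corpus, the two merges are spent on the frequent word, which collapses into a single symbol, while the JSON fragment stays split into individual characters. That is the same frequency effect the paragraph above describes, at toy scale.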

The rule-of-thumb ratio of 1 token ≈ 4 characters holds reasonably well for clean English text. But it falls apart under several conditions that appear constantly in real applications.

Numbers tokenize unexpectedly

BPE learns tokens from frequency in training data. In older GPT-2-style vocabularies, a common number such as "2023" could earn a single token, while a rarer one such as "19847" was split into several pieces. Newer tokenizers such as cl100k_base split digit runs into chunks of at most three digits before merging, so long numbers always cost multiple tokens. Consider: the price "USD 1,234,567.89" is not 16 characters of dense data. It is approximately 10–12 tokens, because the commas, period, digit groups, and currency code may each claim separate tokens depending on their adjacency in the vocabulary.

URLs are disproportionately expensive

A URL like https://api.example.com/v2/users/12345 looks compact: 38 characters, which by the prose rule should be about 9–10 tokens. In practice it is closer to 15–20 tokens. Slashes, dots, hyphens, underscores, and alphanumeric path segments each claim their own tokens or merge only into small fragments, because URL structure was rare relative to prose in the training data and BPE never learned to merge it aggressively.

JSON and structured data can cost up to roughly 2x the tokens of plain text

Consider two ways to express the same fact:

Plain text: The user's name is Alice, she is 28 years old, and her account is active.
JSON:       {"user": {"name": "Alice", "age": 28, "status": "active"}}

The plain text version: approximately 18 tokens. The JSON version: approximately 22 tokens — and this is a trivially small object. Real API responses with deeply nested keys, repeated field names, and verbose formatting can be far more expensive. Every brace, colon, and comma is a token or part of a token. A 500-word JSON payload can use 800+ tokens, well above what the character count would suggest.
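A crude way to see where the extra tokens come from is to count the pieces a tokenizer has to start from before any merging happens. The sketch below is not a tokenizer: it uses an assumed regular expression that yields one piece per word or number and one per punctuation mark. Real BPE vocabularies merge some adjacent punctuation, so actual token counts are lower than these piece counts, but the direction of the gap is the same.

```python
import re

# One piece per word/number run, one per punctuation mark. This mimics the
# pre-splitting step of BPE tokenizers, not their final token counts.
PIECE = re.compile(r"\w+|[^\w\s]")


def count_pieces(text: str) -> int:
    """Count word and punctuation pieces in `text`."""
    return len(PIECE.findall(text))


plain = "The user's name is Alice, she is 28 years old, and her account is active."
as_json = '{"user": {"name": "Alice", "age": 28, "status": "active"}}'

if __name__ == "__main__":
    print(count_pieces(plain), "pieces in", len(plain), "characters of prose")
    print(count_pieces(as_json), "pieces in", len(as_json), "characters of JSON")
```

The JSON string is shorter in characters but starts from more pieces, because every quote, brace, colon, and comma is its own piece.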

Code tokenizes inefficiently in some languages

Research found that Python uses roughly 46% more tokens than equivalent Haskell to express the same computational idea. This is partly because Python's indentation-based structure requires whitespace tokens, and partly because Python's identifiers and keywords were less densely represented in the pre-GPT-4 training corpora.

Analogy: The Luggage Space Problem

Think of the context window as a suitcase with a volume limit, not a weight limit. Tightly folded sweaters pack a lot of clothing into little space; foam packing material fills the same volume and carries almost nothing. Plain prose is the folded sweaters: you pack a lot of meaning into few tokens. JSON, URLs, and code are the foam: structurally bulky, meaning-sparse, yet they count toward the same limit.

The interactive demo below lets you see this directly. Type or paste different kinds of text and watch the token count change with the type of content — observe how the ratio shifts when you enter a URL, a JSON snippet, or a block of code compared with an equivalent sentence of English prose.

Interactive Demo 01: Token Density Explorer
Observe how different content types (prose, JSON, code, URLs) produce very different characters-per-token ratios. Type or paste your own text to measure it. The demo reports characters, estimated tokens, characters per token, and the percentage of a 128K window used.
Token counts are estimated using a BPE approximation. Actual counts may vary ±5% from the real tiktoken result for structural text.

Part 02
The Consumers
03

The Four Layers That Eat Your Context Window

Every LLM API call sends a full context payload to the model. That payload is assembled from four distinct layers. Most developers think about only one of them — the user's message. The other three are silent consumers that accumulate before the user types a single character.

Layer 1: The System Prompt

The system prompt is the foundational layer. It is always present, on every API call. A minimal system prompt — "You are a helpful assistant" — costs about 7 tokens. But real production system prompts are not minimal.

A typical customer-facing chatbot system prompt contains: the model's persona and tone guidelines (3–5 paragraphs), a list of topics it should and should not address, instructions about response format, any domain-specific knowledge baked in as facts, legal disclaimers about the model's limitations, and formatting instructions for different response types. Measured in practice, these range from 800 to 2,500 tokens. They are charged on every single API call. A 1,500-token system prompt running 1,000 calls per day costs you 1.5 million input tokens per day before a user says anything.

Layer 2: Tool Schemas

When you give an LLM access to external tools — a web search function, a database query, a calculator, an API call — you must describe each tool to the model in the context window. The description tells the model what the tool does, what arguments it accepts, and what each argument means.

These descriptions are written in JSON and can be verbose. A moderately documented tool schema looks like:

{
  "name": "search_documents",
  "description": "Searches the internal knowledge base for documents matching the query. Returns a ranked list of excerpts with source URLs.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query. Should be a concise natural language question."
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum number of results to return. Default 5, max 20."
      },
      "filter_date": {
        "type": "string",
        "description": "ISO 8601 date string. If provided, only returns documents published after this date."
      }
    },
    "required": ["query"]
  }
}

This single schema: roughly 200 tokens. An agent with five tools carries around 1,000 tokens of tool descriptions on every call, before any user input. The JSON structure alone — all those braces, colons, and quoted keys — is part of why the token cost is higher than reading the description would suggest.

The Hidden Fixed Cost

System prompt + tool schemas is your fixed cost floor. It doesn't change turn-to-turn. It can easily reach 2,000–4,000 tokens in a real agent — that's 1.5–3% of a 128K window before the conversation begins. Small fractions, yes, but they multiply across every single API call in your fleet.
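The fixed floor is simple arithmetic, and worth scripting so it gets recomputed whenever the prompt or the tool list changes. The sketch below uses illustrative numbers (a 1,500-token system prompt, five 200-token tools, 1,000 calls per day); the function name and defaults are assumptions, and the token counts should come from measurement.

```python
def fixed_overhead(system_prompt_tokens, tool_schema_tokens,
                   window=128_000, calls_per_day=1_000):
    """Summarize the per-call fixed cost of the system prompt plus tool schemas."""
    fixed = system_prompt_tokens + sum(tool_schema_tokens)
    return {
        "fixed_tokens": fixed,                    # sent on every call
        "share_of_window": fixed / window,        # fraction of the window used
        "tokens_per_day": fixed * calls_per_day,  # billed before any user input
    }


if __name__ == "__main__":
    report = fixed_overhead(1_500, [200] * 5)
    print(report["fixed_tokens"])
    print(f"{report['share_of_window']:.2%}")
    print(report["tokens_per_day"])
```

The share of the window looks small; the per-day total is the number that shows up on the bill.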

Layer 3: Retrieved Context (RAG)

Many production LLM applications use RAG: retrieving relevant documents from a database and injecting them into the context window as supporting material for the model's answer. This is one of the largest and most variable token consumers.

A typical RAG retrieval returns 3–8 document chunks. Each chunk is typically 300–600 tokens, chosen to be the right size for embedding and retrieval. Three chunks at 400 tokens each: 1,200 tokens. Eight chunks at 500 tokens each: 4,000 tokens. In a research assistant with a generous retrieval budget, you might inject 8,000–12,000 tokens of context per query. That's 6–10% of a 128K window, every turn.

The retrieved content itself is usually prose, so tokenization is efficient. But if your knowledge base contains code, JSON configuration files, or API documentation, retrieved chunks will tokenize at the worse ratios we saw in Section 2.

Layer 4: Conversation History

This is the layer people think about the least and the one that creates the most painful surprises. In an LLM chat system, the model has no persistent memory. It does not remember previous turns. You create the illusion of memory by re-sending the full conversation history on every API call.

Turn 1: you send [system] + [user message 1]. The model replies. Turn 2: you send [system] + [user message 1] + [assistant reply 1] + [user message 2]. Every turn appends two new entries (a user message and a model response) to a history that is re-sent in its entirety.

Model responses, crucially, can be long. A detailed answer with a code snippet might be 600–800 tokens. An explanation with three bullet points might be 400 tokens. After ten exchanges, the conversation history alone can be 8,000–12,000 tokens, plus the fixed costs from layers 1 and 2, plus RAG context — all before the user types their next message.
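Because the history is re-sent on every call, the prompt grows linearly per turn while the total number of input tokens billed over the conversation grows roughly quadratically. The sketch below makes that visible. The defaults (2,000 fixed tokens, 50-token user messages, 400-token replies) are illustrative assumptions, and retrieval is left out to isolate the history effect.

```python
def prompt_sizes(turns, fixed=2_000, user=50, reply=400):
    """Prompt tokens sent on each turn when the full history is re-sent."""
    sizes = []
    history = 0
    for _ in range(turns):
        sizes.append(fixed + history + user)  # what this call sends
        history += user + reply               # what the next call inherits
    return sizes


if __name__ == "__main__":
    sizes = prompt_sizes(10)
    print("first call:", sizes[0])
    print("tenth call:", sizes[-1])
    print("total input tokens billed:", sum(sizes))
```

The tenth call is about three times the size of the first, and the cumulative total is what the invoice reflects.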

04

Context Creep — Watching the Window Fill

The process by which a context window fills over a conversation has a name in production systems: context creep. It is not sudden. It is a steady, per-turn accumulation driven primarily by the growing history layer.

The chart below traces a realistic conversation in a customer support agent. The agent has: a 1,200-token system prompt, three tool schemas totaling 600 tokens, and RAG retrieval returning two chunks (~800 tokens per turn). User messages average 60 tokens; model responses average 350 tokens.

With default parameters (1,200-token system prompt, 600 tool tokens, 800 RAG per turn, 60-token user messages, 350-token model replies), and assuming each turn's retrieved chunks stay in the history, every turn adds 1,210 tokens on top of 1,800 fixed tokens. The window reaches 80% full at around turn 83 and maxes out near turn 104. Change the model reply length to 800 tokens, simulating an agent that produces detailed answers, and those numbers fall to roughly 61 and 76. Watch how the window fills as turns accumulate. The fixed layers (system + tools) appear immediately. The RAG context adds a constant per-turn cost. The history layer grows relentlessly:

Interactive Demo 02: Context Creep Simulator
Adjust the parameters for a hypothetical AI agent and see how many conversation turns are available before the context window is exhausted. Defaults: 1,200-token system prompt, 600 tool tokens, 800 RAG tokens per turn, 60-token user messages, 350-token model replies. The demo reports turns until 80% full, turns until 100%, and fixed overhead.
80% is used as a practical limit; beyond that, quality degradation becomes significant.

Part 03
The Physics
05

KV Cache Memory — Why Context Has a Physical Cost

The context window limit is not an arbitrary policy. It is enforced by physics — specifically, GPU memory. To understand why, we need to understand what happens computationally when a model processes a long context.

The transformer's attention mechanism works by comparing every token in the context with every other token: for each token, it computes a query vector q and compares it against the key vectors k of all other tokens to produce attention weights, which are then applied to value vectors v. Each token therefore contributes three vectors (query, key, and value), and stacking them across the sequence gives the matrices Q, K, and V.

Attention Formula

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Where $Q$, $K$, $V$ are the query, key, and value matrices, and $d_k$ is the key dimension. The $QK^{\top}$ product is the expensive part: it is an n × n matrix where n is the sequence length. Doubling n quadruples this computation. This is the prefill phase, processing the full input prompt, and it is inherently O(n²) per layer.

There are two distinct computational phases in LLM inference. Prefill processes the entire input prompt at once; this is O(n²) per attention layer, and for long prompts it is the dominant compute cost. Decode generates one token at a time. Without caching, every decode step would recompute the full history and would also be O(n²).

The KV cache solves the compute problem: it stores the K and V tensors for all previously processed tokens so they can be reused without recomputation. Each decode step then only compares the new token's query against the cached keys, which is O(n) per step instead of O(n²), at the cost of memory.
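A toy single-head, single-layer decoder shows the caching idea in a few lines. This is a sketch under loud assumptions: the "projections" are plain scalings standing in for learned weight matrices, and the counter only tracks K and V projections. In a real multi-layer model the uncached path must also rerun attention for every earlier position at every layer, which is where the quadratic per-step cost comes from.

```python
import math


def attend(q, keys, values):
    """Single-head attention for one query over lists of key/value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)                              # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def project(x, factor):
    """Stand-in for a learned projection: just scales the input vector."""
    return [factor * xi for xi in x]


def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        projections += 2 * t
        outputs.append(attend(project(prefix[-1], 1.0), keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project each token once and append its K and V to the cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(project(x, 1.0), k_cache, v_cache))
    return outputs, projections


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    print(decode_without_cache(toks)[1], "projections without cache")
    print(decode_with_cache(toks)[1], "projections with cache")
```

Both paths produce the same outputs; the cached path does less work per step and pays for it by keeping `k_cache` and `v_cache` alive for the whole sequence.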

But caching has a cost: memory. Every token's K and V tensors must be stored in GPU VRAM for the duration of the conversation. The memory requirement grows linearly with context length:

KV Cache Memory Formula (Multi-Head Attention)

KV_memory = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_param

For a 7B-parameter model using standard MHA (32 layers, 32 heads, head_dim 128) at bfloat16 (2 bytes):

KV_memory per token ≈ 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 0.5 MB

At 128K context: 0.5 MB × 128,000 = 64 GB of KV cache alone.

Note on GQA: Many modern open-weight models (Llama 3 and Mistral, for example) use GQA (Grouped-Query Attention), which reduces the KV cache by sharing key-value heads across groups of query heads. The architectures of closed models such as GPT-4o are not public. A model with 32 query heads and 8 KV heads (4× reduction) brings the per-token cache cost from 0.5 MB to ~0.125 MB, or about 16 GB at 128K context. Significant, but still the dominant memory consumer at long contexts.
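The formula and the GQA adjustment fit in one small function. This is a sketch with an assumed name and signature; the only change for GQA is passing the number of KV heads instead of the number of query heads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values (the leading 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value


MIB = 1024 ** 2
GIB = 1024 ** 3

if __name__ == "__main__":
    # 7B-style MHA shape from the text: 32 layers, 32 heads, head_dim 128, bf16
    print(kv_cache_bytes(32, 32, 128, 1) / MIB, "MiB per token (MHA)")
    print(kv_cache_bytes(32, 8, 128, 1) / MIB, "MiB per token (GQA, 8 KV heads)")
    print(kv_cache_bytes(32, 32, 128, 128_000) / GIB, "GiB at 128K tokens (MHA)")
```

In binary units the 128K MHA figure is 62.5 GiB; the text's 64 GB comes from treating 1,000 MB as 1 GB.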

This is why long-context inference is expensive: not because computation is expensive (the KV cache handles that), but because memory is expensive and finite. At 128K context, the KV cache for a 7B MHA model already consumes 64 GB — more than the model weights at bfloat16 (~14 GB). GQA-equipped models do better, but the cache still dominates at long contexts. Getting to 1M tokens requires either multi-GPU setups, aggressive quantization of the KV cache, or offloading it to slower memory tiers.

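Larger models have larger caches. A 70B-class model with 80 layers and 64 attention heads of dimension 128 would need about 2.5 MB per token in the KV cache under plain MHA (2 × 80 × 64 × 128 × 2 bytes). At 128K context that is roughly 320 GB for a single sequence. GQA with 8 KV heads cuts this to about 40 GB, which is one reason large models adopt it, and why providers either cap context length aggressively for large models or charge steeply for long-context calls.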

Why Context Costs More Than Compute

Each additional token in the context doesn't just add computation — it adds permanent VRAM occupancy for the duration of the session. Once tokens are in the KV cache, they can't be removed without discarding the cache and reprocessing the context from scratch. This is why context management isn't optional — it's a direct lever on infrastructure cost.

Prompt caching: the production shortcut for fixed costs

OpenAI, Anthropic, and Google all offer prompt caching: if you send the same system prompt (or any long prefix) repeatedly, the provider caches the computed KV activations on their servers. Subsequent calls that begin with the same prefix pay a reduced per-token price for those cached tokens (roughly 50–90% cheaper, depending on the provider) and benefit from lower latency because the prefill phase for the cached portion is skipped.

This has a direct implication for the fixed-cost problem: a stable system prompt with thousands of tokens, sent on every API call, is an ideal prompt caching candidate. The first call computes and caches the KV activations for the system prompt; every subsequent call reuses them. How much you have to do depends on the provider: OpenAI caches sufficiently long prompts automatically, while Anthropic asks you to mark the cacheable prefix explicitly. In every case the cacheable prefix must be identical across calls (same characters, no dynamic content injected into the system prompt).
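The "identical prefix" requirement is easy to break by accident, for example by interpolating a timestamp or user name into the system prompt. One way to guard against that is to keep dynamic data in a later message and fingerprint the prefix in tests. The sketch below is provider-agnostic; the helper names and the message layout are assumptions, not any SDK's API.

```python
import hashlib
import json


def build_messages(system_prompt, user_message, session_facts=None):
    """Keep the system prompt static; put per-request data in a later message."""
    messages = [{"role": "system", "content": system_prompt}]
    if session_facts:
        # Dynamic content goes after the cacheable prefix, never inside it.
        messages.append({"role": "system",
                         "content": f"Session facts: {session_facts}"})
    messages.append({"role": "user", "content": user_message})
    return messages


def prefix_fingerprint(messages, n=1):
    """Hash the first n messages to check that the prefix is unchanged."""
    blob = json.dumps(messages[:n], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = build_messages("You are a support agent.", "Where is my order?", "plan=pro")
    b = build_messages("You are a support agent.", "Cancel my plan.", "plan=free")
    print(prefix_fingerprint(a) == prefix_fingerprint(b))
```

If the fingerprints of two requests differ, something dynamic has leaked into the prefix and the cache will miss.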

KV cache quantization: the emerging workaround

One active area of production optimization is quantizing the KV cache: storing the K and V tensors in lower-precision formats (int8 or int4) rather than 16-bit floats. This can cut KV cache memory by 2–4x with modest accuracy penalties. Research like KVQuant explores going below 4 bits, down to 2-bit precision with outlier values kept at higher precision, and reports 1M-token contexts for a 7B model on a single 80 GB GPU and 10M tokens on an 8-GPU system. The tradeoff is that quantization introduces small errors into the attention computation, which is tolerable for most use cases but problematic for tasks requiring precise recall of specific facts in long documents.
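To make the tradeoff concrete, here is a minimal sketch of symmetric int8 quantization applied to one cached vector. It is a generic scheme chosen for illustration, not the method used by KVQuant or by any particular inference engine, and the function names are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-vector int8 quantization: returns (codes, scale)."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0
    # Each code fits in one signed byte instead of a 2-byte bf16 value.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    key_vector = [0.5, -1.27, 0.01, 1.0]
    codes, scale = quantize_int8(key_vector)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(key_vector, restored))
    print(codes, scale, worst)
```

Storage halves relative to 16-bit values (plus one scale per vector), and the reconstruction error is bounded by half the scale. That small per-value error is what feeds into the attention scores.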

06

Lost in the Middle — Why Performance Collapses Before You Hit the Limit

Suppose you solve the memory problem. You have access to a model with a 200K context window, plenty of VRAM, and a context that's only 60% full. You're still not safe. There's a second problem, separate from memory: the model doesn't give equal attention to all parts of the context.

In 2023, researchers at Stanford and UC Berkeley published a study titled "Lost in the Middle." They gave LLMs a task requiring them to find a specific document from a set of twenty documents, all injected into the context window. The position of the relevant document was varied systematically: sometimes at the beginning, sometimes at the end, sometimes buried in the middle.

The result was stark. When the relevant document was first or last, models retrieved it accurately. When it was in the middle positions, accuracy dropped by more than 20% in the worst cases. The longer the context, the worse the effect.

It is worth noting that newer models, particularly those fine-tuned on long-context data like Claude 3.5 and GPT-4o, have partially mitigated this bias through training. But "partially" is the key word: independent evaluations continue to find meaningful position-dependent performance gaps in current models, even at lengths well within their advertised limits. The effect is model-dependent and should be tested empirically for your specific application.

Analogy: The Lecture Hall Effect

Imagine a professor delivers a two-hour lecture. Students reliably remember the opening — they're fresh, attentive, taking notes. They remember the closing — the summary, the final emphasis, the exit. What happened in the middle of hour one? It's murky. The professor said things, but the attention curve dipped. LLMs have an analogous concentration pattern: strong attention to the beginning and end of the context, with a trough in the middle.

Part of the mechanism is structural. Most modern transformer architectures use RoPE (Rotary Position Embedding) to give tokens awareness of their position in the sequence. RoPE encodes position as a rotation applied to query and key vectors, and the dot product between two RoPE-encoded vectors tends to decay as the distance between them increases. This is by design: nearby tokens are expected to be more relevant than distant ones. But the decay becomes a liability at long contexts, especially for tokens in the middle that are far from both the beginning and the model's current generation position at the end.

The practical implication: even when content fits in the window, placement matters. Information you want the model to reliably use should be near the beginning or near the end of the context. Information buried thousands of tokens deep in the middle is at risk of being effectively invisible.

A Subtle RAG Bug

If your RAG system retrieves 8 documents and inserts them in the middle of a long conversation history, the most relevant chunks may be in the attention trough. The model generates a response, you see no error, but the answer doesn't reflect those documents. The failure is silent — the model doesn't say "I couldn't read that." It just uses what it attended to, which may be the surrounding conversation history instead.
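One common mitigation is to reorder retrieved chunks so the highest-ranked ones sit at the two ends of the retrieved block and the weakest land in the middle. The sketch below assumes the retriever returns chunks sorted most relevant first; the function name is illustrative, and whether this helps for a given model should be measured.

```python
def place_at_edges(chunks_by_relevance):
    """Alternate ranked chunks between the front and the back of the block.

    Input is ordered most relevant first. The output puts the strongest
    chunks at the two ends and the weakest in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
    print(place_at_edges(ranked))
```

With five chunks, ranks 1 and 2 end up first and last, and rank 5 ends up in the centre of the block.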

Follow-up research ("Found in the Middle," 2024) proposed calibrating the model's positional attention bias, recovering up to 15 percentage points of accuracy. But this requires access to the model's internals, so it is not available as a prompt-engineering trick against a hosted API. What you can do in practice is discussed in the strategies section.

Context length alone hurts performance even with perfect retrieval

A separate 2025 finding is worth noting: context length degrades performance even when the retrieval is perfect — when the relevant information is guaranteed to be in the context. The longer the surrounding irrelevant context, the more the model's attention is distributed across noise, leaving less effective attention for the signal. This effect is called context dilution. The analogy: finding one red marble in a bag of ten marbles is easy; finding it in a bag of ten thousand is not, even if you know it's there.

The interactive demo below lets you explore the relationship between context position and attention reliability, using a simplified attention visualization based on the U-shaped curve observed in research.

Interactive Demo 03: Attention Position Heatmap
Observe how reliable attention varies across positions in the context window. Drag the slider to see where in the window a piece of information is placed, and the estimated attention reliability it receives (defaults: 32K-token context, information placed 50% of the way in). The demo reports reliability at the chosen position, the best position for the context length (the first 10% or last 10%), and the depth of the middle trough. The U-shaped curve reflects the "lost in the middle" phenomenon measured in research.
Reliability estimates are based on the relative accuracy curves from the "Lost in the Middle" paper (Stanford/Berkeley, 2023). Actual numbers depend on model, task, and content.

Part 04
Solutions
07

Token Budget Math — Calculating Your Real Available Space

The prerequisite for managing a context window is knowing precisely where the tokens are going. This requires building a budget — an explicit accounting of every zone in the context and how many tokens it receives.

Think of the context window as a table with five rows:

| Zone | What it contains | Typical token range | Fixed or variable? |
|---|---|---|---|
| System Prompt | Instructions, persona, rules, domain knowledge | 500–2,500 | Fixed per application |
| Tool Schemas | JSON descriptions of available functions/tools | 200–400 per tool | Fixed per agent |
| RAG Context | Retrieved document chunks injected before the query | 0–12,000 | Variable per turn |
| Conversation History | All prior user messages and assistant responses | 0 → grows | Grows each turn |
| Generation Reserve | Space for the model's next response | 500–2,000 | Reserved explicitly |

The total of all five zones must be less than your context limit. The generation reserve must be reserved explicitly — it is not "used up" by input, but if your prompt consumes the entire window, the model either generates nothing or truncates its response.

A worked example. Application: a customer support agent. Model: GPT-4o (128K context).

Context budget allocation:
  Total window:          128,000 tokens
  System prompt:          -1,400 tokens  (measured)
  Tool schemas (4 tools):   -800 tokens  (measured)
  Generation reserve:     -1,500 tokens  (set by us)
  ─────────────────────────────────────────
  Available for dynamic:  124,300 tokens

  Of that:
    RAG budget:           20,000 tokens  (5 chunks × 4,000 avg)
    History budget:       ~104,300 tokens (fills over time)

  ─────────────────────────────────────────
  Turns until 80% full:
    80% of 128K = 102,400 prompt tokens
    Fixed overhead = 1,400 + 800 = 2,200
    Per-turn RAG = 800
    Per-turn growth = user avg (60) + model avg (350) = 410
    Turns until (2,200 + n × 800 + n × 410) ≥ 102,400
    n × 1,210 ≥ 100,200
    n ≥ 82.8, so the threshold is crossed on turn 83

Roughly 83 turns sounds comfortable. But this assumes constant 350-token model replies. A user who triggers several detailed answers (tables, code snippets, bullet-heavy explanations) can double the history growth rate from 410 to 820 tokens per turn. Per-turn growth then becomes 1,620 tokens, and the 80% degradation threshold arrives at about turn 62.
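The worked example can be scripted so the turn count is recomputed whenever a parameter changes. This sketch follows the same linear model as the calculation above (every turn's retrieved chunks and messages stay in the history); the function name and defaults are assumptions taken from the example's numbers.

```python
import math


def turns_until(threshold, window=128_000, fixed=2_200, rag_per_turn=800,
                user_tokens=60, reply_tokens=350):
    """First turn n at which fixed + n * per_turn reaches threshold * window."""
    per_turn = rag_per_turn + user_tokens + reply_tokens
    return math.ceil((threshold * window - fixed) / per_turn)


if __name__ == "__main__":
    print(turns_until(0.8))                                    # default replies
    print(turns_until(1.0))                                    # hard limit
    print(turns_until(0.8, user_tokens=120, reply_tokens=700)) # doubled history growth
```

Because the function rounds up to the first turn at or past the threshold, its result can be one higher than a rounded-down estimate of the same division.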

Measure, Don't Estimate

The system prompt and tool schema token counts must be measured with the actual tokenizer — not estimated from character counts. The OpenAI Python SDK exposes a usage field on every API response with prompt_tokens and completion_tokens. Log these from day one. The distribution of prompt_tokens over time is your context growth curve.

The interactive budget allocator

The demo below lets you experiment with your own token budget. Adjust each zone and see the distribution change. The goal: understand what fraction of your window each layer consumes on a typical call, so that when you need to cut, you know which lever to pull.

Interactive Demo 04: Token Budget Allocator
Set the token costs for each context zone and see how the window fills. This shows the budget picture at a specific conversation turn: not over time, but at a single API call snapshot. Defaults: system prompt 1,400, tool schemas 800, RAG context 4,000, history 8,000, generation reserve 1,500. The demo reports used tokens, free tokens, and a status indicator.

08

Four Strategies for Managing Context Window Limits

The context window is finite and fills up. The question is how gracefully you manage that filling. Four strategies exist, each operating at a different layer and offering different tradeoffs.

01 Sliding Window (low complexity)
Keep only the most recent N turns of conversation verbatim. Older turns are dropped. Simple to implement; preserves recent context perfectly; loses older context entirely.

02 Hierarchical Summarization (medium complexity)
Keep recent turns verbatim; compress older turns into a rolling summary. The summary holds facts without full token cost. Requires an LLM call to summarize, which adds latency and cost.

03 Token Compression (medium complexity)
Use a compression model (e.g. LLMLingua) to remove low-entropy tokens from prompts, achieving 2–3× compression with minor accuracy loss. Best for compressing system prompts and RAG context.

04 Embedding-based Retrieval (high complexity)
Store conversation history and documents as dense vectors. At each turn, retrieve only the most relevant prior context: effectively a per-turn RAG over your own conversation history.

Strategy 1: Sliding Window

The sliding window is the simplest and most commonly deployed strategy. You set a maximum history length — say, the last 20 conversation turns — and enforce it by truncating older turns when the limit is reached. The 21st turn in means the 1st turn out.

Implementation is straightforward:

MAX_HISTORY_TURNS = 20  # configurable

def build_messages(system_prompt, history, new_message, rag_chunks):
    # history is a flat list of {"role", "content"} messages. One turn is a
    # user message plus the assistant reply, so keep 2 messages per turn.
    trimmed_history = history[-2 * MAX_HISTORY_TURNS:]

    # Assemble context: system prompt, retrieved context, history, new message
    messages = [{"role": "system", "content": system_prompt}]

    if rag_chunks:
        context_block = "\n\n".join(rag_chunks)
        messages.append({"role": "system", "content": f"Context:\n{context_block}"})

    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": new_message})

    return messages

Note that this example uses turn count as the truncation criterion. In production, you should truncate based on token count, not turn count — a 5-turn history could range from 500 tokens to 8,000 tokens depending on how much the model generated in each response. The practical implementation measures prompt_tokens from the previous API call and truncates history from the oldest end when the projected total would exceed the budget.
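A token-based variant might look like the sketch below. The `rough_token_count` helper is a crude stand-in (about four characters per token) so the example runs without dependencies; in production you would pass a counter backed by the real tokenizer or by logged usage figures. The function names are illustrative.

```python
def rough_token_count(message):
    """Crude stand-in for a real tokenizer: about four characters per token."""
    return max(1, len(message["content"]) // 4)


def trim_history_to_budget(history, budget_tokens, count_tokens=rough_token_count):
    """Keep the newest messages whose combined token count fits the budget."""
    kept, used = [], 0
    for message in reversed(history):   # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()                      # restore chronological order
    # Avoid starting on an orphaned assistant reply whose question was dropped.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "a" * 40},
        {"role": "assistant", "content": "b" * 400},
        {"role": "user", "content": "c" * 40},
        {"role": "assistant", "content": "d" * 80},
    ]
    print(len(trim_history_to_budget(history, budget_tokens=125)))
```

Dropping a leading assistant message keeps the trimmed history starting on a user turn, which avoids sending the model an answer without its question.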

The drawback of the sliding window is abrupt forgetting: when turn 1 drops, any fact established in that turn is gone. If the user's name was introduced in the first message, the model will forget it. For short-lived task-completion agents, this is fine. For long-running conversational assistants, it creates visible gaps.

Strategy 2: Hierarchical Summarization

Hierarchical summarization preserves the spirit of old context without its token weight. The architecture keeps a buffer of the last K verbatim turns, plus a rolling summary of all turns before that. When the buffer exceeds K turns, the oldest turn is compressed into the summary.

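def format_turns(turns):
    # Render role-tagged messages as plain "role: content" lines
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)

async def maybe_compress_history(llm, history, summary, buffer_size=10):
    # llm is any client exposing an async complete(prompt) -> str
    # (a hypothetical interface; adapt to your SDK)
    keep = 2 * buffer_size  # one turn = user message + assistant reply
    verbatim_turns = history[-keep:]
    turns_to_summarize = history[:-keep]

    if not turns_to_summarize:
        return history, summary

    # Fold the older turns into the rolling summary
    new_summary = await llm.complete(
        f"Existing summary: {summary or '(none)'}\n\n"
        f"New exchanges to incorporate:\n{format_turns(turns_to_summarize)}\n\n"
        "Update the summary to include these exchanges. "
        "Preserve all concrete facts, decisions, and commitments. "
        "Drop conversational filler. Be dense. Max 300 words."
    )

    return verbatim_turns, new_summary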

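The cost: one additional LLM call per compression cycle. Note that the function above compresses whenever history is longer than the buffer, which means once per turn after the buffer first fills. To amortize the cost, call it only when history has grown well past the buffer (say, to twice buffer_size); with a buffer of 10 turns, that is one summarization call every 10 turns, a minor overhead in most applications. The summary itself should be capped at a fixed token budget (200–400 tokens) so it doesn't become its own creep problem.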

A critical implementation detail: the summary model call must be non-blocking with respect to the user's message. Don't make the user wait for summarization. Run it asynchronously, or trigger it after the current turn's response has been sent.
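
One way to keep summarization off the user's critical path is to schedule it as a background task once the reply has been sent. The sketch below uses asyncio.create_task for that. The Conversation class and the summarize coroutine are hypothetical stand-ins (summarize just concatenates text where a real system would call the LLM), and after_response_sent must be called from inside a running event loop.

```python
import asyncio


async def summarize(summary, turns):
    # Stand-in for the real LLM summarization call (assumption).
    await asyncio.sleep(0.01)
    return (summary + " " + " | ".join(turns)).strip()


class Conversation:
    def __init__(self, buffer_size=10):
        self.buffer_size = buffer_size
        self.history = []     # verbatim turns, oldest first
        self.summary = ""     # rolling summary of older turns
        self._pending = None  # in-flight compression task, if any

    async def _compress(self, count):
        # Snapshot the oldest `count` turns. Turns appended while the
        # summarizer runs land at the end and are left untouched.
        old = self.history[:count]
        self.summary = await summarize(self.summary, old)
        del self.history[:count]

    def after_response_sent(self, turn):
        """Call once the reply is already on its way to the user."""
        self.history.append(turn)
        overflow = len(self.history) - self.buffer_size
        busy = self._pending is not None and not self._pending.done()
        if overflow > 0 and not busy:
            # Schedule compression in the background; do not await it here.
            self._pending = asyncio.create_task(self._compress(overflow))
```

Until the background task finishes, the next request simply sees a slightly longer verbatim history, so the user never waits on the summarizer.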

Strategy 3: Token Compression with LLMLingua

LLMLingua and its successors (LLMLingua-2, LongLLMLingua) approach the compression problem at the token level rather than the content level. Instead of summarizing, they identify individual tokens in the prompt that carry low information content — tokens the model could predict easily from context — and remove them.

The result is a compressed prompt that looks slightly garbled to a human but retains its semantic content for the LLM. Research benchmarks show 2–3× compression with accuracy loss under 5% on most tasks. The most effective targets are:

  • Verbose system prompts — lengthy instructions with redundant phrases and hedging language
  • RAG context chunks — retrieved passages often contain boilerplate text surrounding the key facts
  • Few-shot examples — demonstration examples can often be compressed without losing the pattern they demonstrate

Token compression is not suitable for content where every word matters: code that will be executed, legal clauses, precise numbers, or any content where errors compound. Research benchmarks showing under-5% accuracy loss are measured on information-retrieval and general QA tasks with ample context — for tasks where precision is critical, test compression in your specific domain before deploying it. It should not be applied to user messages — compressing user input changes the user's meaning before the model sees it.

Don't Compress the User's Message

A compressor such as LLMLingua rewrites whatever text you pass to it. If that text includes the user's current message, you may alter what the user said before the model sees it. Apply compression only to system, tool, RAG, and history zones, and never to the current user turn.
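
A small sketch of how that rule can be enforced in the context-building code, assuming the prompt is held as named zones before assembly. compress_zones and drop_filler are hypothetical; drop_filler is a toy stand-in for a real token-level compressor and only strips a few filler words.

```python
def compress_zones(zones, compress, protected=("user",)):
    """Apply `compress` to every zone except the protected ones."""
    result = {}
    for name, text in zones.items():
        if name in protected:
            result[name] = text  # pass through untouched
        else:
            result[name] = compress(text)
    return result


def drop_filler(text):
    # Toy stand-in for a real compressor (assumption for illustration):
    # it removes a handful of common filler words and nothing else.
    filler = {"please", "kindly", "very", "really", "just"}
    return " ".join(w for w in text.split() if w.lower() not in filler)


zones = {
    "system": "Please be very concise",
    "rag": "The budget is really 40k",
    "user": "Please just summarize",
}
compressed = compress_zones(zones, drop_filler)
# The "user" zone comes back byte-for-byte identical.
```

Keeping the protected list as a parameter makes the exclusion explicit and easy to extend, for example to code snippets or legal text that must also stay verbatim.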

Strategy 4: Embedding-based Retrieval Over History

This is the most architecturally sophisticated strategy and the most powerful for long-running agents. Rather than keeping a sliding window or a summary, you store every conversation turn as a dense vector embedding. At each new turn, you embed the current user message and run similarity search over the history embeddings to retrieve the most relevant prior exchanges.

The effect: only the conversation history that is relevant to the current question enters the context window. A user asking "what was the budget we discussed?" triggers retrieval of the turns where budget was mentioned — even if those turns happened fifty exchanges ago and would have been dropped by a sliding window.

This is Conversational RAG — the same mechanism as document RAG, applied to the conversation itself instead of a knowledge base. Here is how it works concretely: as each conversation turn completes, you embed that turn (the user's message plus the assistant's response, concatenated) into a dense vector and store it alongside the turn's full text. When the next user message arrives, you embed it and run a nearest-neighbor search over all stored turn embeddings. The top-k turns by similarity — maybe 3–5 — are retrieved as text and injected into the context window, alongside only the last 2–3 verbatim turns for coherence. All other history stays in the vector store, never entering the context.

The retrieval logic mirrors exactly what a document RAG system does for knowledge retrieval, except your "documents" are conversation turns. This requires:

  • An embedding model to vectorize turns (e.g. text-embedding-3-small)
  • A vector store for history embeddings (in-memory for short sessions; persistent across sessions for long-running assistants)
  • A retrieval call per user message to find the top-k relevant history turns before building the context

The tradeoff is complexity and latency. Each turn now requires an embedding call and a vector search before building the context. For most web applications, this adds 50–150ms of latency — acceptable. For real-time voice applications, it may not be. Also note that the four strategies are not mutually exclusive: production systems often combine them. A sliding window of 5–8 verbatim turns plus a rolling summary plus retrieval from the older history covers all distance scales simultaneously.
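
The store-and-retrieve loop described above can be sketched in a few lines. This is an illustration under loud assumptions: embed is a toy bag-of-words vector, not a real embedding model, and HistoryStore is a hypothetical in-memory store. A production system would call an embedding model and a vector database at the two marked points.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy bag-of-words "embedding" (assumption for illustration only).
    # A real system would call an embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HistoryStore:
    def __init__(self):
        self.turns = []  # list of (vector, text) pairs, oldest first

    def add_turn(self, user_msg, assistant_msg):
        # Embed the user message and the reply together as one turn.
        text = f"User: {user_msg}\nAssistant: {assistant_msg}"
        self.turns.append((embed(text), text))

    def retrieve(self, query, k=3, skip_last=2):
        """Return the top-k older turns by similarity to the query.

        The last `skip_last` turns are excluded because they are
        included verbatim in the context anyway.
        """
        query_vec = embed(query)
        older = self.turns[:-skip_last] if skip_last else self.turns
        ranked = sorted(older, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

Excluding the most recent turns from retrieval avoids injecting the same exchange twice, once as a retrieved hit and once as verbatim history.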

Applied to AI coding agents: cross-session working memory

Embedding-based retrieval over history has an exact analogue in AI coding tools. When an AI code assistant spends a session researching a codebase — tracing call graphs, locating edge cases, identifying key file paths — that knowledge exists only in the session's context window. When the session ends, the knowledge is gone. The next session re-discovers everything from scratch, re-reading the same files and re-grepping the same symbols.

A benchmark across the Apache Camel codebase (5,856 files) measured this re-discovery cost directly. Without session memory, a second-session implementation agent made 51 tool calls to rebuild its mental model and still failed on one task, producing 0 bytes of output. With structured notes stored via vectr_remember during the research session and retrieved via vectr_recall at the start of the implementation session, the same tasks completed with 40% fewer input tokens and 58% lower cost. The mechanism is the same as Conversational RAG: store context as structured notes, retrieve only what is needed, avoid re-computing what is already known.

09

The Practical Playbook

Knowing the strategies is necessary but not sufficient. The practical question is: for a given application type, what should you actually do? Below is the decision logic for the most common cases.

Short task-completion agents (under 20 turns expected)

Use a sliding window of the last 10–15 turns. The task will complete before history becomes a problem. Reserve your optimization effort for fixed-cost reduction: measure your system prompt and tool schemas, remove redundant language, consider whether all tools need to be registered for every call or if you can load tools dynamically based on the current turn's context.

Long-running conversational assistants

Implement hierarchical summarization. Keep 8–12 verbatim turns. Run summarization asynchronously when history exceeds 15–20 turns. Cap summaries at 400 tokens. This keeps the history zone bounded while preserving continuity. Combine with a fixed-cost audit: every 100 turns or so, review whether the system prompt has grown through edits — prompt creep is real.

Document-heavy research assistants (heavy RAG use)

The RAG zone is your biggest variable. Three tactics: (a) limit retrieval to 3–5 top chunks rather than 8–10; (b) apply token compression to chunks before injection, since a RAG chunk compressed 2× means twice as many sources can fit; (c) order retrieved chunks so the most relevant ones sit at the beginning and end of the context (exploiting the primacy/recency attention bias), with the weakest in the middle.
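
Tactic (c) can be sketched as a small reordering step applied after retrieval. The function name order_for_edges is hypothetical, and the input is assumed to be already sorted most-relevant first.

```python
def order_for_edges(chunks_by_relevance):
    """Place the strongest chunks at the edges, the weakest in the middle.

    Input must be sorted most-relevant first. Chunks are dealt
    alternately to the front and the back of the output.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # A is the most relevant
print(order_for_edges(ranked))      # ['A', 'C', 'E', 'D', 'B']
```

The best chunk opens the block, the second-best closes it, and the lowest-ranked chunk ends up in the center, where attention is least reliable.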

Production AI agents with many tools

Dynamic tool registration reduces the tool schema overhead significantly. Instead of registering all tools upfront, analyze the current conversation turn to determine which tools are likely needed, and only include schemas for those. A routing classifier — even a simple keyword matcher — can reduce a 10-tool schema overhead from ~2,000 tokens to ~400 tokens on most turns. More sophisticated: use the model itself to route, with a cheaper, smaller model making the tool-selection decision before the main model call.
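
A keyword-matching router of the kind mentioned above might look like the following. The tool names, keyword sets, and the select_tools function are all hypothetical, and schemas are assumed to be dicts with a "name" key. The sketch fails open: if no keyword matches, it sends every schema rather than none.

```python
import re

# Hypothetical tool names and trigger words, for illustration only.
TOOL_KEYWORDS = {
    "search_orders": {"order", "orders", "shipment", "tracking"},
    "issue_refund": {"refund", "refunds", "chargeback"},
    "get_weather": {"weather", "forecast", "temperature"},
}


def select_tools(user_message, all_schemas):
    """Return only the schemas whose keywords appear in the message."""
    words = set(re.findall(r"[a-z]+", user_message.lower()))
    selected = {tool for tool, keywords in TOOL_KEYWORDS.items() if words & keywords}
    if not selected:
        # Fail open: with no signal, keep every tool available.
        return list(all_schemas)
    return [schema for schema in all_schemas if schema["name"] in selected]
```

Failing open trades some token savings for safety: a missed keyword costs tokens on that turn, but never hides a tool the model actually needed.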

What to monitor in production

Three metrics to track on every API call:

  • prompt_tokens / context_limit — the context utilization ratio. Alert above 70%; act above 80%.
  • prompt_tokens by zone — instrument your context-building code to tag each zone's contribution. When total grows, you need to know which zone is responsible.
  • Quality signals near high utilization — track user satisfaction scores, task completion rates, or whatever quality signal your application has, segmented by context utilization. You may find quality degrades measurably above 60% utilization in your specific application.
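
The first two metrics reduce to a few lines of arithmetic. The sketch below uses the 70% and 80% thresholds from the list above; the function names and the zone labels are hypothetical, and the per-zone token counts are assumed to come from your own context-building instrumentation.

```python
def utilization_status(prompt_tokens, context_limit, alert_at=0.70, act_at=0.80):
    """Return (ratio, status), where status is 'ok', 'alert', or 'act'."""
    ratio = prompt_tokens / context_limit
    if ratio > act_at:
        status = "act"
    elif ratio > alert_at:
        status = "alert"
    else:
        status = "ok"
    return ratio, status


def zone_shares(zone_tokens):
    """Fraction of the prompt contributed by each zone."""
    total = sum(zone_tokens.values())
    return {zone: (count / total if total else 0.0) for zone, count in zone_tokens.items()}


ratio, status = utilization_status(prompt_tokens=96_000, context_limit=128_000)
# ratio is 0.75, so status is "alert"
shares = zone_shares({"system": 1000, "tools": 1000, "rag": 2000, "history": 4000})
# history accounts for half of the prompt in this example
```

Logging both values on every call gives you the time series needed for the third metric: quality signals can then be bucketed by utilization ratio after the fact.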

The Core Mental Model

Treat the context window as a RAM budget, not a document store. RAM budgets are managed explicitly: you know what is loaded, you evict what is no longer needed, you measure what you're using, and you alert before you run out. Applying that same discipline to context windows is the difference between a system that degrades silently and one that remains reliable at scale.

Positioning for the "lost in the middle" effect

One architectural recommendation that costs nothing in tokens but requires discipline: order matters. When assembling your context, place the most important information at the positions of highest attention reliability, the beginning and the end. A common default ordering in framework-based pipelines (LangChain, LlamaIndex) is: system → history → RAG chunks → user message, with chunks listed in descending relevance. This is natural but suboptimal: the user's current question benefits from recency, but the most relevant retrieved chunk sits farthest from it, buried in the middle between old history and the lower-ranked chunks.

A deliberate alternative: system → recent history (most-recent last) → RAG chunks (most relevant last, adjacent to the user message) → current user message. The most relevant RAG chunk and the current question sit adjacent at the end of the context, within the recency attention peak. The system prompt anchors the beginning. Older history — the least relevant content — occupies the lower-attention middle. This is a deviation from framework defaults and requires explicit control over message assembly, but it is supported on all major APIs through the messages array.
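
A minimal sketch of that ordering, assuming an OpenAI-style messages array that accepts a second system-role message for retrieved context (the same convention as the sliding-window example earlier; some APIs take the system prompt as a separate parameter instead). The name build_messages_ordered is hypothetical, and ranked_chunks is assumed to arrive sorted most-relevant first. Unlike the earlier example, the context block is placed after history, directly before the question.

```python
def build_messages_ordered(system_prompt, recent_history, ranked_chunks, new_message):
    """Assemble: system -> recent history -> RAG (best chunk last) -> user."""
    messages = [{"role": "system", "content": system_prompt}]

    # History is assumed chronological, so the most recent turn is last.
    messages.extend(recent_history)

    if ranked_chunks:
        # Reverse the ranking so the most relevant chunk ends up last,
        # adjacent to the user's question.
        context_block = "\n\n".join(reversed(ranked_chunks))
        messages.append({"role": "system", "content": f"Context:\n{context_block}"})

    messages.append({"role": "user", "content": new_message})
    return messages
```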

Conclusion

The Window Is a System Resource

A context window is not a convenient storage area. It is a constrained compute and memory resource with a capacity limit, a quality degradation curve that precedes that limit, and a physical cost that scales with how much of it you use.

Every production LLM application consumes context from four layers simultaneously — system prompt, tool schemas, RAG context, and conversation history — and the sum of those layers can quietly exhaust the window long before users run out of things to say. The fix is not a bigger context window (though that helps with headroom). The fix is deliberate budget management: measure each zone, set explicit limits, implement a context manager, and monitor the utilization ratio in production.

The attention degradation problem — "lost in the middle" — adds a second dimension: even when your window is not full, the quality of the model's use of context depends on where in the window the important information sits. Primacy and recency are real, measurable biases that application design can exploit or fall victim to.

The strategies — sliding window, hierarchical summarization, token compression, embedding-based retrieval — exist on a complexity spectrum. Simple applications can start with a sliding window and graduate to more sophisticated management as they scale. What matters is that the management is explicit, not accidental.

Context engineering is not a secondary concern. For any application where quality, cost, or latency matters at scale, it is a first-class design problem — as important as your prompt engineering and your retrieval strategy. It just gets less attention because it fails silently.

