Why LLM Context Windows Fill Up Faster Than You Think
The token arithmetic behind every LLM conversation — what consumes space before the user types a word, why performance collapses before you hit the limit, and how to manage it.
You build something with GPT-4o. The model supports 128,000 tokens. You think: that's enough for a full novel. Then, four or five conversation turns in, the model starts forgetting things that were said earlier. Eight turns in, you hit an error. You check the token count — you've used over 100,000 tokens, and you've typed maybe 400 words.
This isn't a bug. It's the predictable consequence of not accounting for where those tokens actually go. A context window isn't blank space waiting to be filled with your words. By the time the first user message arrives, it is already partially consumed — by system instructions, by tool definitions, by retrieved documents, by the tokens the model itself generated in earlier turns. In a production AI agent, 30–60% of the context window is gone before a user types anything.
What follows is a precise accounting of where those tokens go — the four layers that consume the window before users say anything, why the effective limit is substantially lower than the advertised one, what happens to response quality as the window approaches capacity, and which engineering patterns actually manage it at production scale.
The Illusion of Abundance
GPT-4o supports 128K tokens. Claude 3.5 supports 200K. Gemini 1.5 Pro has been demonstrated at a million tokens — roughly 750,000 words, about ten average novels. The numbers sound absurdly generous. How could you possibly run out?
Start with a calibration exercise. What is 128,000 tokens, actually?
In English prose, one token is roughly four characters — about three-quarters of a word. A 1,000-word article runs to around 1,300 tokens, so 128K tokens can hold close to 96,000 words of clean text. That genuinely is a lot.
But text in an LLM application is rarely clean English prose. It is JSON payloads from tool calls. It is API responses full of structured data. It is code. It is URLs. It is conversation history with speaker labels, timestamps, and formatting. All of these serialize at noticeably fewer than 4 characters per token, which means more tokens for the same amount of text, for reasons we will examine in the next section.
Then there is the question of performance. The advertised number represents a technical limit — the longest sequence the model can physically process. It does not represent the length at which the model operates at peak accuracy. Research has repeatedly found a significant gap between the two. Long-context benchmarks like RULER (2024) and HELMET (2024) found that in adversarial multi-document tasks, most frontier LLMs showed accuracy drops well before 32K tokens — GPT-4o fell from near-perfect baseline scores to the high-60s percentage range at 32K in some configurations. The technical limit says 128K. The accuracy cliff arrives much earlier.
Models claiming 200K context windows show measurable quality degradation around 130K tokens in practice. Treating the advertised number as your operating budget is how production systems quietly degrade without triggering any explicit error.
Cost is the third angle. Every token in the context is a token billed. At GPT-4o's list pricing of a few dollars per million input tokens, a full 128K-token prompt costs tens of cents per call, and agents often make dozens of calls per session, each with the full accumulated context. The monthly bill from a badly managed context window can surprise you well before any error appears in the logs. This is a quality problem, a reliability problem, and a cost problem, all from the same root cause.
The root cause is the tokenizer — and specifically how it represents the kinds of structured, non-prose content that dominate real application payloads.
How Tokens Are Counted — and Why the Count Surprises You
An LLM does not read text. It reads a sequence of integers. Before any word reaches the model, it passes through a tokenizer that converts characters into integer IDs from a vocabulary of roughly 50,000–200,000 entries. The tokenizer used by GPT-4 and GPT-3.5 Turbo is called cl100k_base; it has about 100,000 vocabulary entries. GPT-4o and OpenAI's newer models use o200k_base, with about 200,000.
The vocabulary is built using BPE — Byte Pair Encoding. The name comes from the construction: you start with individual characters, then repeatedly merge the pair of adjacent symbols that appears most often in your training corpus, replacing each pair with a new combined token. Do this enough times and common English words end up as single tokens. The algorithm learns what to merge entirely from what was common in the training text — mostly English prose on the internet. That's why "the", "is", "running" each become a single token, while "tokenization" becomes ["token", "ization"] (two pieces — less common as a whole word). Characters and raw bytes are the fallback for anything the vocabulary doesn't cover. The consequence is simple: anything that wasn't well-represented in training data — JSON brackets, URL slashes, code indentation — never got merged aggressively, so those sequences remain expensive in tokens relative to the characters they contain.
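The merge loop described above can be sketched in a few lines. This is a toy version under simplifying assumptions: it works on characters rather than bytes, skips the pre-tokenization step real tokenizers apply, and trains on a tiny hand-made word list. The function names (`train_bpe`, `merge_pair`, `encode`) are illustrative, not part of any library.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# "the" dominates the toy corpus, "url" appears once.
merges = train_bpe(["the"] * 10 + ["then"] * 3 + ["url"], num_merges=2)
print(encode("the", merges))  # frequent word collapses to one token
print(encode("url", merges))  # rare word stays at one token per character
```

With only two merges available, both go to the pairs inside "the", and the rare string never earns a merge. That is the same dynamic, at miniature scale, that leaves JSON punctuation and URL fragments expensive.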
The rule-of-thumb is 1 token ≈ 4 characters for clean English prose — decent enough for napkin estimates. But real LLM application payloads aren't prose. They're JSON, URLs, code, and mixed structured content, and those break the ratio badly.
Numbers tokenize unexpectedly
BPE learns tokens from frequency in training data. In older GPT-2-era vocabularies, a common number like "2023" could become a single token while a rarer one like "19847" was split into pieces. Newer tokenizers such as cl100k_base go further and cap digit runs at three digits per token, so any long number splits. Consider: the price "USD 1,234,567.89" is not 16 characters of dense data. It is around 10 tokens, because the commas, period, digit groups, and currency code each claim separate tokens.
URLs are disproportionately expensive
A URL like https://api.example.com/v2/users/12345 looks compact: 38 characters, which by the prose rule should be about 9–10 tokens. In practice it is closer to 15–20 tokens. Slashes, dots, hyphens, underscores, and alphanumeric path segments each claim their own tokens or merge into small fragments. The fragments stay small because URLs are structurally uncommon in prose, so BPE never learned to merge them aggressively.
JSON and structured data use roughly 2x the token count of plain text
Consider two ways to express the same fact:
Plain text: The user's name is Alice, she is 28 years old, and her account is active.
JSON: {"user": {"name": "Alice", "age": 28, "status": "active"}}
The plain text version: approximately 18 tokens. The JSON version: approximately 22 tokens, roughly 20% more even though the JSON string has fewer characters, and this is a trivially small object. Real API responses with deeply nested keys, repeated field names, and verbose formatting can be far more expensive. Every brace, colon, and comma is a token or part of a token. A 500-word JSON payload can use 800+ tokens, well above what the character count would suggest.
Code tokenizes inefficiently in some languages
Research found that Python uses roughly 46% more tokens than equivalent Haskell to express the same computational idea. This is partly because Python's indentation-based structure requires whitespace tokens, and partly because Python's identifiers and keywords were less densely represented in the pre-GPT-4 training corpora.
Think of the context window as a suitcase with a fixed volume, not a weight limit. Tightly folded sweaters pack a lot of clothing into that volume; foam packing material fills the same space while carrying almost nothing. Plain prose is the sweaters: you pack a lot of meaning into few tokens. JSON, URLs, and code are the foam: structurally bulky, meaning-sparse, yet they count toward the same limit.
The Four Layers That Eat Your Context Window
Every LLM API call is a full context payload assembled from four distinct layers. Most developers think about only one: the user's current message. The other three arrive already loaded — silent costs that accumulate before the user types anything.
Layer 1: The System Prompt
The system prompt is the foundational layer. It is always present, on every API call. A minimal system prompt — "You are a helpful assistant" — costs about 7 tokens. But real production system prompts are not minimal.
A typical customer-facing chatbot system prompt contains: the model's persona and tone guidelines (3–5 paragraphs), a list of topics it should and should not address, instructions about response format, any domain-specific knowledge baked in as facts, legal disclaimers about the model's limitations, and formatting instructions for different response types. Measured in practice, these range from 800 to 2,500 tokens. They are charged on every single API call. A 1,500-token system prompt running 1,000 calls per day costs you 1.5 million input tokens per day before a user says anything.
Layer 2: Tool Schemas
When you give an LLM access to external tools — a web search function, a database query, a calculator, an API call — you must describe each tool to the model in the context window. The description tells the model what the tool does, what arguments it accepts, and what each argument means.
These descriptions are written in JSON and can be verbose. A moderately documented tool schema looks like:
{
"name": "search_documents",
"description": "Searches the internal knowledge base for documents matching the query. Returns a ranked list of excerpts with source URLs.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query. Should be a concise natural language question."
},
"max_results": {
"type": "integer",
"description": "Maximum number of results to return. Default 5, max 20."
},
"filter_date": {
"type": "string",
"description": "ISO 8601 date string. If provided, only returns documents published after this date."
}
},
"required": ["query"]
}
}
This single schema: roughly 200 tokens. An agent with five tools carries around 1,000 tokens of tool descriptions on every call, before any user input. The JSON structure alone — all those braces, colons, and quoted keys — is part of why the token cost is higher than reading the description would suggest.
System prompt + tool schemas is your fixed cost floor. It doesn't change turn-to-turn. It can easily reach 2,000–4,000 tokens in a real agent — that's 1.5–3% of a 128K window before the conversation begins. Small fractions, yes, but they multiply across every single API call in your fleet.
Layer 3: Retrieved Context (RAG)
Many production LLM applications use RAG: retrieving relevant documents from a database and injecting them into the context window as supporting material for the model's answer. This is one of the largest and most variable token consumers.
A typical RAG retrieval returns 3–8 document chunks. Each chunk is typically 300–600 tokens, chosen to be the right size for embedding and retrieval. Three chunks at 400 tokens each: 1,200 tokens. Eight chunks at 500 tokens each: 4,000 tokens. In a research assistant with a generous retrieval budget, you might inject 8,000–12,000 tokens of context per query. That's 6–10% of a 128K window, every turn.
The retrieved content itself is usually prose, so tokenization is efficient. But if your knowledge base contains code, JSON configuration files, or API documentation, retrieved chunks will tokenize at the worse ratios described in the tokenization section above.
Layer 4: Conversation History
This is the layer people think about the least and the one that creates the most painful surprises. In an LLM chat system, the model has no persistent memory. It does not remember previous turns. You create the illusion of memory by re-sending the full conversation history on every API call.
Turn 1: you send [system] + [user message 1]. The model replies. Turn 2: you send [system] + [user message 1] + [assistant reply 1] + [user message 2]. Every turn appends two new entries (a user message and a model response) to a history that is re-sent in its entirety.
Model responses, crucially, can be long. A detailed answer with a code snippet might be 600–800 tokens. An explanation with three bullet points might be 400 tokens. After ten exchanges, the conversation history alone can be 8,000–12,000 tokens, plus the fixed costs from layers 1 and 2, plus RAG context — all before the user types their next message.
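Because the history is re-sent on every call, the total number of input tokens billed over a conversation grows faster than the conversation itself. The sketch below adds up the prompt size of each call. The default values (1,800 fixed tokens, 60-token user messages, 350-token replies) are illustrative assumptions, and retrieval is left out to isolate the history effect.

```python
def cumulative_input_tokens(turns, fixed=1_800, user=60, reply=350):
    """Total prompt tokens billed across `turns` calls when the full
    history is re-sent every time (no retrieval, no trimming)."""
    total = 0
    history = 0
    for _ in range(turns):
        prompt = fixed + history + user  # what this call sends
        total += prompt
        history += user + reply          # both messages join the history
    return total

print(cumulative_input_tokens(10))  # 37050
print(cumulative_input_tokens(20))  # 115100
```

Doubling the conversation from 10 to 20 turns roughly triples the billed input under these assumptions, because every extra turn is paid for again on each later call.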
Context Creep — Watching the Window Fill
The process by which a context window fills over a conversation has a name in production systems: context creep. It is not sudden. It is a steady, per-turn accumulation driven primarily by the growing history layer.
To see it in numbers, take a realistic customer support agent. The agent has a 1,200-token system prompt, three tool schemas totaling 600 tokens, and RAG retrieval returning two chunks (about 800 tokens per turn). User messages average 60 tokens; model responses average 350 tokens.
With those parameters, and assuming retrieved chunks stay in the history once injected, the window hits 80% full around turn 84. Raise the model reply length to 800 tokens, simulating an agent that produces detailed answers, and that drops to around turn 61, because the per-turn history growth roughly doubles. The fixed layers (system + tools) appear immediately. The RAG context adds a constant per-turn cost. The history layer grows relentlessly.
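A small helper makes the per-layer breakdown concrete. It is a sketch under one explicit assumption: retrieved chunks remain in the prompt on later turns, so the RAG layer accumulates alongside the history. The function name and the dictionary keys are illustrative.

```python
def window_at_turn(turn, system=1_200, tools=600, rag_per_turn=800,
                   user=60, reply=350, limit=128_000):
    """Token usage per layer after `turn` completed turns.
    Assumes retrieved chunks are kept in the prompt on later turns."""
    layers = {
        "fixed": system + tools,
        "rag": turn * rag_per_turn,
        "history": turn * (user + reply),
    }
    # Fraction of the window in use; computed before the key is added.
    layers["fill"] = sum(layers.values()) / limit
    return layers

print(window_at_turn(84))
print(window_at_turn(61, reply=800))
```

At turn 84 the accumulated retrieval layer (67,200 tokens) is about twice the size of the message history (34,440 tokens), which suggests that dropping stale retrieved chunks is often the first lever worth pulling.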
KV Cache Memory — Why Context Has a Physical Cost
The context window limit is not an arbitrary policy. It is enforced by physics — specifically, GPU memory. To understand why, we need to understand what happens computationally when a model processes a long context.
The transformer's attention mechanism works by comparing every token in the context with every other token. Think of it like a search: for each token, the model creates a query ("what am I looking for?"), and every other token offers a key ("what do I contain?"). The similarity between a query and a key determines how much attention flows from one token to another. A third vector — the value — carries the actual information that gets passed when attention is high. In practice, for every token in the input, the model computes three vectors: a query q, a key k, and a value v. Assembled across all tokens, these become the matrices Q, K, and V.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$
Where Q, K, V are the query, key, and value matrices, and $d_k$ is the key dimension. Plain-English version: "For each token, compare it against every other token, weight the results, and blend the values accordingly." The $QK^{\top}$ product is the expensive part: it is an n × n matrix where n is the sequence length. Doubling n quadruples this computation. This is the prefill phase (processing the full input prompt), and it is inherently O(n²) per layer.
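The formula can be written out in plain Python to show where the n × n cost comes from. This is a minimal sketch for a single head with no batching and no causal mask, which real decoder models add; it uses nested lists instead of tensors to stay dependency-free.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:  # one pass per query ...
        # ... scoring against every key: n * n dot products in total.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two identical keys receive equal weight, so the output is the mean of V.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

The outer loop over queries and the inner comprehension over keys are the quadratic term: with n tokens in the prompt, both run n times.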
There are two distinct computational phases in LLM inference. Prefill processes the entire input prompt at once — this is O(n²) per attention layer, and for long prompts it is the dominant compute cost. In practice, implementations like FlashAttention reduce the memory bandwidth pressure dramatically via tiled computation, but the asymptotic complexity doesn't change: prefill scales quadratically with sequence length. Decode generates one token at a time, attending only to the current token against the cached history — this is O(n) per step with the KV cache. Without caching, decode would also be O(n²) because you'd recompute the full history on every step. The KV cache converts decode from O(n²) to O(n) at the cost of memory.
The KV cache solves the compute problem: it stores the K and V tensors for all previously processed tokens so they can be reused without recomputation. Generation becomes O(n) instead of O(n²) per step.
But caching has a cost: memory. Every token's K and V tensors must be stored in GPU VRAM for the duration of the conversation. The memory requirement grows linearly with context length:
KV_memory = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_param
For a 7B-parameter model using standard MHA (32 layers, 32 heads, head_dim 128) at bfloat16 (2 bytes):
KV_memory per token ≈ 2 × 32 × 32 × 128 × 1 × 2 = 524,288 bytes ≈ 0.5 MB (seq_len = 1 for per-token cost)
At 128K context: 0.5 MB × 128,000 = 64 GB of KV cache alone.
Note on GQA: Most modern open-weight models (Llama 3, Mistral) use GQA (Grouped-Query Attention), which reduces the KV cache by sharing key-value heads across groups of query heads; the architecture of closed models such as GPT-4o is not published. A model with 32 query heads and 8 KV heads (4× reduction) brings the per-token cache cost from 0.5 MB to ~0.125 MB, about 16 GB at 128K context. Significant, but still the dominant memory consumer at long contexts. DeepSeek and related models use MLA (Multi-head Latent Attention), which compresses the KV cache via low-rank projection and is reported to achieve 5–10× memory reduction over standard MHA. If you're running inference on DeepSeek-class models, the KV cache numbers above are significantly smaller in practice.
This is why long-context inference stays expensive even after the KV cache removes redundant decode computation: memory is expensive and finite. At 128K context, the KV cache for a 7B MHA model already consumes 64 GB, more than the model weights at bfloat16 (~14 GB). GQA-equipped models do better, but the cache still dominates at long contexts. Getting to 1M tokens requires multi-GPU setups, aggressive quantization of the KV cache, or offloading it to slower memory tiers.
Larger models have larger caches. A 70B model with MHA (80 layers, 64 heads, head_dim 128, bfloat16) comes to roughly 2.5 MB per token: 2 × 80 × 64 × 128 × 2 bytes = 2,621,440 bytes. At 128K context that's ~320 GB — which is why providers either cap context length aggressively for large models, or charge steeply for long-context calls. GQA with 8 KV heads drops it to ~40 GB, which is still substantial.
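The arithmetic in this section is easy to reproduce. The sketch below implements the KV_memory formula with the number of key-value heads as a parameter, so the same function covers MHA and GQA. The layer and head counts are the ones quoted in the text; results are reported in GiB (2^30 bytes), so they land slightly below the rounded GB figures above.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_param=2):
    """KV cache bytes for one token. The leading 2 accounts for storing
    both a key and a value vector per layer and per KV head."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_param

def kv_cache_gib(seq_len, **model):
    """KV cache size in GiB for a sequence of `seq_len` tokens."""
    return kv_bytes_per_token(**model) * seq_len / 2**30

mha_7b = dict(n_layers=32, n_kv_heads=32, d_head=128)
gqa_7b = dict(n_layers=32, n_kv_heads=8, d_head=128)
mha_70b = dict(n_layers=80, n_kv_heads=64, d_head=128)

print(kv_bytes_per_token(**mha_7b))      # 524288
print(kv_cache_gib(128_000, **mha_7b))   # 62.5
print(kv_cache_gib(128_000, **gqa_7b))   # 15.625
print(kv_cache_gib(128_000, **mha_70b))  # 312.5
```

Setting `bytes_per_param=1` gives a quick estimate of what an 8-bit cache would occupy for the same model shape.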
Each additional token in the context doesn't just add computation — it adds permanent VRAM occupancy for the duration of the session. Once tokens are in the KV cache, they can't be removed without discarding the cache and reprocessing the context from scratch. This is why context management isn't optional — it's a direct lever on infrastructure cost.
Prompt caching: the production shortcut for fixed costs
OpenAI, Anthropic, and Google all offer prompt caching: if you send the same system prompt (or any long prefix) repeatedly, the provider caches the computed KV activations on their servers. Subsequent calls that begin with the same prefix pay a reduced per-token price for those cached tokens (typically 50–90% cheaper, depending on the provider and model) and benefit from lower latency because the prefill phase for the cached portion is skipped.
This has a direct implication for the fixed-cost problem: a stable system prompt with thousands of tokens, sent on every API call, is an ideal prompt caching candidate. The first call computes and caches the KV activations for the system prompt; every subsequent call reuses them. How much you have to do depends on the provider: OpenAI applies caching automatically, while Anthropic asks you to mark the cacheable prefix with a cache_control breakpoint. In both cases you must ensure the cacheable prefix is identical across calls (same characters, no dynamic content injected into the system prompt). One practical constraint: both OpenAI and Anthropic require a minimum prefix length of at least 1,024 tokens before caching activates, and some models require more. A 200-token system prompt won't benefit, which is another reason to consolidate your system prompt into one substantial block rather than spreading instructions across multiple small messages.
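A rough cost model shows what a cached prefix is worth. Everything numeric in the example is an assumption for illustration: the price of 2.50 dollars per million input tokens, the 50% discount, and the call volume. The sketch also ignores cache misses and any surcharge a provider applies when writing to the cache.

```python
def daily_input_cost(prefix_tokens, dynamic_tokens, calls_per_day,
                     price_per_mtok, cached_discount=0.0):
    """Input cost per day in dollars. `cached_discount` is the fraction
    taken off the price of prefix tokens that hit the cache."""
    prefix_cost = prefix_tokens * (1 - cached_discount)
    per_call = (prefix_cost + dynamic_tokens) * price_per_mtok / 1_000_000
    return per_call * calls_per_day

# Hypothetical workload: 1,500-token stable prefix, 500 dynamic tokens,
# 1,000 calls per day, assumed price of $2.50 per million input tokens.
uncached = daily_input_cost(1_500, 500, 1_000, 2.50)
cached = daily_input_cost(1_500, 500, 1_000, 2.50, cached_discount=0.5)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
```

The saving scales with the share of the prompt that is stable, so the larger the fixed prefix relative to the dynamic part, the more caching matters.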
KV cache quantization: the emerging workaround
One active area of production optimization is quantizing the KV cache: storing the K and V tensors in lower-precision formats (int8 or int4) rather than 16-bit floats. This can cut KV cache memory by 2–4x with modest accuracy penalties. Research like KVQuant explores pushing the cache to 2 to 4 bits while keeping a small set of outlier values at full precision, targeting 10M-token contexts on a multi-GPU server. The tradeoff is that quantization introduces small errors into the attention computation, which is tolerable for most use cases but problematic for tasks requiring precise recall of specific facts in long documents.
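The source of those small errors is easy to see in a toy example. The sketch below applies symmetric int8 quantization to a single vector with one shared scale factor. Production systems quantize tensors per channel or per group and handle outliers separately, so treat this as an illustration of the rounding step only.

```python
def quantize_int8(values):
    """Symmetric int8 quantization of one vector: returns (codes, scale)."""
    # One scale for the whole vector; fall back to 1.0 for an all-zero input.
    scale = max(abs(v) for v in values) / 127 or 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

key_vector = [0.5, -1.27, 0.0, 1.0, 0.0312]
codes, scale = quantize_int8(key_vector)
restored = dequantize(codes, scale)
worst = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, f"worst-case error {worst:.4f}")
```

Each restored value is off by at most half a quantization step. The step size is set by the largest magnitude in the vector, which is why a single outlier degrades precision for every other entry and why outlier handling features so heavily in the research.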
Lost in the Middle — Why Performance Collapses Before You Hit the Limit
Memory is the first constraint. Attention quality is the second — and it bites you even when your window is half-empty. A model with a 200K context window and plenty of headroom can still fail to use information that's nominally "in context," because not all positions in the context get equal attention.
In 2023, researchers at Stanford and UC Berkeley published a study titled "Lost in the Middle." They gave LLMs a task requiring them to find a specific document from a set of twenty documents, all injected into the context window. The position of the relevant document was varied systematically: sometimes at the beginning, sometimes at the end, sometimes buried in the middle.
The result was stark. When the relevant document was first or last, models retrieved it accurately. When it was in the middle positions, accuracy dropped by more than 20% in the worst cases. The longer the context, the worse the effect. Newer models, particularly those fine-tuned on long-context data like Claude 3.5 and GPT-4o, have partially mitigated this bias through training. "Partially" is doing a lot of work there: independent evaluations continue to find meaningful position-dependent performance gaps in all current models, even at lengths well within their advertised limits. Don't assume your model is immune. Test it with your specific content placement before shipping.
Imagine a professor delivers a two-hour lecture. Students reliably remember the opening — they're fresh, attentive, taking notes. They remember the closing — the summary, the final emphasis, the exit. What happened in the middle of hour one? It's murky. The professor said things, but the attention curve dipped. LLMs have an analogous concentration pattern: strong attention to the beginning and end of the context, with a trough in the middle.
The mechanism is structural. The model needs to know where each token sits in the sequence ("word 1" versus "word 10,000") because position affects meaning. Modern transformers encode this using RoPE (Rotary Position Embedding): position is encoded as a rotation applied to the query and key vectors before the similarity comparison. The mathematical property of this rotation is that the similarity score between two vectors tends to decrease as the distance between their positions increases: the further apart two tokens are, the less the model tends to connect them. This is intentional, since nearby words are usually more relevant to each other than distant ones. At short contexts, the decay is a feature. At long contexts, it becomes a bug: tokens in the middle of a 100K-token window are thousands of positions away from where the model is currently generating, so their similarity scores are systematically suppressed. The strong pull toward the very first tokens has a separate, complementary explanation: causal attention tends to park excess weight on the earliest positions, an effect often described as attention sinks.
The practical implication: even when content fits in the window, placement matters. Information you want the model to reliably use should be near the beginning or near the end of the context. Information buried thousands of tokens deep in the middle is at risk of being effectively invisible.
If your RAG system retrieves 8 documents and inserts them in the middle of a long conversation history, the most relevant chunks may be in the attention trough. The model generates a response, you see no error, but the answer doesn't reflect those documents. The failure is silent — the model doesn't say "I couldn't read that." It just uses what it attended to, which may be the surrounding conversation history instead.
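One placement heuristic that follows from the U-shaped curve is to reorder retrieved chunks so the strongest ones sit at the edges of the retrieval block and the weakest in the middle. The sketch assumes the retriever has already sorted chunks from most to least relevant; `place_at_edges` is an illustrative name, and whether the reordering helps a given model is something to verify on your own data.

```python
def place_at_edges(chunks_by_relevance):
    """Given chunks sorted most-relevant first, return an order that puts
    the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: rank 1 to the front, rank 2 to the back, and so on.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant chunk
print(place_at_edges(ranked))       # ['A', 'C', 'E', 'D', 'B']
```

The two highest-ranked chunks end up first and last, while the lowest-ranked chunk lands in the middle where a miss costs the least.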
Follow-up research ("Found in the Middle," 2024) proposed architectural mitigations that calibrate the positional bias during fine-tuning, recovering up to 15 percentage points of accuracy. But these require modifying or fine-tuning the model — they're not available as a prompt-engineering trick. What you can do in practice is discussed in the strategies section.
Context length alone hurts performance even with perfect retrieval
A 2025 study ("Context Length Alone Hurts LLM Performance Despite Perfect Retrieval") confirmed something practitioners had suspected but lacked clean data on: context length degrades performance even when retrieval is perfect — when the relevant information is guaranteed to be present. The longer the surrounding irrelevant context, the more the model's attention distributes across noise, leaving less effective focus on the signal. This is called context dilution. Finding one red marble in a bag of ten is easy; finding it in a bag of ten thousand is not — even if you know it's there.
Token Budget Math — Calculating Your Real Available Space
The prerequisite for managing a context window is knowing precisely where the tokens are going. This requires building a budget — an explicit accounting of every zone in the context and how many tokens it receives.
Think of the context window as a table with five rows:
| Zone | What it contains | Typical token range | Fixed or variable? |
|---|---|---|---|
| System Prompt | Instructions, persona, rules, domain knowledge | 500–2,500 | Fixed per application |
| Tool Schemas | JSON descriptions of available functions/tools | 200–400 per tool | Fixed per agent |
| RAG Context | Retrieved document chunks injected before the query | 0–12,000 | Variable per turn |
| Conversation History | All prior user messages and assistant responses | 0 → grows | Grows each turn |
| Generation Reserve | Space for the model's next response | 500–2,000 | Reserved explicitly |
The total of all five zones must be less than your context limit. The generation reserve must be reserved explicitly — it is not "used up" by input, but if your prompt consumes the entire window, the model either generates nothing or truncates its response.
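The five-zone table translates directly into a small data structure that fails loudly when the zones no longer fit. This is a sketch with illustrative numbers; the class and field names are assumptions, and the token counts should come from measurements with your real tokenizer.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    limit: int               # advertised or self-imposed window size
    system_prompt: int
    tool_schemas: int
    rag: int
    generation_reserve: int

    def history_budget(self) -> int:
        """Tokens left for conversation history after every other zone."""
        left = self.limit - (self.system_prompt + self.tool_schemas
                             + self.rag + self.generation_reserve)
        if left < 0:
            raise ValueError(f"zones exceed the window by {-left} tokens")
        return left

budget = ContextBudget(limit=128_000, system_prompt=1_400, tool_schemas=800,
                       rag=20_000, generation_reserve=1_500)
print(budget.history_budget())  # 104300
```

Setting `limit` to a working threshold, such as 80% of the advertised window, turns the same check into a guard against the quality degradation discussed earlier rather than only against hard errors.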
A worked example. Application: a customer support agent. Model: GPT-4o (128K context).
Context budget allocation:
Total window: 128,000 tokens
System prompt: -1,400 tokens (measured)
Tool schemas (4 tools): -800 tokens (measured)
Generation reserve: -1,500 tokens (set by us)
─────────────────────────────────────────
Available for dynamic: 124,300 tokens
Of that:
RAG budget: 20,000 tokens (5 chunks × 4,000 avg)
History budget: ~104,300 tokens (fills over time)
─────────────────────────────────────────
Turns until 80% full:
80% of 128K = 102,400 prompt tokens
Fixed overhead = 1,400 + 800 = 2,200
Per-turn RAG = 800
Per-turn growth = user avg (60) + model avg (350) = 410
Turns until (2,200 + n × 800 + n × 410) ≥ 102,400
n × 1,210 ≥ 100,200
n ≈ 83 turns (100,200 / 1,210 ≈ 82.8, rounded up)
83 turns sounds comfortable. But this assumes constant 350-token model replies. A user who triggers several detailed answers (tables, code snippets, bullet-heavy explanations) can double the history growth rate to 820 tokens per turn. Because the 800-token RAG cost per turn stays the same, that cuts the figure to about 62 turns before the 80% degradation threshold, not to half.
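The inequality in the worked example has a closed form: the smallest whole number of turns n satisfying fixed + n × per_turn ≥ threshold × limit. A short helper, with an illustrative name, makes it easy to rerun the calculation with your own measurements.

```python
import math

def turns_to_threshold(limit, threshold, fixed, per_turn):
    """Smallest n with fixed + n * per_turn >= threshold * limit."""
    return math.ceil((threshold * limit - fixed) / per_turn)

# Worked example: 2,200 fixed tokens, 800 RAG + 410 history per turn.
print(turns_to_threshold(128_000, 0.80, 2_200, 800 + 410))  # 83
# Doubling history growth to 820 while RAG stays at 800 per turn.
print(turns_to_threshold(128_000, 0.80, 2_200, 800 + 820))  # 62
```

The second call shows why doubling reply length does not halve the runway: only part of the per-turn cost grows, so the turn count falls by about a quarter rather than a half.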
The system prompt and tool schema token counts must be measured with the actual tokenizer — not estimated from character counts. The OpenAI Python SDK exposes a usage field on every API response with prompt_tokens and completion_tokens. Log these from day one. The distribution of prompt_tokens over time is your context growth curve.
Four Strategies for Managing Context Window Limits
At some point in a long-running application, the history will outgrow the window. The engineering question isn't whether that happens — it's whether your system degrades gracefully or crashes. These four strategies address it at different points in the pipeline, with very different implementation costs.
Strategy 1: Sliding Window
The sliding window is the simplest and most commonly deployed strategy. You set a maximum history length — say, the last 20 conversation turns — and enforce it by truncating older turns when the limit is reached. The 21st turn in means the 1st turn out.
Implementation is straightforward:
# Turn-count version: simple, good enough for prototyping
MAX_HISTORY_TURNS = 20  # configurable

def build_messages(system_prompt, history, new_message, rag_chunks):
    # Enforce the sliding window. One turn is a user message plus an
    # assistant reply, so keep two history entries per turn.
    trimmed_history = history[-2 * MAX_HISTORY_TURNS:]
    # Assemble context
    messages = [{"role": "system", "content": system_prompt}]
    if rag_chunks:
        context_block = "\n\n".join(rag_chunks)
        messages.append({"role": "system", "content": f"Context:\n{context_block}"})
    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": new_message})
    return messages
# Production version: truncate by token count, not turn count
import tiktoken

_ENCODING = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text):
    # Counts content tokens only; per-message framing overhead is ignored.
    return len(_ENCODING.encode(text))

# PROMPT_TOKEN_BUDGET caps system prompt + RAG + history + new message.
# Upper bound for a 128K window: 128000 - 800 (tool schemas) - 1500 (reserve) ≈ 125700
PROMPT_TOKEN_BUDGET = 40_000  # adjust for your application

def build_messages_token_bounded(system_prompt, history, new_message, rag_chunks):
    fixed_tokens = count_tokens(system_prompt) + sum(count_tokens(c) for c in rag_chunks)
    new_msg_tokens = count_tokens(new_message)
    # Whatever the fixed parts and the new message leave over goes to history.
    remaining = PROMPT_TOKEN_BUDGET - fixed_tokens - new_msg_tokens
    # Walk history from newest to oldest, keep what fits
    # Collect in reverse, then reverse at the end (avoids O(n²) insert(0,...))
    trimmed_rev = []
    for turn in reversed(history):
        turn_tokens = count_tokens(turn["content"])
        if remaining - turn_tokens < 0:
            break
        trimmed_rev.append(turn)
        remaining -= turn_tokens
    trimmed = list(reversed(trimmed_rev))
    messages = [{"role": "system", "content": system_prompt}]
    if rag_chunks:
        messages.append({"role": "system", "content": "Context:\n" + "\n\n".join(rag_chunks)})
    messages.extend(trimmed)
    messages.append({"role": "user", "content": new_message})
    return messages
The turn-count version is fine for prototypes. In production, a single verbose model response (say, 1,200 tokens) can be worth four normal turns by count, so turn-count truncation gives you a wildly inconsistent budget. Measure prompt_tokens from the previous API response and truncate from the oldest end when the projected total exceeds your budget.
The drawback of the sliding window is abrupt forgetting: when turn 1 drops, any fact established in that turn is gone. If the user's name was introduced in the first message, the model will forget it. For short-lived task-completion agents, this is fine. For long-running conversational assistants, it creates visible gaps.
Strategy 2: Hierarchical Summarization
Hierarchical summarization preserves the spirit of old context without its token weight. The architecture keeps a buffer of the last K verbatim turns, plus a rolling summary of all turns before that. When the history outgrows the buffer, the turns that have aged out of it are compressed into the summary.
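# `llm` stands in for your async LLM client; `complete` is a placeholder method name.
def format_turns(turns):
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)

async def maybe_compress_history(history, summary, buffer_size=10):
    keep = 2 * buffer_size  # one turn = user message + assistant reply
    # Wait until a full extra buffer has accumulated, so the summarization
    # call runs once per buffer_size turns instead of on every turn.
    if len(history) < 2 * keep:
        return history, summary
    verbatim_turns = history[-keep:]
    turns_to_summarize = history[:-keep]
    # Compress older turns into summary
    new_summary = await llm.complete(
        f"Existing summary: {summary}\n\n"
        f"New exchanges to incorporate:\n{format_turns(turns_to_summarize)}\n\n"
        "Update the summary to include these exchanges. "
        "Preserve all concrete facts, decisions, and commitments. "
        "Drop conversational filler. Be dense. Max ~400 tokens."
    )
    return verbatim_turns, new_summary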
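The cost: one additional LLM call per compression cycle. Note that the function above compresses whenever history exceeds the buffer, which after the first 10 turns means every turn. To amortize that to roughly one call per 10 turns, let history grow to about twice the buffer size before compressing. Either way, the overhead is minor in most applications. The summary itself should be capped at a fixed token budget (200–400 tokens) so it doesn't become its own creep problem.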
A critical implementation detail: the summary model call must be non-blocking with respect to the user's message. Don't make the user wait for summarization. Run it asynchronously, or trigger it after the current turn's response has been sent.
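A minimal sketch of the "trigger it after the response has been sent" variant, using asyncio. The ConversationState class and the summarize coroutine are hypothetical; summarize stands in for the real LLM call and just concatenates text so the example runs on its own.

```python
import asyncio


async def summarize(turns, summary):
    """Hypothetical stand-in for the LLM summarization call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return (summary + " " + " | ".join(t["content"] for t in turns)).strip()


class ConversationState:
    def __init__(self, buffer_size=10):
        self.history = []
        self.summary = ""
        self.buffer_size = buffer_size
        self._task = None  # in-flight compression task, if any

    async def _compress(self):
        # Snapshot how many turns to fold in; turns appended while the
        # summarizer is running stay untouched at the end of the list.
        n = len(self.history) - self.buffer_size
        old = self.history[:n]
        self.summary = await summarize(old, self.summary)
        self.history = self.history[n:]

    def after_response_sent(self):
        """Call once the reply has gone out; the user never waits on this."""
        busy = self._task is not None and not self._task.done()
        if not busy and len(self.history) > self.buffer_size:
            self._task = asyncio.create_task(self._compress())
```

The request handler appends turns to history, sends the reply, then calls after_response_sent(). Until the background task finishes, the next request simply sees the slightly longer uncompressed history.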
Strategy 3: Token Compression with LLMLingua
LLMLingua and its successors (LLMLingua-2, LongLLMLingua) approach the compression problem at the token level rather than the content level. Instead of summarizing, they identify individual tokens in the prompt that carry low information content — tokens the model could predict easily from context — and remove them.
The result is a compressed prompt that looks slightly garbled to a human but retains its semantic content for the LLM. Research benchmarks show 2–3× compression with accuracy loss under 5% — but that figure is measured on information-retrieval and general QA benchmarks. On tasks requiring exact recall (code, precise numbers, legal clauses), accuracy loss is higher and harder to predict. Use it where meaning matters more than precision. The most effective targets are:
- Verbose system prompts — lengthy instructions with redundant phrases and hedging language
- RAG context chunks — retrieved passages often contain boilerplate text surrounding the key facts
- Few-shot examples — demonstration examples can often be compressed without losing the pattern they demonstrate
Token compression is not suitable for content where every word matters: code that will be executed, legal clauses, precise numbers, or any content where errors compound. For tasks where precision is critical, test compression in your specific domain before deploying it.
A compressor like LLMLingua works on whatever text you hand it. If that text includes the user's current message, you may alter what the user said before the model sees it. Apply compression only to system, tool, RAG, and history zones, never to the current user turn.
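The zone rule can be enforced in code rather than by convention. The sketch below routes each zone through a compressor only if it is on an allow-list. The toy_compress function is a deliberately crude stand-in (it drops a few filler words) and is not LLMLingua; in practice you would pass your real compressor as the compress argument. Zone names are assumptions for illustration.

```python
from typing import Callable

# Zones that may be compressed. The current user turn is deliberately absent.
COMPRESSIBLE_ZONES = {"system", "tools", "rag", "history"}

FILLER = {"the", "a", "an", "please", "very", "really", "just", "that"}


def toy_compress(text: str) -> str:
    """Stand-in compressor that drops common filler words. NOT LLMLingua."""
    return " ".join(w for w in text.split() if w.lower() not in FILLER)


def compress_zones(zones: dict[str, str],
                   compress: Callable[[str], str] = toy_compress) -> dict[str, str]:
    """Compress allow-listed zones; pass every other zone through untouched."""
    return {
        name: compress(text) if name in COMPRESSIBLE_ZONES else text
        for name, text in zones.items()
    }
```

Using an allow-list rather than a deny-list means a newly added zone is left uncompressed until someone decides it is safe.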
Strategy 4: Embedding-based Retrieval Over History
This is the most architecturally sophisticated strategy and the most powerful for long-running agents. Rather than keeping a sliding window or a summary, you store every conversation turn as a dense vector embedding. At each new turn, you embed the current user message and run similarity search over the history embeddings to retrieve the most relevant prior exchanges.
The effect: only the conversation history that is relevant to the current question enters the context window. A user asking "what was the budget we discussed?" triggers retrieval of the turns where budget was mentioned — even if those turns happened fifty exchanges ago and would have been dropped by a sliding window.
This is Conversational RAG — the same mechanism as document RAG, applied to the conversation itself instead of a knowledge base. Here is how it works concretely: as each conversation turn completes, you embed that turn (the user's message plus the assistant's response, concatenated) into a dense vector and store it alongside the turn's full text. When the next user message arrives, you embed it and run a nearest-neighbor search over all stored turn embeddings. The top-k turns by similarity — maybe 3–5 — are retrieved as text and injected into the context window, alongside only the last 2–3 verbatim turns for coherence. All other history stays in the vector store, never entering the context.
Because the "documents" here are conversation turns, the setup needs three components:
- An embedding model to vectorize turns (e.g. text-embedding-3-small)
- A vector store for history embeddings (in-memory for short sessions; persistent across sessions for long-running assistants)
- A retrieval call per user message to find the top-k relevant history turns before building the context
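The flow described above can be sketched end to end in a few lines. To stay self-contained, this example uses a toy bag-of-words vector in place of a real embedding model and a plain list in place of a vector store; HistoryStore, embed, and cosine are illustrative names, and a production system would swap in an actual embedding API and index.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class HistoryStore:
    def __init__(self):
        self.turns = []  # list of (turn_text, vector)

    def add_turn(self, user_msg, assistant_msg):
        # Embed the user message and assistant reply together as one turn.
        text = f"User: {user_msg}\nAssistant: {assistant_msg}"
        self.turns.append((text, embed(text)))

    def build_history(self, query, top_k=3, recent=2):
        """Return the top-k relevant older turns plus the last `recent` verbatim turns."""
        split = max(len(self.turns) - recent, 0)
        older, latest = self.turns[:split], self.turns[split:]
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in older]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        retrieved = [text for score, text in scored[:top_k] if score > 0]
        return retrieved + [text for text, _ in latest]
```

Only the returned turns enter the context window; everything else stays in the store.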
The tradeoff is complexity and latency. Each turn requires an embedding call and a vector search before building the context. With a managed embedding API (OpenAI, Voyage, Cohere), that adds roughly 50–150ms round-trip. With a self-hosted embedding model, it can be under 10ms. For most chat applications, either is fine. For real-time voice, where end-to-end latency budgets are tight, you'd want a local model.
The strategies aren't mutually exclusive. Production systems often combine them: a sliding window of 5–8 verbatim turns, a rolling summary for the next tier back, and embedding retrieval for anything older. That covers all three distance scales in one architecture.
Embedding-based retrieval over history has an exact analogue in AI coding tools. When an AI code assistant spends a session researching a codebase — tracing call graphs, locating edge cases, identifying key file paths — that knowledge exists only in the session's context window. When the session ends, the knowledge is gone. The next session re-discovers everything from scratch, re-reading the same files and re-grepping the same symbols.
A benchmark across the Apache Camel codebase (5,856 files) measured this re-discovery cost directly. Without session memory, a second-session implementation agent made 51 tool calls to rebuild its mental model and still failed on one task, producing 0 bytes. With structured notes stored via vectr_remember during the research session and retrieved via vectr_recall at the start of the implementation session, the same tasks completed with 40% fewer input tokens and 58% lower cost. The mechanism is the same as Conversational RAG: store context as structured notes, retrieve only what is needed, avoid re-computing what is already known.
The Practical Playbook
Four strategies, but which one for your application? That depends on your conversation length, your RAG budget, and your tolerance for implementation complexity. Here's the decision logic for the most common cases.
Short task-completion agents (under 20 turns expected)
Use a sliding window of the last 10–15 turns. The task will complete before history becomes a problem. Reserve your optimization effort for fixed-cost reduction: measure your system prompt and tool schemas, remove redundant language, consider whether all tools need to be registered for every call or if you can load tools dynamically based on the current turn's context.
Long-running conversational assistants
Implement hierarchical summarization. Keep 8–12 verbatim turns. Run summarization asynchronously when history exceeds 15–20 turns. Cap summaries at 400 tokens. This keeps the history zone bounded while preserving continuity. Combine with a fixed-cost audit: every hundred turns or so, re-measure your system prompt token count. Prompt creep — the tendency of system prompts to accumulate new instructions, edge-case rules, and clarifications over time — is real and common. A prompt that started at 600 tokens can quietly grow to 3,000 across six months of product changes.
Document-heavy research assistants (heavy RAG use)
The RAG zone is your biggest variable. Three tactics: (a) limit retrieval to 3–5 top chunks rather than 8–10; (b) apply token compression to chunks before injection, since chunks compressed 2× let twice as many sources fit in the same budget; (c) order retrieved chunks so the most relevant appear first and last in the context, not in the middle, to exploit the primacy/recency attention bias.
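Tactics (a) and (c) combine naturally. As a small sketch, assuming the retriever returns chunks sorted most-relevant-first, the hypothetical helper below caps the list and then deals chunks alternately to the front and back so the weakest ones land in the middle:

```python
def edges_first(ranked_chunks, top_k=5):
    """Cap to top_k, then place the strongest chunks at both ends.

    ranked_chunks must be sorted most-relevant-first.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks[:top_k]):
        # Even ranks fill from the start, odd ranks fill from the end.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With five chunks ranked 1 (best) to 5, the resulting order is 1, 3, 5, 4, 2: the two best chunks sit at the edges of the RAG zone.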
Production AI agents with many tools
Dynamic tool registration reduces the tool schema overhead significantly. Instead of registering all tools upfront, analyze the current conversation turn to determine which tools are likely needed, and only include schemas for those. A routing classifier — even a simple keyword matcher — can reduce a 10-tool schema overhead from ~2,000 tokens to ~400 tokens on most turns. More sophisticated: use the model itself to route, with a cheaper, smaller model making the tool-selection decision before the main model call.
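Here is roughly what the simple keyword matcher could look like. The tool names, keywords, and schemas are invented for illustration, and the fallback to the full tool set when nothing matches is a design assumption so that a routing miss never leaves the model without tools.

```python
# Hypothetical tool registry: keywords for routing plus the schema to send.
TOOLS = {
    "get_weather": {
        "keywords": ("weather", "forecast", "temperature"),
        "schema": {"type": "function", "function": {
            "name": "get_weather",
            "description": "Look up the weather for a city."}},
    },
    "search_orders": {
        "keywords": ("order", "shipment", "tracking"),
        "schema": {"type": "function", "function": {
            "name": "search_orders",
            "description": "Find a customer's orders."}},
    },
    "create_ticket": {
        "keywords": ("ticket", "bug", "issue"),
        "schema": {"type": "function", "function": {
            "name": "create_ticket",
            "description": "Open a support ticket."}},
    },
}


def select_tool_schemas(message, tools=TOOLS):
    """Return schemas for tools whose keywords appear in the message."""
    text = message.lower()
    matched = [spec["schema"] for spec in tools.values()
               if any(keyword in text for keyword in spec["keywords"])]
    # Fall back to every tool if the router found nothing.
    return matched or [spec["schema"] for spec in tools.values()]
```

Substring matching is crude ("bug" also matches "debug"), which is acceptable here because a false positive only costs a few extra schema tokens.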
What to monitor in production
Three metrics to track on every API call:
- prompt_tokens / context_limit — the context utilization ratio. Alert above 70%; act above 80%.
- prompt_tokens by zone — instrument your context-building code to tag each zone's contribution. When total grows, you need to know which zone is responsible.
- Quality signals near high utilization — track user satisfaction scores, task completion rates, or whatever quality signal your application has, segmented by context utilization. You may find quality degrades measurably above 60% utilization in your specific application.
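The first two metrics fall out of the same bookkeeping. A minimal sketch, assuming your context builder already records a token count per zone (the zone names and the utilization_report helper are illustrative):

```python
def utilization_report(zone_tokens, context_limit, warn=0.70, act=0.80):
    """Summarize context usage from per-zone token counts."""
    total = sum(zone_tokens.values())
    ratio = total / context_limit
    if ratio > act:
        status = "act"
    elif ratio > warn:
        status = "warn"
    else:
        status = "ok"
    return {
        "total_tokens": total,
        "utilization": ratio,
        "status": status,
        # The zone to look at first when the total grows.
        "largest_zone": max(zone_tokens, key=zone_tokens.get),
    }


report = utilization_report(
    {"system": 3000, "tools": 2000, "rag": 40000, "history": 50000, "user": 1000},
    context_limit=128_000,
)
```

Logging this dictionary alongside each request also gives you the utilization dimension needed to segment quality signals for the third metric.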
Treat the context window as a RAM budget, not a document store. RAM budgets are managed explicitly: you know what is loaded, you evict what is no longer needed, you measure what you're using, and you alert before you run out. Applying that same discipline to context windows is the difference between a system that degrades silently and one that remains reliable at scale.
Positioning for the "lost in the middle" effect
One architectural recommendation that costs nothing in tokens but requires discipline: order matters. When assembling your context, place the most important information at the positions of highest attention reliability — the beginning and the end. The default ordering used by most frameworks (LangChain, LlamaIndex) is: system → history → RAG chunks → user message. This is natural but suboptimal: the user's current question benefits from recency, but the most relevant retrieved chunks are buried in the middle between old history and the question.
A deliberate alternative: system → recent history (most-recent last) → RAG chunks (most relevant last, adjacent to the user message) → current user message. The most relevant RAG chunk and the current question sit adjacent at the end of the context, within the recency attention peak. The system prompt anchors the beginning. Older history — the least relevant content — occupies the lower-attention middle. This is a deviation from framework defaults and requires explicit control over message assembly, but it is supported on all major APIs through the messages array.
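The deliberate ordering can be sketched as a small assembly function. It assumes ranked_chunks arrives most-relevant-first and that the API accepts a second system-role message for retrieved context; where the system prompt is a separate parameter instead, the context block could be prepended to the user message. The function name is illustrative.

```python
def assemble_messages(system_prompt, recent_history, ranked_chunks, user_message):
    """Build messages as: system, recent history, RAG chunks, user message."""
    messages = [{"role": "system", "content": system_prompt}]
    # recent_history is oldest-first, so the most recent turn comes last.
    messages.extend(recent_history)
    if ranked_chunks:
        # Reverse the ranking so the best chunk sits right before the question.
        context = "\n\n".join(reversed(ranked_chunks))
        messages.append({"role": "system", "content": "Context:\n" + context})
    messages.append({"role": "user", "content": user_message})
    return messages
```

Keeping this in one function makes the ordering an explicit, testable decision rather than a side effect of framework defaults.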
The Window Is a System Resource
A context window isn't a document store you fill until it overflows. It's a compute and memory resource with hard physical limits, a quality curve that degrades well before those limits, and an inference cost that grows with every token you put in it.
In a typical agent, the window is 30–60% consumed before the first user message lands. The fix isn't a bigger context window, though headroom helps. It's building a real budget: measure each zone with an actual tokenizer, set hard limits per zone, implement a context manager that enforces those limits on every call, and track utilization in production dashboards the same way you'd track memory or CPU.
The attention degradation problem — "lost in the middle" — adds a second dimension: even when your window is not full, quality depends on where in the window the important information sits. The primacy bias (models attend strongly to content at the start) and recency bias (strong attention to content at the end) are real, measurable effects that application design can exploit or fall victim to.
The four strategies aren't competitors — most production systems end up combining them. Sliding window for the recent turns, rolling summary for the older ones, compression for the RAG chunks, and retrieval for anything that needs to survive beyond the window. Start with the simplest thing that doesn't break your use case, and add layers as your traffic and conversation length grow. The one thing that definitely doesn't work is ignoring the problem until users start complaining.
Context engineering doesn't have the glamour of prompt engineering, but it's where most production LLM failures actually live. Missed retrievals, incoherent multi-turn conversations, and bloated inference bills trace back to context mismanagement more often than to the wrong model. Context mismanagement fails silently, which is exactly why it's easy to ignore until you can't.
References & Further Reading
Research Papers
- Lost in the Middle: How Language Models Use Long Contexts
- Found in the Middle: Calibrating Positional Attention Bias
- Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
- KVQuant: Towards 10M Context Length LLM Inference with KV Cache Quantization
Technical References
- OpenAI — Managing Conversation State
- Anthropic — Context Window Documentation
- LLMLingua — Prompt Compression
- KV Cache Memory: Calculating GPU Requirements for LLM Inference
Background Reading
- The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems
- Context Window Management for LLM Apps: Developer Guide
- The Complete Guide to Text Embeddings, Vector Databases & LLMs