Why AI Code Assistants Waste Your Context Window — and How RAG Fixes It
A large context window does not mean a better context window. The math of attention explains why stuffing your whole codebase into a prompt is quietly destroying suggestion quality — and how retrieval-augmented generation fixes it by sending only the right code.
Open a large file in your AI code assistant and ask it to refactor a function buried three hundred lines down. Watch it confidently produce something plausible but wrong — using an interface that was deprecated last sprint, calling a helper that doesn't exist in this service, ignoring a constraint in the module-level docstring that it technically "saw." The model didn't forget. The information was technically present in the prompt, but the transformer's attention mechanism never meaningfully focused on it. That's a different kind of failure, and it doesn't get better with a bigger context window.
There's a persistent intuition in this industry that more context is always better. Send the whole file. Send the whole codebase. This intuition breaks in a specific and measurable way. The mechanism is called attention dilution — softmax normalization means that every token in the context competes for a fixed budget of attention weight, and as the sequence grows longer, any given piece of information gets a smaller share of that budget.
This post walks through the transformer attention math to explain exactly why the naive approach fails, then covers how RAG (Retrieval-Augmented Generation) addresses it — by retrieving only the specific code chunks relevant to the current task and injecting those into the context window instead of dumping everything.
The Naive Approach: Just Send Everything
The first instinct when building a code assistant is to send as much context as possible. Your project has a utility module? Include it. There's a shared type definitions file? Throw that in too. If the model's context window is 128,000 tokens, fill it to the brim — more information has to be better, right?
This is called context window stuffing. Three things go wrong with it, and each gets worse as the codebase grows. The first is attention dilution, covered in this section and the next. The second is position bias (see "Lost in the Middle: Position Bias"). The third is raw cost (see "The Quadratic Cost Problem"). To understand why these happen, you need a concrete model of how a transformer actually reads a prompt.
What the model actually receives
A transformer does not read a prompt sequentially, the way a human reads a page from left to right. Instead, it processes all tokens simultaneously, and every token attends to every other token in the sequence (in decoder-only models, to itself and every earlier token). The attention mechanism is the machine that computes how much each token should "look at" every other token when forming its representation.
The output of attention for a single token is a weighted average of all the other tokens' value vectors. The weights are computed by comparing the current token's query vector against every other token's key vector. When you add more tokens to the context, you are not adding more information to a receptive mind — you are adding more competitors for a fixed budget of attention weight.
Imagine you are in a room full of people, all talking at once. You can only pay 100 percent of your attention total — it does not grow with the number of people. With 5 people in the room, each gets roughly 20% of your focus. With 500, each gets 0.2%. When the relevant person finally says something, their share of your attention has collapsed to noise. That is what happens to code buried in a long prompt.
Why Attention Dilutes: The Math
The attention mechanism was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). Its core computation is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$
Where:
Q: the query matrix (what each token is "asking for")
K: the key matrix (what each token "offers" for comparison)
V: the value matrix (the actual content passed forward if selected)
d_k: the dimension of the key vectors (scales to prevent extreme dot products)
softmax: converts a vector of raw scores into a probability distribution that sums to 1
The notation QK^T means: for each token, compute a dot product between its query vector and every token's key vector (its own included). The dot product is large when two vectors point in the same direction (high relevance between the pair), and near zero when they are orthogonal (unrelated). Multiplying by the transposed key matrix K^T does all N×N such comparisons in a single matrix operation. The result is a matrix of raw relevance scores. Dividing by √d_k prevents those scores from becoming so large that softmax saturates (all weight on one token).
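To make the formula concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes tiny hand-written matrices, omits causal masking and learned projections, and the helper names (`softmax`, `attention`) are illustrative rather than taken from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small list-of-lists matrices."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One raw relevance score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: 3 tokens, d_k = 2 (values chosen arbitrarily)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Every row of weights sums to 1 no matter how many keys there are, which is the fixed budget the rest of this section is about.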
The softmax step is the dilution mechanism. Because softmax always outputs a probability distribution — all values sum to exactly 1 — attention weights are a zero-sum resource. When there are N tokens in the context, the average attention weight is 1/N, regardless of what any individual token does. The total budget is fixed at 1.0.
This does not mean every token gets exactly equal attention — the model can still concentrate on a small subset if the dot-product scores separate those tokens sharply from the rest. Softmax is non-linear and can be quite aggressive when there is a large score gap between relevant and irrelevant tokens. But in a real codebase, that gap is rarely clean. Hundreds of unrelated function definitions produce hundreds of tokens with moderately non-zero dot products — they're not completely irrelevant, they just aren't what you need right now. These tokens collectively consume most of the softmax budget. The useful signal — the few functions that actually matter — must compete against this crowd, and as N grows, the signal's share degrades continuously. It isn't a cliff; it's a steady erosion that compounds with each additional file you stuff in.
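The erosion can be put in numbers with a deliberately simplified model: one relevant token scores a fixed margin above every distractor, and all distractors score 0. That uniform-distractor assumption, the function name `relevant_share`, and the gap of 8.0 are all illustrative choices, not measurements from a real model.

```python
import math

def relevant_share(n_tokens, score_gap, n_relevant=1):
    """Softmax weight landing on the relevant tokens when they score
    `score_gap` higher than every other (distractor) token."""
    relevant_mass = n_relevant * math.exp(score_gap)
    distractor_mass = (n_tokens - n_relevant) * 1.0  # exp(0) per distractor
    return relevant_mass / (relevant_mass + distractor_mass)

for n in (1_000, 10_000, 100_000):
    print(n, round(relevant_share(n, score_gap=8.0), 3))
```

With the score gap held constant, the relevant token's share falls from roughly three quarters at 1,000 tokens to about 3% at 100,000. Nothing about the token's relevance changed; only the size of the crowd did.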
Softmax sharpening and temperature
The scaling factor √dk in the formula is there specifically to control how "peaky" the attention distribution is. Without it, dot products in high-dimensional spaces get very large, pushing softmax into saturation — where almost all weight goes to the single highest-scoring token. The scaling keeps the distribution smooth enough for gradients to flow during training. But the same property means that with a very long sequence, the model is doing softmax over hundreds of thousands of values. The distribution becomes nearly uniform across irrelevant tokens, and the useful signal struggles to dominate.
The context window limit is not just a practical engineering constraint — it reflects a genuine quality degradation. The problem is not that the model cannot read long inputs. It is that as context grows, every individual piece of information receives proportionally less attention weight. More input does not mean more comprehension; it means each fact competes harder for finite attentional resources.
Lost in the Middle: Position Bias
Attention dilution is one problem. A second, independent problem compounds it: position bias. Modern language models do not attend to all positions in their context with equal reliability. They preferentially attend to tokens at the beginning and end of the sequence, and perform significantly worse on information placed in the middle.
This phenomenon was studied in a 2023 paper by Nelson Liu et al. titled Lost in the Middle: How Language Models Use Long Contexts. The researchers tested models on multi-document question answering, varying the position of the document containing the answer. The results were sharp: when the answer document was at position 1 (the very beginning) or the last position, accuracy was high. When it sat near the middle (for example, position 10 of 20 documents), accuracy dropped markedly; the paper reports degradation of more than 20% for GPT-3.5-Turbo, even though the information was technically within the model's context window.
Think of a book report written by someone who only reads the first chapter and the last chapter, and skims the middle. Everything in between is "in the book" — but it functionally does not influence the report. That is what happens to code you inject into the middle of a long prompt. It exists in the context. It is not reliably attended to.
Why does this happen architecturally?
Two mechanisms contribute. The first is RoPE (Rotary Position Embeddings), which is the positional encoding scheme in most modern open-source language models (LLaMA, Mistral, GPT-NeoX). RoPE encodes position by rotating the query and key vectors by angles proportional to their positions. The dot product between a query at position m and a key at position n then depends on the relative distance (m−n), and its magnitude tends to decay as that distance grows, so semantically relevant tokens far from the query position must overcome a rotational penalty to receive attention weight. Tokens near the start of the sequence have a separate structural advantage: under causal masking they are visible to every later position.
The second mechanism is causal training bias. Language models are trained to predict the next token given all previous tokens. This reward signal pushes models to weight recent tokens heavily — the immediately preceding context is almost always the most relevant signal for next-token prediction during training. The middle of a long context rarely dominated training gradients, so models systematically underweight it. This effect was documented in GPT-3.5 era models well before RoPE became standard — it isn't purely an artifact of positional encoding, it's baked into causal pretraining. Both effects run in the same direction: the middle of a long context is structurally disadvantaged.
A 2024 paper from the University of Washington, MIT, and Google (Found in the Middle) demonstrated that this bias can be partially corrected by calibrating attention weights at inference time — but this requires modifying the model's internals, something you cannot do when calling an API.
For practical purposes, if you are a user or a developer calling a commercial API, the position bias is a fixed environmental constraint. The implication is direct: you cannot mitigate it by stuffing more context. You can only mitigate it by keeping the context short and placing the relevant content near the beginning or the end, which means retrieving it selectively rather than including everything and hoping.
Many teams inject retrieved chunks at the end of the prompt, after a long system prompt and conversation history, reasoning that "the model will see it just before generating." This lands retrieved content in a position that gets the worst of both worlds: far from the beginning (losing the primacy advantage) and not at the very end (which is reserved for the generation target itself). The safest placement for retrieved code context is immediately before the user's specific question, near the end but not buried in the middle of a long history.
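One way to enforce that placement is to assemble the prompt from ordered sections. The sketch below is a hypothetical helper, not any vendor's prompt format: the function name, the chunk dictionary keys (`path`, `start`, `end`, `text`), and the comment-style header are all assumptions made for illustration.

```python
def build_prompt(system_prompt, history, retrieved_chunks, question):
    """Assemble prompt sections so retrieved code sits right before the question."""
    sections = [system_prompt]
    sections.extend(history)  # older conversation turns stay in the middle
    for chunk in retrieved_chunks:
        # File path and line range let the model tell the chunks apart
        header = f"# {chunk['path']} (lines {chunk['start']}-{chunk['end']})"
        sections.append(f"{header}\n{chunk['text']}")
    sections.append(question)  # the question always comes last
    return "\n\n".join(sections)

prompt = build_prompt(
    "You are a code assistant.",
    ["user: earlier turn", "assistant: earlier reply"],
    [{"path": "payments/gateway.py", "start": 10, "end": 12,
      "text": "def process_refund(txn_id):\n    ..."}],
    "Why does process_refund ignore the currency?",
)
print(prompt)
```

Keeping the ordering in one function makes it harder for a later template change to push retrieved code back into the middle of a long history.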
The Quadratic Cost Problem
Even if you were willing to accept degraded attention quality, there is a third reason not to stuff context: the compute cost of attention scales quadratically with sequence length.
To compute the full attention matrix, the model must compare every token's query against every other token's key. If your sequence has N tokens, this requires N × N comparisons — producing an N² attention matrix. Doubling the context length quadruples the compute required for attention. This is not a limitation of current hardware; it is a mathematical property of the full-attention transformer architecture.
Time complexity of full self-attention: O(N²·d)
Where N = sequence length, d = model dimension.
Concretely: a 4× increase in context length → 16× increase in attention compute. A 10× increase → 100×. This is why inference latency grows non-linearly as context length grows.
FlashAttention (Dao et al., 2022) improves the memory profile to O(N) by tiling — it never writes the full N×N matrix to GPU DRAM. But the number of floating-point operations is still O(N²). Latency and cost still scale quadratically with sequence length.
In production, this means that a code assistant filling 100,000 tokens of context is not just 10× slower than one filling 10,000 tokens — it is closer to 100× more expensive in attention compute alone (before accounting for the rest of the forward pass). It is also 10× more expensive in the API pricing sense, since most APIs charge per input token. You are paying more to get worse results.
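The arithmetic is simple enough to check directly. This sketch assumes the O(N²·d) model above with d held fixed, and a flat per-input-token price; both function names are illustrative.

```python
def attention_cost_ratio(n_old, n_new):
    """Relative attention FLOPs under the O(N^2 * d) model, d held fixed."""
    return (n_new / n_old) ** 2

def input_price_ratio(n_old, n_new):
    """Relative input cost under flat per-token pricing (an assumption)."""
    return n_new / n_old

print(attention_cost_ratio(10_000, 100_000))  # 100.0
print(input_price_ratio(10_000, 100_000))     # 10.0
```

The two ratios diverge because the bill scales linearly with tokens while the attention compute behind it scales quadratically.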
RAG at a Glance: The Core Idea
Retrieval-Augmented Generation reframes the problem. Instead of asking "how can we give the model the whole codebase?", it asks "how do we figure out which parts of the codebase are relevant to this specific completion request, and send only those?"
The answer has two phases. First, an offline indexing phase where the codebase is processed, divided into chunks, and each chunk is converted into a vector representation (an embedding) that captures its semantic meaning. These vectors are stored in an index optimized for fast similarity search. Second, an online retrieval phase that happens at query time: the developer's current context (cursor position, open file, recent changes) is converted into a query vector, and the most similar chunks from the index are retrieved and injected into the prompt.
The model then receives a context window that is not a random cross-section of the codebase — it is the small set of pieces most likely to be relevant to the task at hand. This keeps the context window small, focused, and full of signal rather than noise.
The indexing steps happen once (or on incremental file changes). The retrieval steps happen on every completion request. The parts where most implementations go wrong: chunking (using fixed-size splits instead of AST boundaries), retrieval (using only dense search and missing exact identifier queries), and injection order (burying retrieved context in the middle of the prompt). The sections that follow cover each.
Chunking for Code: Why Fixed-Size Fails
The purpose of chunking is to divide the codebase into units that can be individually embedded and retrieved. The naive approach is fixed-size chunking: split every file every 256 tokens, regardless of where that lands in the code structure. It is easy to implement. It is almost always wrong for code.
Code has structure that text does not. A function is a unit of meaning. A class definition with its methods is a unit of meaning. Splitting in the middle of a function produces two chunks, each of which is semantically incomplete — one has the function signature and the early logic, the other has the return path. When embedded, neither chunk represents the function accurately. When retrieved, neither chunk gives the model what it needs to understand the function's contract.
Fixed-size chunking failure mode for code RAG
Consider a Python function that is 80 lines long. With a 50-token chunk size, it gets split into chunks that might look like:
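    # chunk 1: signature and early logic (boundary position illustrative)
    def process_payment(order_id, amount,
                        currency="USD"):
        """Process a payment..."""
        conn = get_db_connection()
        try:
            txn = conn.begin_transaction(
                order_id=order_id,
                amount=amount,
                currency=currency
            )
    # chunk 2: starts mid-function, with no signature or docstring
        except DatabaseError as e:
            log_error(e)
            raise PaymentError(str(e))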
AST-based chunking
AST-based chunking uses a language parser — specifically tree-sitter — to parse each file and extract logical units. Instead of asking "where does the 256th token fall?", it asks "where are the function boundaries, class boundaries, and docstrings in this file?"
Each extracted unit becomes a chunk. A function definition — signature, body, docstring, and any inline comments — stays together. The chunk's metadata stores its file path, start line, end line, and the type of unit (function, class, method, module-level code). This metadata is as important as the chunk text itself: it tells the retrieval system where in the codebase this chunk lives, enabling it to surface related chunks from the same file.
    # chunk: function
    def process_payment(order_id, amount, currency="USD"):
        """Process a payment for the given order.
        Raises PaymentError on failure."""
        ...
        # complete function body

    # chunk: class
    class PaymentGateway:
        """Manages payment provider connections.
        Thread-safe singleton."""
        _instance = None
        ...

    # chunk: module-level imports
    from payments.models import Order, Transaction
    from payments.exceptions import PaymentError
    from db import get_db_connection
AST chunking can produce variable-size chunks. A 10-line function and a 200-line class body are both valid chunks under this scheme. For very large functions or classes, a secondary split is sometimes applied — either at method boundaries within the class, or using a maximum token limit with the caveat that the split occurs at a statement boundary, never mid-expression. The right upper bound depends on the embedding model's context window (typically 512–8,192 tokens) and the target chunk granularity.
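The idea can be sketched with Python's standard `ast` module standing in for tree-sitter. This is a substitution for illustration only: it handles Python source, looks only at top-level definitions, skips other module-level code, and leaves decorator lines out of the chunk text. The chunk dictionary layout and the sample path are assumptions.

```python
import ast

def ast_chunks(source, path="<memory>"):
    """Split Python source into top-level function/class chunks plus one
    chunk that collects the module-level imports."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, import_lines = [], []
    for node in tree.body:
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            import_lines.append(text)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({"path": path, "type": kind, "name": node.name,
                           "start": node.lineno, "end": node.end_lineno,
                           "text": text})
    if import_lines:
        chunks.append({"path": path, "type": "imports", "name": None,
                       "start": None, "end": None,
                       "text": "\n".join(import_lines)})
    return chunks

SAMPLE = '''from db import get_db_connection

def process_payment(order_id, amount, currency="USD"):
    """Process a payment for the given order."""
    conn = get_db_connection()
    return conn

class PaymentGateway:
    """Manages payment provider connections."""
    _instance = None
'''

for c in ast_chunks(SAMPLE, "payments/service.py"):
    print(c["type"], c["name"], c["start"], c["end"])
```

Each chunk carries its own line range, so a secondary split for oversized classes can be layered on later without changing the chunk format.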
Sliding-window augmentation
One practical addition: after AST chunking, each chunk is sometimes augmented with a small surrounding context — the preceding import block, the class it belongs to, or the file's module-level docstring. This gives the embedding model enough context to produce a vector that reflects not just the chunk's local meaning but its role in the larger structure. The key is that this surrounding context is used only for embedding, not retrieved as part of the chunk text. The retrieved text remains the original chunk.
Sliding window overlap (copying N tokens from one chunk as the start of the next) is useful in prose documents where the narrative flows continuously. In code it often makes things worse: the overlap introduces duplicate logic into separate chunks, making embedding space crowded with near-identical vectors. For code, the recommended approach is to store a "parent context" chunk separately — always inject the enclosing class signature or module-level import block alongside any function chunk, rather than copying the previous function's body into the current chunk. The Continue open-source IDE extension uses this approach in its codebase indexing.
Retrieval Strategies: Dense, Sparse, and Why Code Needs Both
Once the codebase is indexed, retrieval is the step that determines which chunks are surfaced. There are two fundamentally different ways to search a corpus of text, and code needs both simultaneously.
Dense retrieval: semantic code search with embeddings
Dense retrieval converts both the query and each stored chunk into a numeric vector (an embedding), then finds chunks whose vectors are closest to the query vector by cosine similarity. This approach is powerful because it can match meaning even when the exact words differ. A query like "how do we handle rate limit errors?" will surface functions named throttle_on_429 or backoff_retry — code that addresses the same concern using completely different identifiers.
The embedding model used matters a great deal for code. General-purpose text embedding models (trained primarily on natural-language text) perform poorly on code because their training data underrepresents the structural and syntactic conventions of programming languages. Code-specialized models like voyage-code-3 — purpose-built for code retrieval — produce substantially better representations for function bodies, type signatures, and API calls than general models like text-embedding-3-large, which is a strong general-purpose embedding model but wasn't specifically designed around code.
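Mechanically, dense retrieval reduces to a nearest-neighbor search over vectors. The sketch below uses brute-force cosine similarity and hand-made 3-dimensional vectors as stand-ins for real embeddings; a production system would call an embedding model and an approximate-nearest-neighbor index instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dense_top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector). Brute-force search, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Hand-made vectors standing in for real embeddings (illustrative values)
index = [
    ("throttle_on_429", [0.9, 0.1, 0.0]),
    ("backoff_retry",   [0.8, 0.3, 0.1]),
    ("render_invoice",  [0.0, 0.2, 0.9]),
]
print(dense_top_k([1.0, 0.2, 0.0], index, k=2))
```

The query vector here plays the role of "how do we handle rate limit errors?": it shares no identifier with either result, only a direction in embedding space.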
Sparse retrieval: BM25 keyword search for exact identifiers
BM25 is a classical information retrieval algorithm that scores documents by how well they match the exact terms in a query. It does not understand meaning; it counts words. But for code, exact keyword matching is often exactly what you want.
If the developer is working on a bug involving PaymentGateway.process_refund, a dense embedding search might return several semantically related functions — but the exact function they need might not score highest on semantic similarity. A BM25 search for the exact string process_refund will find it immediately. Similarly, error codes, configuration key names, and exact API method names are better retrieved lexically than semantically.
Query: "process_refund implementation"
BM25 (sparse) results:
PaymentGateway.process_refund: exact name match
RefundProcessor.execute: contains "refund"
validate_refund_amount: contains "refund"
Missed by BM25: reverse_charge, semantically related but no keyword match
Dense (embedding) results:
reverse_charge: high semantic similarity
cancel_transaction: semantically related
PaymentGateway.process_refund: correct but ranked lower
Missed by dense: REFUND_TIMEOUT_SEC = 30, a one-line constant; short chunk, sparse embedding
Neither method alone captures all the relevant chunks.
For prose documents (product documentation, knowledge bases), dense retrieval typically outperforms BM25. For code, the opposite is often true on exact-identifier queries. Code has a high density of unique, domain-specific identifiers — function names, class names, constant names — that appear nowhere in the embedding model's training data and thus have poor semantic representations. For these, BM25 is reliably better. The right system runs both and combines the results.
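For the sparse side, here is a compact sketch of Okapi BM25 in plain Python. The tokenizer that keeps snake_case identifiers whole, the parameter defaults (k1=1.5, b=0.75), and the sample chunks are assumptions for illustration; a real system would more likely use an existing search library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercased identifier-like tokens; snake_case names stay whole."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """docs: dict of chunk_id -> text. Returns ids with score > 0, best first."""
    if not docs:
        return []
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))  # count each term once per document
    scores = {}
    for doc_id, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "gateway.process_refund": "def process_refund(self, txn_id): return self.provider.refund(txn_id)",
    "reverse_charge": "def reverse_charge(txn_id): return ledger.undo(txn_id)",
    "validate_refund_amount": "def validate_refund_amount(amount): return amount > 0",
}
print(bm25_rank("process_refund", docs))
```

The exact identifier is found immediately, and `reverse_charge` is not returned at all, which is the gap dense retrieval fills.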
Hybrid Search and Reciprocal Rank Fusion
Running both BM25 and dense retrieval solves the coverage problem — you surface both the exact-match candidates and the semantically similar ones. But now you have two ranked lists and need to combine them into a single ranked list to pass to the model. This is the rank fusion problem.
The challenge is that BM25 scores and cosine similarity scores live in completely different numerical ranges. BM25 produces scores that depend on corpus statistics (document length, term frequency, corpus size). Cosine similarity is bounded in [-1, 1]. You cannot add or average them directly in a meaningful way.
Reciprocal Rank Fusion (RRF)
RRF avoids the normalization problem entirely by ignoring raw scores and working only with ranks. The word "reciprocal" means 1/x — the score assigned to a document is the reciprocal of its rank in each list. A document ranked 1st gets a high score (1/61 with k=60); a document ranked 50th gets a low score (1/110). For each candidate chunk, the combined RRF score sums these reciprocals across all ranked lists:
$$\mathrm{RRF\_score}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$
Where:
R: the set of ranked lists (e.g., {BM25 list, dense retrieval list})
rank_r(d): the position of document d in ranked list r (1-indexed)
k: a smoothing constant (typically 60) that prevents a single top rank from dominating. The value 60 was empirically validated as robust across many retrieval tasks in the original paper by Cormack, Clarke & Buettcher (2009). Increasing k makes the formula more conservative: it rewards consistent mid-rank appearances over a single strong rank.
If a document does not appear in a list, its contribution from that list is 0.
The effect of the formula: a document that ranks 1st in the BM25 list and 1st in the dense list will have an RRF score of 1/(60+1) + 1/(60+1) ≈ 0.033. A document that ranks 1st in one list but 100th in the other will score 1/(60+1) + 1/(60+100) ≈ 0.023. Consistently high-ranked documents across multiple sources float to the top. Documents that are strong in only one source get discounted. This is the behavior you want: candidates that both BM25 and semantic search agree on are the most reliably relevant.
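The fusion step is only a few lines of code. This sketch assumes each ranked list holds chunk ids ordered best first, with no id repeated within a list; the function name and the sample ids are illustrative.

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over lists of chunk ids, each ordered best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # A chunk missing from a list simply receives no contribution from it
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused, scores

bm25_list = ["process_refund", "validate_refund_amount", "REFUND_TIMEOUT_SEC"]
dense_list = ["reverse_charge", "process_refund", "cancel_transaction"]
fused, scores = rrf_fuse([bm25_list, dense_list])
print(fused[0], round(scores[fused[0]], 4))
```

`process_refund` wins because it appears near the top of both lists, even though it is not first in the dense list.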
When not to use RRF for code retrieval
RRF treats all ranked lists as equally authoritative. In practice you might know that for a particular query type — say, looking up a config constant by exact name — BM25 should dominate. Some teams implement weighted linear combination of normalized scores (min-max or z-score) instead of RRF, using a lightweight query classifier to set weights. BM25-heavy for exact identifier lookup; dense-heavy for conceptual queries like "how does this service handle retries." This adds real complexity and you should only go there if you have enough query traffic to measure the improvement. RRF with k=60 is a surprisingly robust default.
Reranking: The Final Sorting Pass
After hybrid search and RRF, you have a list of, say, 20 candidate chunks. You want to inject only the top 3–5 into the prompt. The question is whether the RRF ranking is good enough to trust for this final cut, or whether a second, more expensive sorting pass is worth running.
This is where cross-encoder rerankers come in. Neither first-stage method lets the query and the document interact inside a model. Dense embedding search is a bi-encoder: it encodes the query and each document independently, then compares the resulting vectors. This is fast because the document embeddings are precomputed. BM25 is not a neural encoder at all; it scores documents from precomputed term statistics. Either way, the query and document never interact during encoding, so the model cannot see the query while deciding what aspects of the document to emphasize.
A cross-encoder takes both the query and a candidate chunk as a single concatenated input and produces a relevance score. Because both texts go through the model at the same time, the model can attend to relationships between the query and the document that a bi-encoder cannot. The accuracy improvement is often significant — but the cost is that you cannot precompute: the cross-encoder must run on every query-candidate pair at inference time.
The practical solution is a two-stage architecture: use the fast first-stage pipeline (dense + BM25 + RRF) to get the top 20 candidates, then run a cross-encoder on only those 20 to get a final ranking. The cross-encoder sees a small, manageable candidate set and produces a high-quality relevance ordering. The top 5 from this final ranking go into the prompt.
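The second stage can be sketched as a function that takes any pairwise scorer. Here `score_pair` is a hypothetical stand-in for a cross-encoder call, and `toy_overlap_score` is a placeholder used only so the example runs; it is not a cross-encoder and says nothing about real reranker quality.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Second-stage pass: score each (query, chunk) pair jointly, keep the best.

    score_pair(query, chunk_text) -> float stands in for a cross-encoder call.
    """
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def toy_overlap_score(query, text):
    """Placeholder scorer (NOT a cross-encoder): share of query words in the text."""
    words = query.lower().split()
    if not words:
        return 0.0
    return sum(w in text.lower() for w in words) / len(words)

candidates = [
    {"id": "a", "text": "def cancel_transaction(txn): ..."},
    {"id": "b", "text": "def process_refund(txn, amount): ..."},
    {"id": "c", "text": "def render_invoice(order): ..."},
]
top = rerank("process_refund amount", candidates, toy_overlap_score, top_n=2)
print([c["id"] for c in top])
```

Passing the scorer in as a parameter keeps the pipeline testable without a model and makes it easy to swap rerankers when benchmarking.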
Cross-encoders are themselves transformer models with context window limits. A reranker fed a 2,000-token code chunk and a 500-token query is processing a 2,500-token combined input. General-purpose reranker models like ms-marco-MiniLM-L-12-v2 support 512 subword tokens, which is often enough for a single short function, but not for large class bodies or files. For retrieval pipelines that surface large chunks, use a reranker with a larger window: Cohere Rerank 3 supports 4,096 tokens, and Voyage AI's rerank-2 supports 16K. If the combined chunk+query still exceeds the limit, truncate the chunk from the bottom, not the top: the function signature and docstring are more informative for reranking than the implementation tail.
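Truncating from the bottom is a one-line slice once the inputs are tokenized. The sketch below assumes token lists produced by the reranker's own tokenizer, and the `reserved` allowance for separator tokens is a guess that should be checked against the model in use.

```python
def fit_chunk_to_window(chunk_tokens, query_tokens, max_tokens, reserved=3):
    """Trim a chunk from the bottom so query + chunk fits the reranker window.

    `reserved` is an assumed allowance for the special/separator tokens a
    reranker adds around the pair; check the real value for your model.
    """
    budget = max_tokens - len(query_tokens) - reserved
    if budget <= 0:
        return []  # the query alone already fills the window
    return chunk_tokens[:budget]  # keeps the signature and docstring at the top

chunk = list(range(2000))   # stand-in for 2,000 chunk tokens
query = list(range(500))    # stand-in for 500 query tokens
print(len(fit_chunk_to_window(chunk, query, max_tokens=512)))
```

With a 512-token window and a 500-token query almost nothing of the chunk survives, which is the practical argument for a larger-window reranker or shorter queries.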
For general-purpose document retrieval, cross-encoder reranking consistently improves recall@5 by 10–20% over bi-encoder retrieval alone. For code, the benefit depends on how well your embedding model was trained on code. If you are using a strong code-specific embedding model with good AST chunking, the hybrid bi-encoder retrieval alone may be good enough for most queries. Reranking becomes most valuable when queries are ambiguous or when the codebase has many semantically similar functions that need fine-grained disambiguation. It adds latency (typically 50–200ms), so benchmark before committing.
How Cursor Does It: A Reference Architecture
Cursor is the most architecturally transparent commercial code assistant. Its codebase indexing implementation has been publicly described in enough detail to serve as a concrete reference for the pipeline above.
Indexing
When you open a project in Cursor, it begins indexing the codebase in the background. Files are chunked locally on your machine. Chunks are then sent to Cursor's servers, where they are passed through an embedding model — either OpenAI's embedding API or a custom-trained model, depending on the feature context. The resulting vectors are stored in Turbopuffer, Cursor's vector store of choice. Metadata — file path, line ranges, and a hash of the chunk content — is stored alongside each vector. File paths are obfuscated client-side before any data leaves your machine.
Cursor caches embeddings by chunk hash. The second time you index the same codebase (or if most files haven't changed), it skips recomputing embeddings for unchanged chunks, making incremental re-indexing fast.
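A content-addressed cache of this kind can be sketched in a few lines. This is a generic illustration of the technique, not Cursor's implementation: the class name, the choice of SHA-256, the in-memory dictionary, and the stand-in embedder are all assumptions.

```python
import hashlib

class EmbeddingCache:
    """Caches embeddings by a hash of the chunk content."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical callable: text -> list of floats
        self.store = {}           # content hash -> embedding
        self.misses = 0           # how many times the embedder was actually called

    @staticmethod
    def chunk_hash(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self.chunk_hash(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Stand-in embedder so the example runs without a model
cache = EmbeddingCache(lambda text: [float(len(text))])
for chunk in ["def a(): pass", "def b(): pass", "def a(): pass"]:
    cache.get(chunk)
print(cache.misses)
```

Because the key is derived from content rather than file path, moving or renaming a file does not force its chunks to be re-embedded.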
Query signal construction
The query is not just "what the developer typed." Cursor monitors the active cursor position and constructs a composite signal from several sources: the current file's surrounding code at the cursor position, any open editor tabs (which it weights as likely-related files), and recent edit history. This composite signal is embedded into a query vector.
Retrieval and injection
The query vector is sent to Turbopuffer, which performs ANN search, returning the top-k most similar chunk vectors. Cursor's client receives the result with obfuscated file paths and line ranges, then reads the actual code from the local filesystem. The retrieved chunks are injected into the prompt sent to the LLM. The model never directly touches the vector database — it only sees the retrieved text.
When you type @Codebase in Cursor's chat, you're explicitly triggering the full RAG pipeline: embed the query, search the codebase index, retrieve top chunks, inject into prompt. Without @Codebase, Cursor uses a lighter heuristic — open tabs, recent edits, file imports — to construct context. The @Codebase symbol is the manual override that triggers a full retrieval pass over the whole indexed codebase rather than just what's visible in your editor. Two other retrieval sources extend the same pipeline beyond local code: @Docs (searches indexed documentation) and @Web (live web search).
One important architectural note: the embedding model used to index the codebase is separate from the generative model used to produce completions. Cursor uses a lightweight, fast embedding model for indexing (optimized for latency and throughput over millions of chunks) and a larger, slower generative model for the actual completion. When building a similar system, these two components have independent optimization concerns — do not assume the same model serves both roles.
GitHub Copilot's context window and retrieval approach
GitHub Copilot's context construction is less publicly documented but follows a similar pattern. For inline completion, it uses the current file content around the cursor (the prefix and suffix of the current file) plus a Jaccard similarity heuristic to find other open tabs that share significant token overlap with the current file. The @workspace symbol in VS Code triggers a more thorough indexing-based search of the workspace, analogous to Cursor's @Codebase.
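Jaccard similarity itself is simple: the size of the intersection of two token sets divided by the size of their union. The sketch below illustrates the measure on identifier tokens and ranks open tabs by it; it is not Copilot's code, and the tokenizer and function names are assumptions.

```python
import re

def token_set(text):
    """Set of identifier-like tokens in a piece of source text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def jaccard(a, b):
    """|A & B| / |A | B| over identifier tokens; 0.0 when both are empty."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rank_open_tabs(current_file, open_tabs):
    """open_tabs: dict of path -> text. Most similar tab first."""
    return sorted(open_tabs, key=lambda p: jaccard(current_file, open_tabs[p]),
                  reverse=True)

tabs = {"x.py": "invoice render", "y.py": "refund txn ledger"}
print(rank_open_tabs("refund amount txn", tabs))
```

No embedding model or index is involved, which is why a heuristic like this fits a low-latency inline completion path.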
The key distinction: Copilot's default inline completion mode is a fast, low-latency path that does not run full vector retrieval on every keystroke. Full retrieval is reserved for explicit chat interactions. This is a deliberate latency-accuracy tradeoff — inline completion needs to respond in under 100ms to feel responsive; full RAG retrieval adds 200–500ms of overhead.
Tradeoffs and Limits of Code RAG
RAG breaks in specific, predictable ways. If you're building a production system, you'll hit these eventually. Here's what to expect.
| Scenario | RAG Behavior | Mitigation |
|---|---|---|
| Cross-file dependency reasoning | Each retrieved chunk is a fragment. The model may not understand how three retrieved functions compose at the call site. | Include file path + line range metadata; retrieve parent class or module-level imports alongside function bodies. |
| Newly created files not yet indexed | Files added after indexing are invisible to retrieval until the index is rebuilt. | Incremental indexing on file-save events. Maintain a "pending index" queue that is flushed every N seconds. |
| Query is too vague | A query like "fix the bug" produces a query vector close to everything and nothing. Retrieved chunks will be generic. | Prompt the user to specify the symbol or file. Use the cursor position + surrounding error message as the primary query signal. |
| Minified or generated code | Minified JS, protobuf generated code, and lock files produce chunks with very low semantic density. They pollute the index and retrieve irrelevant noise. | Maintain a .gitignore-style ignore list for the RAG indexer. Exclude node_modules, *.pb.go, build directories. |
| Very large monorepos | Indexing millions of files takes significant time and storage. Retrieval recall degrades on very large indices unless the ANN index structure is tuned. | Scope the index to the subdirectory the developer is currently working in, or use per-service sub-indices with routing logic. |
| Schema / type changes | If a type changes but the embedding was computed from the old version, retrieved chunks may give the model an outdated type signature. | Invalidate embeddings on file write. Use chunk content hash to detect staleness and trigger recompute. |
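The content-hash mitigation in the last row can be sketched in a few lines. The chunk IDs, the `find_stale_chunks` name, and the choice of SHA-256 are assumptions made for illustration; any stable hash stored next to each embedding would serve the same purpose.

```python
import hashlib

def content_hash(chunk_text: str) -> str:
    """Stable fingerprint of a chunk's text, stored alongside its embedding."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

def find_stale_chunks(indexed: dict[str, str],
                      current_chunks: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with the chunks parsed from the files on disk.

    indexed:        chunk_id -> hash recorded when the embedding was computed
    current_chunks: chunk_id -> chunk text as it exists right now
    """
    changed = [cid for cid, text in current_chunks.items()
               if cid in indexed and indexed[cid] != content_hash(text)]
    added = [cid for cid in current_chunks if cid not in indexed]
    removed = [cid for cid in indexed if cid not in current_chunks]
    return {"re_embed": sorted(changed + added), "delete": sorted(removed)}
```

Running this on each file-save event means only the chunks whose text actually changed are re-embedded, which keeps incremental indexing cheap.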
Does a larger context window make RAG for code obsolete?
As context windows grow to 1M and beyond — Llama 4 Scout hit 10M tokens in 2025, Gemini 1.5 Pro supported 1M — people keep asking whether RAG becomes irrelevant at some point. Worth addressing directly.
The practical answer is no, though the reasoning matters. A 200,000-line Python codebase easily exceeds 2 million tokens (roughly 10 tokens per line), and most production monorepos are far larger. Context windows are growing, but so are codebases, and the two are not converging. More importantly, the attention quality degradation described in Sections 2 and 3 doesn't disappear with a larger nominal window: a 2M-token context window doesn't deliver 2M tokens of equal-quality attention. Long-context models achieve their range through techniques like sparse attention patterns and NTK-aware RoPE scaling, which help with extrapolation but don't eliminate position bias at extremely long ranges. And practically, a 1M-token prompt is expensive and slow even on state-of-the-art hardware. For interactive code assistance that needs to respond within a second, stuffing the full codebase is off the table regardless of window size.
Large context windows and RAG do different jobs. RAG decides what deserves to be in the context window. The context window determines how much you can fit once you've been selective. A well-tuned system retrieves the right 5,000 tokens from a 10M-token codebase and puts them in a 128K window with room left for conversation history and tool outputs. That's the correct framing.
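One way to picture "being selective" is a token budget applied to ranked retrieval results. The sketch below is a simple greedy packer; the 5,000-token default mirrors the figure above, while the 4-characters-per-token estimate and both function names are rough assumptions. A real system would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)

def pack_chunks(ranked_chunks: list[dict], budget_tokens: int = 5000) -> list[dict]:
    """Greedily keep the best-ranked chunks that fit inside the token budget.

    `ranked_chunks` is assumed to be sorted best-first; each item has a "text" key.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk["text"])
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += cost
    return selected
```

The budget is a property of the retrieval layer, not of the model: it stays the same whether the window behind it is 128K or 10M tokens.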
Building Your Own Code RAG Pipeline
If you are building a coding assistant, an internal developer tool, or a code-aware agent, here is a reference stack that covers the key decisions.
Parsing and chunking
Use tree-sitter with the appropriate grammar for each language. The tree-sitter-python, tree-sitter-typescript, and tree-sitter-go grammars are mature and battle-tested. Extract function definitions (function_definition node type in Python) and class definitions as primary chunk types. For functions longer than your embedding model's context limit, split at the statement level within the function body.
# Sketch: AST chunk extraction with tree-sitter.
# Written against the py-tree-sitter 0.22+ style API; older releases
# construct the parser with Parser() followed by parser.set_language(...).
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

TARGET_TYPES = {"function_definition", "class_definition"}

def walk_tree(node, source: bytes, file_path: str, chunks: list):
    """Recursively walk the AST, collecting classes and functions as chunks.

    Methods inside classes become their own chunks. Helpers nested inside
    a function stay part of the enclosing function's chunk."""
    if node.type in TARGET_TYPES:
        # tree-sitter reports byte offsets, so slice the encoded source,
        # not the str (they differ as soon as the file has non-ASCII text).
        chunk_text = source[node.start_byte:node.end_byte].decode("utf-8")
        chunks.append({
            "text": chunk_text,
            "file": file_path,
            "start_line": node.start_point[0] + 1,  # rows are 0-based
            "end_line": node.end_point[0] + 1,
            "type": node.type,
        })
        # For function_definition, stop: we want the whole function,
        # not its nested helpers as separate chunks.
        if node.type == "function_definition":
            return
    # For class_definition and all other nodes, keep descending.
    for child in node.children:
        walk_tree(child, source, file_path, chunks)

def extract_chunks(source_code: str, file_path: str) -> list[dict]:
    source = source_code.encode("utf-8")
    tree = parser.parse(source)
    chunks = []
    walk_tree(tree.root_node, source, file_path, chunks)
    return chunks
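The statement-level split for oversized functions is not shown in the listing above. Here is a minimal sketch of the idea. It swaps in Python's built-in `ast` module instead of tree-sitter so that it runs without extra dependencies, and it measures size in lines rather than tokens; both are simplifications. The function name and the choice to repeat the signature in every piece are assumptions, and the input is assumed to be a top-level (unindented) function.

```python
import ast

def split_function_by_statements(func_source: str, max_lines: int) -> list[str]:
    """Split one function into pieces, cutting only between top-level
    statements of its body. Each piece repeats the function header so it
    still reads (and parses) as a function on its own.

    `max_lines` limits body lines per piece; a single statement longer
    than the limit is kept whole rather than cut mid-statement.
    """
    func = ast.parse(func_source).body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected a single function definition")

    lines = func_source.splitlines()
    prev_end = func.body[0].lineno - 1      # index of the first body line
    header = lines[:prev_end]               # decorators + signature

    pieces, current = [], []
    for stmt in func.body:
        # Slice from the end of the previous statement so comments and
        # blank lines between statements travel with the next statement.
        stmt_lines = lines[prev_end:stmt.end_lineno]
        prev_end = stmt.end_lineno
        if current and len(current) + len(stmt_lines) > max_lines:
            pieces.append("\n".join(header + current))
            current = []
        current.extend(stmt_lines)
    if current:
        pieces.append("\n".join(header + current))
    return pieces
```

Repeating the header costs a few tokens per piece, but it keeps the function name and parameter types attached to every fragment, which is usually what the embedding needs to place the fragment correctly.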
Embedding model selection
For a production code assistant, the embedding model choice matters significantly. Three strong options:
| Model | Context window | Strengths | When to use |
|---|---|---|---|
| voyage-code-3 | 32K tokens | Purpose-built for code; top-ranked on code retrieval benchmarks (2025) | Production code assistant, maximum retrieval quality |
| text-embedding-3-large | 8K tokens | Strong general performance; well-supported; large community | Mixed code + documentation retrieval; existing OpenAI integrations |
| nomic-embed-code | 8K tokens | Open-weight; can run locally; no API cost | Air-gapped environments; cost-sensitive deployments; on-prem |
Vector store
For a single-developer or small-team tool: pgvector in a local Postgres instance is often sufficient. For a service handling multiple users: Qdrant is a strong choice — it supports both dense and sparse vectors in a single collection, enabling native hybrid search without maintaining two separate stores.
BM25 index
For pure BM25, Tantivy (a Rust-based full-text search library) or Elasticsearch's built-in BM25 scoring are the standard options. For a simpler deployment, the Python rank_bm25 library is adequate for corpora under ~50,000 chunks.
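For readers who want to see what a BM25 index actually computes, here is a dependency-free sketch of Okapi BM25 scoring. It uses the common non-negative IDF variant with `k1 = 1.5` and `b = 0.75` as assumed defaults, and a simple lowercase tokenizer that keeps underscores inside identifiers. It is not a reproduction of rank_bm25 or Tantivy internals, and the class and method names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-identifier characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.term_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        total_len = sum(len(tokens) for tokens in self.doc_tokens)
        self.avgdl = total_len / len(docs) if total_len else 1.0
        # Document frequency: in how many chunks does each term appear?
        df = Counter(term for tf in self.term_freqs for term in tf)
        n = len(docs)
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0)
                    for t, c in df.items()}

    def score(self, query: str, index: int) -> float:
        """BM25 score of one chunk for the query."""
        tf = self.term_freqs[index]
        length_norm = self.k1 * (
            1 - self.b + self.b * len(self.doc_tokens[index]) / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + length_norm)
            for t in tokenize(query) if t in tf)

    def top_k(self, query: str, k: int = 3) -> list[tuple[int, float]]:
        """Return (chunk_index, score) pairs, best first."""
        scores = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

The exact-match behavior is the point: a query for a rare identifier such as `PaymentError` scores only the chunks that contain that token, which is the case dense embeddings tend to blur.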
The prompt injection template
Placement and formatting of retrieved context matter. A template that works well in practice:
You are a coding assistant for this codebase.
## Relevant context from the codebase:
### [payments/gateway.py · lines 42–87]
```python
{chunk_1_text}
```
### [payments/exceptions.py · lines 1–24]
```python
{chunk_2_text}
```
### [payments/models.py · lines 88–112]
```python
{chunk_3_text}
```
## Current task:
{user_request}
Including file path and line numbers in each chunk header gives the model two useful signals: the module structure of the codebase, and the ability to reference specific locations in its response. These headers cost very few tokens but significantly improve the quality of generated code that needs to import from or reference the retrieved files.
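Filling the template programmatically is straightforward. The sketch below assumes each retrieved chunk is a dict with `file`, `start_line`, `end_line`, and `text` keys (the same shape produced by the chunk extractor earlier); `render_prompt` and its `lang` parameter are hypothetical names.

```python
def render_prompt(chunks: list[dict], user_request: str, lang: str = "python") -> str:
    """Render retrieved chunks and the user's request into the template above."""
    fence = "`" * 3  # built at runtime so this example can itself sit in a fence
    parts = [
        "You are a coding assistant for this codebase.",
        "",
        "## Relevant context from the codebase:",
        "",
    ]
    for chunk in chunks:
        # Header carries the file path and line range for each chunk.
        parts.append(
            f"### [{chunk['file']} · lines {chunk['start_line']}–{chunk['end_line']}]")
        parts.append(f"{fence}{lang}")
        parts.append(chunk["text"])
        parts.append(fence)
        parts.append("")
    parts += ["## Current task:", user_request]
    return "\n".join(parts)
```

A chunk whose text already contains a triple-backtick fence would break this layout; a production version would need to detect that and use a longer fence.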
It is tempting to inject 10–15 retrieved chunks to "give the model more information." Resist this. Each additional chunk increases the context size (paying the quadratic cost discussed in Section 4), increases the probability that the model attends to irrelevant material, and reduces the proportion of the context that is highly relevant. In practice, 3–5 high-quality chunks typically outperform 15 lower-quality ones. Invest in retrieval quality, not retrieval quantity.
The Through-Line
The surprising thing about attention dilution is that it isn't a bug you can patch. It's a structural property of softmax normalization — the total attention weight sums to 1.0 regardless of sequence length, so every token you add is competing with every other for a share of that budget. More context doesn't mean more understanding; it means each fact gets a smaller slice. The lost-in-the-middle position bias makes it worse: code injected into the middle of a long prompt is structurally disadvantaged by both RoPE's distance decay and the recency bias that causal pretraining instills. Knowing this changes how you think about the whole problem.
RAG doesn't solve attention dilution — it sidesteps it. Instead of sending everything and hoping the model finds what's relevant, it figures out what's relevant first and sends only that. The context window ends up containing what actually matters for the task: the right type definitions, the right helper functions, the right error handling patterns. The model has a real shot at using them.
In practice: below roughly 3,000–5,000 lines, context stuffing usually works well enough. You can fit the most relevant files in a 32K window and the model finds what it needs. Above that, the problems stack up fast. At 50,000+ lines, naive stuffing reliably hurts. At 500,000+ lines, it breaks down completely — the most relevant file is just one voice among hundreds. At that scale, AST chunking, hybrid BM25 + dense retrieval, RRF fusion, and careful prompt injection aren't premature optimization. They're the baseline.
Code RAG retrieves relevant chunks within a session. But when a session ends, the AI's working understanding of the codebase (which files matter, which edge cases were found, what is still missing) is gone, and the next session rediscovers it from scratch. A two-phase benchmark run against the Apache Camel codebase (5,856 files, unfamiliar enterprise Java) demonstrated this concretely: the vanilla agent spent 51 tool calls re-exploring in Phase 2 and produced 0 bytes of output on one task. The same task with cross-session memory (structured notes stored during Phase 1 research and recalled at the start of Phase 2) produced a complete 5-file implementation at 58% lower Phase 2 cost. Code RAG and working memory are complementary: RAG retrieves the right code at query time; memory preserves what was learned across the session boundary.
References & Further Reading
- Attention Is All You Need — Vaswani et al., 2017.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023.
- Found in the Middle: Calibrating Positional Attention Bias — He et al., 2024.
- Retrieval-Augmented Code Generation: A Survey — 2025.
- RAG vs Large Context Window: Real Trade-offs for AI Apps — Redis Engineering Blog.
- Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents — Shaped AI.
- RAG for LLM Code Generation using AST-Based Chunking — Vishnudhat Natarajan, Medium.
- Better Retrieval Beats Better Models for Large Codebases — Stéphane Derosiaux.
- How Cursor Actually Indexes Your Codebase — Towards Data Science.
- How GitHub Copilot Works — Quastor Engineering.
- What is Retrieval-Augmented Generation? — GitHub Blog.
- Why Cursor, Claude Code, and Devin Use grep, Not Vectors — MindStudio.
- BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production — Ranjan Kumar.
- Hybrid Search: BM25 and Dense Retrieval Combined — Michael Brenndoerfer.