Why AI Code Assistants Waste Your Context Window — and How RAG Fixes It
A large context window does not mean a better context window. The math of attention explains why stuffing your whole codebase into a prompt is quietly destroying suggestion quality — and how retrieval-augmented generation fixes it by sending only the right code.
Open a large file in your AI code assistant and ask it to refactor a function buried three hundred lines down. Watch it confidently produce something plausible but wrong — using an interface that was deprecated last sprint, calling a helper that doesn't exist in this service, ignoring a constraint that was clearly described in the module-level docstring. The model did not forget. It never really attended to that information in the first place.
There is a common intuition that more context is always better. If the model can see the whole file, surely it will give better suggestions. If it can see the whole codebase, even better. This intuition is wrong in a specific and measurable way. The mechanism is called attention dilution, and understanding it changes how you think about every AI coding tool you use.
This post explains precisely why the naive approach fails — walking through the transformer attention math — and then explains how RAG solves the problem by ensuring that the tokens occupying your context window are the ones that actually matter.
The Naive Approach: Just Send Everything
The first instinct when building a code assistant is to send as much context as possible. Your project has a utility module? Include it. There's a shared type definitions file? Include that too. If the model's context window is 128,000 tokens, fill it to the brim — more information has to be better, right?
This is called context window stuffing, and it has three distinct failure modes. Each one gets worse as the codebase grows. To understand why, you first need a precise model of how a transformer reads a prompt.
What the model actually receives
A transformer does not read a prompt sequentially, the way a human reads a page from left to right. Instead, it processes all tokens simultaneously, and every token attends to every other token in the sequence. The attention mechanism is the machine that computes how much each token should "look at" every other token when forming its representation.
The output of attention for a single token is a weighted average of all the other tokens' value vectors. The weights are computed by comparing the current token's query vector against every other token's key vector. When you add more tokens to the context, you are not adding more information to a receptive mind — you are adding more competitors for a fixed budget of attention weight.
Imagine you are in a room full of people, all talking at once. You can only pay 100 percent of your attention total — it does not grow with the number of people. With 5 people in the room, each gets roughly 20% of your focus. With 500, each gets 0.2%. When the relevant person finally says something, their share of your attention has collapsed to noise. That is what happens to code buried in a long prompt.
Why Attention Dilutes: The Math
The attention mechanism was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). Its core computation is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$
Where:
Q: the query matrix (what each token is "asking for")
K: the key matrix (what each token "offers" for comparison)
V: the value matrix (the actual content passed forward if selected)
$d_k$: the dimension of the key vectors (dividing by its square root keeps dot products from growing extreme)
softmax: converts a vector of raw scores into a probability distribution that sums to 1
The notation $QK^{\top}$ means: for each token, compute a dot product between its query vector and every token's key vector. The dot product is large when two vectors point in the same direction (high relevance between the pair), and near zero when they are orthogonal (unrelated). Multiplying by the transposed key matrix $K^{\top}$ does all N×N such comparisons in a single matrix operation. The result is a matrix of raw relevance scores. Dividing by $\sqrt{d_k}$ prevents those scores from becoming so large that softmax saturates (all weight on one token).
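The formula can be traced by hand on a tiny input. The following is a minimal pure-Python sketch, not how any production model implements attention: it omits multiple heads, learned projections, and causal masking, and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Returns one output vector per query row, plus the weight rows.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One row of QK^T / sqrt(d_k): this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: one query, three keys. The query aligns with the first key,
# is orthogonal to the second, and opposes the third.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
outputs, weights = attention(Q, K, V)
print([round(w, 3) for w in weights[0]])
```

The printed weights sum to 1 and are ordered by how well each key aligns with the query, which is the property the rest of the argument builds on.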
The softmax step is the dilution mechanism. Because softmax always outputs a probability distribution — all values sum to exactly 1 — attention weights are a zero-sum resource. When there are N tokens in the context, the average attention weight is 1/N, regardless of what any individual token does. The total budget is fixed at 1.0.
This does not mean every token gets exactly equal attention — the model can still concentrate on a small subset of tokens if the dot-product scores make those tokens stand out sharply. But the key word is if. In a real codebase, most of the context is irrelevant to any particular completion request. Hundreds of unrelated function definitions produce hundreds of tokens with moderately non-zero dot products. These tokens collectively consume the majority of the softmax budget even though each individual one has low relevance. The useful signal — the few functions that actually matter — must compete against this crowd, and its share of the attention budget shrinks as N grows. Signals that were reliably attended to at N=2,000 tokens can fall below effective threshold at N=50,000 tokens.
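To make the dilution concrete, here is a small sketch of how one relevant token's softmax share changes as distractors are added. The scores (4.0 for the relevant token, 1.0 for each distractor) are illustrative assumptions, not measurements from any model.

```python
import math

def relevant_share(n_tokens, relevant_score=4.0, distractor_score=1.0):
    """Softmax weight of one relevant token among n_tokens - 1 distractors.

    Both scores are made-up illustrative values.
    """
    relevant = math.exp(relevant_score)
    distractors = (n_tokens - 1) * math.exp(distractor_score)
    return relevant / (relevant + distractors)

for n in (100, 2_000, 50_000):
    print(f"N={n:>6}: relevant token weight = {relevant_share(n):.5f}")
```

Even though the relevant token's raw score never changes, its share of the budget falls roughly in proportion to 1/N once the distractors dominate the denominator.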
Softmax sharpening and temperature
The scaling factor √dk in the formula is there specifically to control how "peaky" the attention distribution is. Without it, dot products in high-dimensional spaces get very large, pushing softmax into saturation — where almost all weight goes to the single highest-scoring token. The scaling keeps the distribution smooth enough for gradients to flow during training. But the same property means that with a very long sequence, the model is doing softmax over hundreds of thousands of values. The distribution becomes nearly uniform across irrelevant tokens, and the useful signal struggles to dominate.
The context window limit is not just a practical engineering constraint — it reflects a genuine quality degradation. The problem is not that the model cannot read long inputs. It is that as context grows, every individual piece of information receives proportionally less attention weight. More input does not mean more comprehension; it means each fact competes harder for finite attentional resources.
Lost in the Middle: Position Bias
Attention dilution is one problem. A second, independent problem compounds it: position bias. Modern language models do not attend to all positions in their context with equal reliability. They preferentially attend to tokens at the beginning and end of the sequence, and perform significantly worse on information placed in the middle.
This phenomenon was studied in a 2023 paper by Nelson Liu et al. titled Lost in the Middle: How Language Models Use Long Contexts. The researchers tested models on multi-document question answering, varying the position of the document containing the answer. The results were sharp: when the answer document was at position 1 (the very beginning) or the last position, accuracy was high. When it sat in the middle of a 20-document context, accuracy dropped substantially, by more than 20 percentage points for GPT-3.5-Turbo, even though the information was technically within the model's context window.
Think of a book report written by someone who only reads the first chapter and the last chapter, and skims the middle. Everything in between is "in the book" — but it functionally does not influence the report. That is what happens to code you inject into the middle of a long prompt. It exists in the context. It is not reliably attended to.
Why does this happen architecturally?
Part of the cause is rooted in RoPE (Rotary Position Embeddings), the positional encoding scheme used in most modern language models. RoPE works by rotating the query and key vectors by angles that are proportional to their positions in the sequence. The key consequence: the dot product between a query at position m and a key at position n depends on their relative distance (m−n). As that distance grows, the rotational difference tends to introduce a decay in the inner product, so semantically relevant tokens at distant positions must overcome this rotational penalty to receive high attention. That favors recent tokens, and the model also learns, through training, to rely heavily on the most recent tokens for predicting the next one. Tokens at the very beginning benefit from a different effect: under causal masking they are visible to every later token, and trained models tend to assign them disproportionate attention regardless of content. The result: early tokens and the last few tokens both get reliable attention, while the middle of a long context has neither advantage.
A 2024 paper from the University of Washington, MIT, and Google (Found in the Middle) demonstrated that this bias can be partially corrected by calibrating attention weights at inference time — but this requires modifying the model's internals, something you cannot do when calling an API.
For practical purposes, if you are a user or a developer calling a commercial API, the position bias is a fixed environmental constraint. The implication is direct: you cannot mitigate it by stuffing more context. You can only mitigate it by keeping the context short and placing the relevant content where attention is reliable, near the beginning or the end, which means retrieving it selectively rather than including everything and hoping.
Many teams inject retrieved chunks at the end of the prompt, after a long system prompt and conversation history, reasoning that "the model will see it just before generating." This lands retrieved content in a position that gets the worst of both worlds: far from the beginning (losing the primacy advantage) and not at the very end (which is reserved for the generation target itself). The safest placement for retrieved code context is immediately before the user's specific question, near the end but not buried in the middle of a long history.
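The placement advice above can be sketched as a prompt builder. This is a minimal illustration under assumed conventions: the `build_prompt` name, the `role: text` layout, and the chunk dictionary keys are all hypothetical, not any tool's actual prompt format.

```python
def build_prompt(system_prompt, history, retrieved_chunks, question):
    """Assemble a prompt with retrieved code directly before the question.

    history: list of (role, text) tuples, oldest first.
    retrieved_chunks: list of dicts with 'file', 'start_line', 'text'
    (hypothetical keys chosen for this sketch).
    """
    parts = [system_prompt]
    parts += [f"{role}: {text}" for role, text in history]
    for chunk in retrieved_chunks:
        header = f"# {chunk['file']} (line {chunk['start_line']})"
        parts.append(f"{header}\n{chunk['text']}")
    # The question comes last, so the retrieved code sits right above it.
    parts.append(f"user: {question}")
    return "\n\n".join(parts)

history = [
    ("user", "What does the payments service do?"),
    ("assistant", "It wraps the payment provider API."),
]
chunks = [{
    "file": "payments/gateway.py",
    "start_line": 42,
    "text": "def process_refund(order_id):\n    ...",
}]
prompt = build_prompt(
    "You are a code assistant.",
    history,
    chunks,
    "Why does process_refund fail for partial refunds?",
)
print(prompt)
```

Keeping the history short, or summarizing it, matters as much as the ordering: the longer the material above the chunks, the further they drift from both reliable positions.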
The Quadratic Cost Problem
Even if you were willing to accept degraded attention quality, there is a third reason not to stuff context: the compute cost of attention scales quadratically with sequence length.
To compute the full attention matrix, the model must compare every token's query against every other token's key. If your sequence has N tokens, this requires N × N comparisons — producing an N² attention matrix. Doubling the context length quadruples the compute required for attention. This is not a limitation of current hardware; it is a mathematical property of the full-attention transformer architecture.
Time complexity of full self-attention: O(N²·d)
Where N = sequence length, d = model dimension.
Concretely: a 4× increase in context length → 16× increase in attention compute. A 10× increase → 100×. This is why inference latency grows non-linearly as context length grows.
FlashAttention (Dao et al., 2022) improves the memory profile to O(N) by tiling — it never writes the full N×N matrix to GPU DRAM. But the number of floating-point operations is still O(N²). Latency and cost still scale quadratically with sequence length.
In production, this means that a code assistant filling 100,000 tokens of context is not just 10× slower than one filling 10,000 tokens — it is closer to 100× more expensive in attention compute alone (before accounting for the rest of the forward pass). It is also 10× more expensive in the API pricing sense, since most APIs charge per input token. You are paying more to get worse results.
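The arithmetic can be checked with a rough FLOP estimate. This sketch counts only the two N×N×d matrix products inside attention (scores, then the weighted sum of values) and ignores projections, feed-forward layers, and multiple heads; the default `d_model=4096` is an arbitrary assumption that cancels out in the ratio.

```python
def attention_flops(n_tokens, d_model):
    """Rough FLOPs for the score and weighted-sum steps of one attention layer.

    Two N x N x d matrix products, counted at 2 FLOPs per multiply-add.
    """
    return 2 * 2 * n_tokens * n_tokens * d_model

def relative_attention_cost(n_small, n_large, d_model=4096):
    """How many times more attention compute the longer context needs."""
    return attention_flops(n_large, d_model) / attention_flops(n_small, d_model)

def relative_token_cost(n_small, n_large):
    """Per-token API pricing scales linearly with input length."""
    return n_large / n_small

print(relative_attention_cost(10_000, 100_000))  # quadratic in length
print(relative_token_cost(10_000, 100_000))      # linear in length
```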
RAG at a Glance: The Core Idea
Retrieval-Augmented Generation reframes the problem. Instead of asking "how can we give the model the whole codebase?", it asks "how do we figure out which parts of the codebase are relevant to this specific completion request, and send only those?"
The answer has two phases. First, an offline indexing phase where the codebase is processed, divided into chunks, and each chunk is converted into a vector representation (an embedding) that captures its semantic meaning. These vectors are stored in an index optimized for fast similarity search. Second, an online retrieval phase that happens at query time: the developer's current context (cursor position, open file, recent changes) is converted into a query vector, and the most similar chunks from the index are retrieved and injected into the prompt.
The model then receives a context window that is not a random cross-section of the codebase — it is the small set of pieces most likely to be relevant to the task at hand. This keeps the context window small, focused, and full of signal rather than noise.
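The two phases can be sketched end to end in a few lines. The `toy_embed` function below is a deliberately crude stand-in (hashed bag-of-words) for a real embedding model, and the linear scan stands in for a real vector index; both are assumptions made so the example runs with only the standard library.

```python
import math
import re
import zlib

DIM = 1024  # toy embedding size (assumption)

def toy_embed(text):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(chunks):
    """Offline phase: embed every chunk once, store (vector, chunk) pairs."""
    return [(toy_embed(chunk["text"]), chunk) for chunk in chunks]

def retrieve(index, query, top_k=2):
    """Online phase: embed the query, return the most similar chunks."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

chunks = [
    {"file": "payments.py", "text": "def process_refund(order_id): refund the payment for an order"},
    {"file": "auth.py", "text": "def login(user, password): check the password hash"},
    {"file": "retry.py", "text": "def backoff_retry(fn): retry with exponential delay"},
]
index = build_index(chunks)
top = retrieve(index, "refund a payment", top_k=1)
print(top[0]["file"])
```

Only the chunks returned by `retrieve` would be placed in the prompt; the rest of the codebase never reaches the model.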
Each of these steps has engineering depth. The three that most affect output quality — and that most implementations get wrong — are chunking, retrieval strategy, and prompt injection order. The next sections cover each in detail.
Chunking for Code: Why Fixed-Size Fails
The purpose of chunking is to divide the codebase into units that can be individually embedded and retrieved. The naive approach is fixed-size chunking: split every file every 256 tokens, regardless of where that lands in the code structure. It is easy to implement. It is almost always wrong for code.
Code has structure that text does not. A function is a unit of meaning. A class definition with its methods is a unit of meaning. Splitting in the middle of a function produces two chunks, each of which is semantically incomplete — one has the function signature and the early logic, the other has the return path. When embedded, neither chunk represents the function accurately. When retrieved, neither chunk gives the model what it needs to understand the function's contract.
Fixed-size chunking failure mode
Consider a Python function that is 80 lines long. With a 50-token chunk size, it gets split into chunks that might look like:
    # chunk 1 (boundary placement is illustrative)
    def process_payment(order_id, amount,
                        currency="USD"):
        """Process a payment..."""
        conn = get_db_connection()

    # chunk 2
        try:
            txn = conn.begin_transaction(
                order_id=order_id,
                amount=amount,
                currency=currency
            )

    # chunk 3
        except DatabaseError as e:
            log_error(e)
            raise PaymentError(str(e))
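One way to see the damage is to ask whether each chunk is still valid Python on its own. The sketch below splits by a fixed number of lines rather than tokens (a simplification), and the sample function is a made-up variant of the one above; `ast.parse` only checks syntax, so the undefined names do not matter.

```python
import ast

SOURCE = '''def process_payment(order_id, amount, currency="USD"):
    """Process a payment."""
    conn = get_db_connection()
    try:
        txn = conn.begin_transaction(
            order_id=order_id,
            amount=amount,
            currency=currency,
        )
    except DatabaseError as e:
        log_error(e)
        raise PaymentError(str(e))
    return txn
'''

def fixed_size_chunks(text, lines_per_chunk):
    """Naive splitter: cut every lines_per_chunk lines, ignoring syntax."""
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

def parses(code):
    """True if the text is syntactically complete Python on its own."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

chunks = fixed_size_chunks(SOURCE, lines_per_chunk=4)
print([parses(c) for c in chunks])  # per-chunk result
print(parses(SOURCE))               # the whole function
```

The whole function parses, while none of the four-line pieces do: each one opens or closes a construct that lives in a neighboring chunk.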
AST-based chunking
AST-based chunking uses a language parser — specifically tree-sitter — to parse each file and extract logical units. Instead of asking "where does the 256th token fall?", it asks "where are the function boundaries, class boundaries, and docstrings in this file?"
Each extracted unit becomes a chunk. A function definition — signature, body, docstring, and any inline comments — stays together. The chunk's metadata stores its file path, start line, end line, and the type of unit (function, class, method, module-level code). This metadata is as important as the chunk text itself: it tells the retrieval system where in the codebase this chunk lives, enabling it to surface related chunks from the same file.
    # chunk 1: function
    def process_payment(order_id, amount, currency="USD"):
        """Process a payment for the given order.
        Raises PaymentError on failure."""
        ...
        # complete function body

    # chunk 2: class
    class PaymentGateway:
        """Manages payment provider connections.
        Thread-safe singleton."""
        _instance = None
        ...

    # chunk 3: module-level imports
    from payments.models import Order, Transaction
    from payments.exceptions import PaymentError
    from db import get_db_connection
AST chunking can produce variable-size chunks. A 10-line function and a 200-line class body are both valid chunks under this scheme. For very large functions or classes, a secondary split is sometimes applied — either at method boundaries within the class, or using a maximum token limit with the caveat that the split occurs at a statement boundary, never mid-expression. The right upper bound depends on the embedding model's context window (typically 512–8,192 tokens) and the target chunk granularity.
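The secondary split at statement boundaries can be sketched with Python's built-in `ast` module, used here in place of tree-sitter so the example needs no extra packages. It assumes the input is the source of exactly one undecorated function starting on line 1, and it measures size in lines rather than tokens; a single statement longer than the budget still becomes one oversized piece.

```python
import ast

def split_at_statements(source, max_lines):
    """Split one function into pieces of at most max_lines lines,
    cutting only between top-level statements of its body."""
    func = ast.parse(source).body[0]
    lines = source.splitlines()
    pieces = []
    piece_start = 1                      # first line of the current piece (1-indexed)
    piece_end = func.body[0].lineno - 1  # the signature belongs to the first piece
    for stmt in func.body:
        too_long = stmt.end_lineno - piece_start + 1 > max_lines
        if too_long and piece_end >= piece_start:
            # Flush what we have; the statement opens a new piece.
            pieces.append("\n".join(lines[piece_start - 1:piece_end]))
            piece_start = piece_end + 1
        piece_end = stmt.end_lineno
    pieces.append("\n".join(lines[piece_start - 1:piece_end]))
    return pieces

FUNC = '''def handler(event):
    user = load_user(event)
    if user is None:
        raise ValueError(
            "unknown user"
        )
    audit(user)
    return render(user)
'''

pieces = split_at_statements(FUNC, max_lines=4)
for i, piece in enumerate(pieces, start=1):
    print(f"--- piece {i} ---\n{piece}")
```

The multi-line `if` block stays in one piece because the cut points are statement ends, never positions inside an expression.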
Sliding-window augmentation
One practical addition: after AST chunking, each chunk is sometimes augmented with a small surrounding context — the preceding import block, the class it belongs to, or the file's module-level docstring. This gives the embedding model enough context to produce a vector that reflects not just the chunk's local meaning but its role in the larger structure. The key is that this surrounding context is used only for embedding, not retrieved as part of the chunk text. The retrieved text remains the original chunk.
Sliding window overlap (copying N tokens from one chunk as the start of the next) is useful in prose documents where the narrative flows continuously. In code it often makes things worse: the overlap introduces duplicate logic into separate chunks, crowding the embedding space with near-identical vectors. For code, the recommended approach is to store a "parent context" chunk separately: always inject the enclosing class signature or module-level import block alongside any function chunk, rather than copying the previous function's body into the current chunk. The open-source Continue extension uses this approach in its codebase indexing.
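A minimal sketch of the parent-context idea follows. The record layout, the function names, and the `embed` callable are hypothetical choices for illustration, not the data model of Continue or any other tool: the parent context influences the vector, while the stored chunk text stays untouched.

```python
def build_record(chunk_text, parent_context, embed):
    """Embed the chunk together with its parent context, but store them apart."""
    return {
        "vector": embed(f"{parent_context}\n\n{chunk_text}"),
        "text": chunk_text,        # returned by retrieval, unchanged
        "parent": parent_context,  # kept separately, injected alongside the chunk
    }

def render_for_prompt(records):
    """Emit each distinct parent context once, followed by its chunks."""
    seen, parts = set(), []
    for record in records:
        if record["parent"] not in seen:
            seen.add(record["parent"])
            parts.append(record["parent"])
        parts.append(record["text"])
    return "\n\n".join(parts)

captured = []

def fake_embed(text):
    """Placeholder for a real embedding call; records what it was given."""
    captured.append(text)
    return [float(len(text))]

parent = "class PaymentGateway:  # payments/gateway.py"
records = [
    build_record("def charge(self, amount): ...", parent, fake_embed),
    build_record("def refund(self, order_id): ...", parent, fake_embed),
]
rendered = render_for_prompt(records)
print(rendered)
```

Two methods of the same class share one copy of the class header in the prompt, so no tokens are spent on duplicated context.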
Retrieval Strategies: Dense, Sparse, and Why Code Needs Both
Once the codebase is indexed, retrieval is the step that determines which chunks are surfaced. There are two fundamentally different ways to search a corpus of text, and code needs both simultaneously.
Dense retrieval (semantic search)
Dense retrieval converts both the query and each stored chunk into a numeric vector (an embedding), then finds chunks whose vectors are closest to the query vector by cosine similarity. This approach is powerful because it can match meaning even when the exact words differ. A query like "how do we handle rate limit errors?" will surface functions named throttle_on_429 or backoff_retry — code that addresses the same concern using completely different identifiers.
The embedding model used matters a great deal for code. General-purpose text embedding models trained primarily on natural-language text tend to perform poorly on code because their training data underrepresents the structural and syntactic conventions of programming languages. Models whose training includes substantial code, such as the code-specialized voyage-code-3 or OpenAI's text-embedding-3-large trained with code data, produce substantially better representations for function bodies, type signatures, and API calls.
Sparse retrieval (keyword search / BM25)
BM25 is a classical information retrieval algorithm that scores documents by how well they match the exact terms in a query. It does not understand meaning; it counts words. But for code, exact keyword matching is often exactly what you want.
If the developer is working on a bug involving PaymentGateway.process_refund, a dense embedding search might return several semantically related functions — but the exact function they need might not score highest on semantic similarity. A BM25 search for the exact string process_refund will find it immediately. Similarly, error codes, configuration key names, and exact API method names are better retrieved lexically than semantically.
Query: "process_refund implementation"

BM25 (keyword) results:
PaymentGateway.process_refund: exact name match
RefundProcessor.execute: contains "refund"
validate_refund_amount: contains "refund"
Missed: reverse_charge (semantically related but no keyword match)

Dense (semantic) results:
reverse_charge: high semantic similarity
cancel_transaction: semantically related
PaymentGateway.process_refund: correct but ranked lower
Missed: REFUND_TIMEOUT_SEC = 30 (one-line constant; short chunk, sparse embedding)

Neither method alone captures all the relevant chunks.
For prose documents (product documentation, knowledge bases), dense retrieval typically outperforms BM25. For code, the opposite is often true on exact-identifier queries. Code has a high density of unique, domain-specific identifiers — function names, class names, constant names — that appear nowhere in the embedding model's training data and thus have poor semantic representations. For these, BM25 is reliably better. The right system runs both and combines the results.
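For reference, BM25 is small enough to write out. The sketch below uses the Okapi formula with a Lucene-style smoothed IDF and the common defaults k1 = 1.5 and b = 0.75; the identifier-aware tokenizer (keep `process_refund` whole and also emit `process` and `refund`) is an assumption of this example, not part of BM25 itself.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase identifiers, kept whole and also split on underscores."""
    tokens = []
    for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        word = word.lower()
        tokens.append(word)
        parts = [p for p in word.split("_") if p]
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

docs = [
    "def process_refund(self, order_id): return self.gateway.refund(order_id)",
    "def reverse_charge(txn): undo a completed charge",
    "def validate_refund_amount(amount): check the refund limit",
]
scores = bm25_scores("process_refund", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The exact identifier match wins, the partial match on "refund" comes second, and `reverse_charge` scores zero, which is exactly the blind spot that dense retrieval is meant to cover.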
Hybrid Search and Reciprocal Rank Fusion
Running both BM25 and dense retrieval solves the coverage problem — you surface both the exact-match candidates and the semantically similar ones. But now you have two ranked lists and need to combine them into a single ranked list to pass to the model. This is the rank fusion problem.
The challenge is that BM25 scores and cosine similarity scores live in completely different numerical ranges. BM25 produces scores that depend on corpus statistics (document length, term frequency, corpus size). Cosine similarity is bounded in [-1, 1]. You cannot add or average them directly in a meaningful way.
Reciprocal Rank Fusion (RRF)
RRF avoids the normalization problem entirely by ignoring raw scores and working only with ranks. The word "reciprocal" means 1/x — the score assigned to a document is the reciprocal of its rank in each list. A document ranked 1st gets a high score (1/61 with k=60); a document ranked 50th gets a low score (1/110). For each candidate chunk, the combined RRF score sums these reciprocals across all ranked lists:
$$\mathrm{RRF\_score}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$
Where:
R: the set of ranked lists (e.g., {BM25 list, dense retrieval list})
$\mathrm{rank}_r(d)$: the position of document d in ranked list r (1-indexed)
k: a smoothing constant (typically 60) that prevents a single top rank from dominating. The value 60 was empirically validated as robust across many retrieval tasks in the original paper by Cormack, Clarke & Buettcher (2009). Increasing k makes the formula more conservative: it rewards consistent mid-rank appearances over a single strong rank.
If a document does not appear in a list, its contribution from that list is 0.
The effect of the formula: a document that ranks 1st in the BM25 list and 1st in the dense list will have an RRF score of 1/(60+1) + 1/(60+1) ≈ 0.033. A document that ranks 1st in one list but 100th in the other will score 1/(60+1) + 1/(60+100) ≈ 0.023. Consistently high-ranked documents across multiple sources float to the top. Documents that are strong in only one source get discounted. This is the behavior you want: candidates that both BM25 and semantic search agree on are the most reliably relevant.
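The formula translates almost directly into code. A minimal sketch, assuming each ranked list is a plain list of document ids ordered best first; the example rankings reuse the identifiers from the refund query discussed earlier.

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Returns (doc_id, score) pairs, best first. A document missing from
    a list simply receives no contribution from it.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["process_refund", "RefundProcessor.execute", "validate_refund_amount"]
dense_ranking = ["reverse_charge", "cancel_transaction", "process_refund"]

fused = rrf([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{score:.4f}  {doc_id}")
```

`process_refund` is the only id present in both lists, so it rises to the top even though the dense ranking placed it third.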
When not to use RRF
RRF treats all ranked lists as equally authoritative. In practice you might know that for a particular query type — say, looking up a config constant by its exact name — BM25 should be weighted more heavily. Some systems implement weighted linear combination of normalized scores instead of RRF, using query classification to choose weights. This adds complexity but can improve precision for domain-specific query patterns.
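A weighted alternative might look like the sketch below. Min-max normalization is one of several possible choices, and the raw scores and the 0.8 / 0.2 weights are made-up values for illustration; in a real system a query classifier would pick the weight.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(bm25, dense, bm25_weight):
    """Weighted linear combination of normalized scores.

    A document missing from one list contributes 0 from that list.
    """
    b, d = min_max(bm25), min_max(dense)
    fused = {
        doc: bm25_weight * b.get(doc, 0.0) + (1 - bm25_weight) * d.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores for the query "REFUND_TIMEOUT_SEC".
bm25 = {"REFUND_TIMEOUT_SEC": 9.1, "process_refund": 4.0, "reverse_charge": 0.5}
dense = {"reverse_charge": 0.82, "process_refund": 0.80, "REFUND_TIMEOUT_SEC": 0.31}

print(weighted_fusion(bm25, dense, bm25_weight=0.8)[0][0])  # exact-name lookup
print(weighted_fusion(bm25, dense, bm25_weight=0.2)[0][0])  # conceptual query
```

The same two score lists produce different winners depending on the weight, which is the extra control (and the extra tuning burden) that RRF deliberately avoids.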
Reranking: The Final Sorting Pass
After hybrid search and RRF, you have a list of, say, 20 candidate chunks. You want to inject only the top 3–5 into the prompt. The question is whether the RRF ranking is good enough to trust for this final cut, or whether a second, more expensive sorting pass is worth running.
This is where cross-encoder rerankers come in. The retrieval methods used so far score the query and each document without letting them interact: BM25 compares term statistics, and dense search uses a bi-encoder that encodes the query and each document independently, then compares the resulting vectors. This is fast because the document embeddings are precomputed. But it means the query and document never meet during encoding, so the model cannot see the query while deciding what aspects of the document to emphasize.
A cross-encoder takes both the query and a candidate chunk as a single concatenated input and produces a relevance score. Because both texts go through the model at the same time, the model can attend to relationships between the query and the document that a bi-encoder cannot. The accuracy improvement is often significant — but the cost is that you cannot precompute: the cross-encoder must run on every query-candidate pair at inference time.
The practical solution is a two-stage architecture: use the fast first-stage pipeline (dense + BM25 + RRF) to get the top 20 candidates, then run a cross-encoder on only those 20 to get a final ranking. The cross-encoder sees a small, manageable candidate set and produces a high-quality relevance ordering. The top 5 from this final ranking go into the prompt.
Cross-encoders are themselves transformer models with context window limits. A reranker fed a 2,000-token code chunk and a 500-token query is processing a 2,500-token combined input. Most openly available reranker models (e.g. ms-marco-MiniLM-L-12) have 512-token limits. For code chunks, use a reranker with a larger context window (Cohere's Rerank 3 supports 4,096 tokens; voyage-rerank-2 supports 16K). If the combined chunk+query exceeds the reranker's limit, truncate the chunk from the end, not the beginning — the function signature and docstring at the top are more important for reranking than the implementation body.
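The second stage and the truncation rule can be sketched together. Everything here is a simplification: whitespace-separated words stand in for real tokens (a real tokenizer and its special tokens would use the budget differently), and `overlap_score` is a toy placeholder for an actual cross-encoder call.

```python
def truncate_chunk(chunk_tokens, query_tokens, limit):
    """Keep the start of the chunk (signature, docstring); drop the tail."""
    budget = max(limit - len(query_tokens), 0)
    return chunk_tokens[:budget]

def rerank(query, candidates, score_pair, limit=512, top_n=5):
    """Score each (query, chunk) pair jointly and keep the best top_n.

    score_pair is any callable(query_text, chunk_text) -> float; in a real
    system it would wrap a cross-encoder model.
    """
    query_tokens = query.split()
    scored = []
    for chunk in candidates:
        kept = truncate_chunk(chunk.split(), query_tokens, limit)
        scored.append((score_pair(query, " ".join(kept)), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Return the original, untruncated chunk text for the prompt.
    return [chunk for _, chunk in scored[:top_n]]

def overlap_score(query, chunk):
    """Toy stand-in scorer: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

candidates = [
    "def login(user, password): verify the password hash",
    "def process_refund(order): refund the order total",
    "def refund_policy(): return the refund window in days",
]
top = rerank("refund order total", candidates, overlap_score, top_n=2)
print(top)
```

Truncation only affects what the scorer sees; the chunk that reaches the prompt is still the full text.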
For general-purpose document retrieval, cross-encoder reranking often improves top-5 retrieval quality over first-stage retrieval alone, with gains on the order of 10–20% commonly cited. For code, the benefit depends on how well your embedding model was trained on code. If you are using a strong code-specific embedding model with good AST chunking, the hybrid first-stage retrieval alone may be good enough for most queries. Reranking becomes most valuable when queries are ambiguous or when the codebase has many semantically similar functions that need fine-grained disambiguation. It adds latency (typically 50–200ms), so benchmark before committing.
How Cursor Does It: A Reference Architecture
Cursor is the most architecturally transparent commercial code assistant. Its codebase indexing implementation has been publicly described in enough detail to serve as a concrete reference for the pipeline above.
Indexing
When you open a project in Cursor, it begins indexing the codebase in the background. Files are chunked locally on your machine. Chunks are then sent to Cursor's servers, where they are passed through an embedding model — either OpenAI's embedding API or a custom-trained model, depending on the feature context. The resulting vectors are stored in Turbopuffer, Cursor's vector store of choice. Metadata — file path, line ranges, and a hash of the chunk content — is stored alongside each vector. File paths are obfuscated client-side before any data leaves your machine.
Cursor caches embeddings by chunk hash. The second time you index the same codebase (or if most files haven't changed), it skips recomputing embeddings for unchanged chunks, making incremental re-indexing fast.
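Caching by content hash is straightforward to sketch. This is an illustration of the general technique, not Cursor's implementation: the `EmbeddingCache` class, the in-memory dictionary, and the choice of SHA-256 are assumptions of this example.

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding chunks whose content has not changed."""

    def __init__(self, embed):
        self._embed = embed   # callable(text) -> vector
        self._store = {}      # content hash -> vector
        self.misses = 0

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(chunk_text)
        return self._store[key]

calls = []

def fake_embed(text):
    """Placeholder for a real (slow, billed) embedding call."""
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
chunks = ["def a(): pass", "def b(): pass"]
for chunk in chunks:
    cache.get(chunk)                # first index: two embedding calls
for chunk in chunks:
    cache.get(chunk)                # re-index, nothing changed: no calls
cache.get("def b(): return 1")      # edited chunk hashes differently
print(len(calls))
```

Because the key is derived from the content, an edited chunk automatically misses the cache, which also covers the stale-embedding case without a separate invalidation step.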
Query signal construction
The query is not just "what the developer typed." Cursor monitors the active cursor position and constructs a composite signal from several sources: the current file's surrounding code at the cursor position, any open editor tabs (which it weights as likely-related files), and recent edit history. This composite signal is embedded into a query vector.
Retrieval and injection
The query vector is sent to Turbopuffer, which performs ANN search, returning the top-k most similar chunk vectors. Cursor's client receives the result with obfuscated file paths and line ranges, then reads the actual code from the local filesystem. The retrieved chunks are injected into the prompt sent to the LLM. The model never directly touches the vector database — it only sees the retrieved text.
When you type @Codebase in Cursor's chat, you are explicitly triggering the full RAG pipeline: embed the query, search the codebase index, retrieve top chunks, inject into prompt. Without @Codebase, Cursor uses a lighter heuristic — open tabs, recent edits, file imports — to construct the context. The @Codebase symbol is the manual override that says "do a full retrieval pass over the whole indexed codebase." Additional retrieval sources — @Docs (fetches indexed documentation) and @Web (live web search) — extend the same retrieval pipeline beyond the local codebase.
One important architectural note: the embedding model used to index the codebase is separate from the generative model used to produce completions. Cursor uses a lightweight, fast embedding model for indexing (optimized for latency and throughput over millions of chunks) and a larger, slower generative model for the actual completion. When building a similar system, these two components have independent optimization concerns — do not assume the same model serves both roles.
GitHub Copilot's approach
GitHub Copilot's context construction is less publicly documented but follows a similar pattern. For inline completion, it uses the current file content around the cursor (the prefix and suffix of the current file) plus a Jaccard similarity heuristic to find other open tabs that share significant token overlap with the current file. The @workspace symbol in VS Code triggers a more thorough indexing-based search of the workspace, analogous to Cursor's @Codebase.
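Jaccard similarity itself is simple: the size of the intersection of two token sets divided by the size of their union. The sketch below illustrates the heuristic in general terms; the tokenizer and the function names are assumptions, and nothing here is taken from Copilot's actual code.

```python
import re

def token_set(text):
    """Identifier-like tokens in a piece of source text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def jaccard(a, b):
    """Intersection size over union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_open_tabs(current_file, open_tabs):
    """Sort open tabs (name -> text) by token overlap with the current file."""
    current = token_set(current_file)
    scored = [(jaccard(current, token_set(text)), name)
              for name, text in open_tabs.items()]
    return sorted(scored, reverse=True)

current = "def process_refund(order_id):\n    return gateway.refund(order_id)"
tabs = {
    "gateway.py": "class Gateway:\n    def refund(self, order_id): ...",
    "README.md": "Project setup instructions",
}
ranked = rank_open_tabs(current, tabs)
print(ranked)
```

The appeal for inline completion is cost: this needs no embedding model and no index, only set operations over files that are already in memory.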
The key distinction: Copilot's default inline completion mode is a fast, low-latency path that does not run full vector retrieval on every keystroke. Full retrieval is reserved for explicit chat interactions. This is a deliberate latency-accuracy tradeoff — inline completion needs to respond in under 100ms to feel responsive; full RAG retrieval adds 200–500ms of overhead.
Tradeoffs and Limits of Code RAG
RAG is not a complete solution. Understanding its failure modes is essential for building a system that handles them gracefully.
| Scenario | RAG Behavior | Mitigation |
|---|---|---|
| Cross-file dependency reasoning | Each retrieved chunk is a fragment. The model may not understand how three retrieved functions compose at the call site. | Include file path + line range metadata; retrieve parent class or module-level imports alongside function bodies. |
| Newly created files not yet indexed | Files added after indexing are invisible to retrieval until the index is rebuilt. | Incremental indexing on file-save events. Maintain a "pending index" queue that is flushed every N seconds. |
| Query is too vague | A query like "fix the bug" produces a query vector close to everything and nothing. Retrieved chunks will be generic. | Prompt the user to specify the symbol or file. Use the cursor position + surrounding error message as the primary query signal. |
| Minified or generated code | Minified JS, protobuf generated code, and lock files produce chunks with very low semantic density. They pollute the index and retrieve irrelevant noise. | Maintain a .gitignore-style ignore list for the RAG indexer. Exclude node_modules, *.pb.go, build directories. |
| Very large monorepos | Indexing millions of files takes significant time and storage. Retrieval recall degrades on very large indices unless the ANN index structure is tuned. | Scope the index to the subdirectory the developer is currently working in, or use per-service sub-indices with routing logic. |
| Schema / type changes | If a type changes but the embedding was computed from the old version, retrieved chunks may give the model an outdated type signature. | Invalidate embeddings on file write. Use chunk content hash to detect staleness and trigger recompute. |
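As one example from the table, an ignore list for the indexer can be sketched with the standard library. The patterns `node_modules`, `build`, and `*.pb.go` come from the table; `dist`, `*.min.js`, and `*.lock` are additional assumptions, and matching each path component with `fnmatch` is a simplification of full .gitignore semantics.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Assumed ignore patterns; a real project would load these from a file.
IGNORE_PATTERNS = ["node_modules", "build", "dist", "*.pb.go", "*.min.js", "*.lock"]

def should_index(path):
    """False if any component of the path matches an ignore pattern."""
    parts = PurePosixPath(path).parts
    return not any(
        fnmatch(part, pattern) for part in parts for pattern in IGNORE_PATTERNS
    )

paths = [
    "src/payments/gateway.py",
    "web/node_modules/react/index.js",
    "api/v1/service.pb.go",
    "static/app.min.js",
    "yarn.lock",
]
print([p for p in paths if should_index(p)])
```

Filtering before chunking keeps low-signal files out of both the embedding bill and the candidate pool.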
The "big context window makes RAG obsolete" argument
As context windows grow to 1M and 10M tokens (Llama 4 Scout reached 10M tokens in 2025), a natural question arises: at some point, can we simply include the entire codebase in context and skip RAG entirely?
The answer, in practice, is no, for three reasons. First, even a modest codebase of 200,000 lines of Python comes to roughly 2 million tokens at about 10 tokens per line. Most production monorepos are far larger. Context windows are growing, but so are codebases. Second, as the sections on attention dilution and position bias established, attention quality degrades with length regardless of the window's nominal size limit. A 2M-token context window does not mean 2M tokens of equal-quality attention. Third, latency and cost penalties remain. A 1M-token prompt is extraordinarily expensive and slow to process, even on the best hardware. For interactive coding assistance that must respond in under a second, this is impractical.
The most defensible position: large context windows and RAG are complementary tools, not alternatives. RAG determines what to include; the context window determines how much you can include once you've been selective. A well-designed system uses RAG to retrieve the right 5,000 tokens from a 10M-token codebase and places them into a 128K context window alongside conversation history and tool outputs.
Building Your Own Code RAG Pipeline
If you are building a coding assistant, an internal developer tool, or a code-aware agent, here is a reference stack that covers the key decisions.
Parsing and chunking
Use tree-sitter with the appropriate grammar for each language. The tree-sitter-python, tree-sitter-typescript, and tree-sitter-go grammars are mature and battle-tested. Extract function definitions (function_definition node type in Python) and class definitions as primary chunk types. For functions longer than your embedding model's context limit, split at the statement level within the function body.
# AST chunk extraction with tree-sitter (py-tree-sitter v0.22+ API)
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

TARGET_TYPES = {"function_definition", "class_definition"}


def walk_tree(node, source_bytes: bytes, file_path: str, chunks: list) -> None:
    """Recursively walk the AST, emitting one chunk per class and one per
    function or method. Helpers nested inside a function stay part of
    their parent's chunk."""
    if node.type in TARGET_TYPES:
        # tree-sitter reports byte offsets, so slice the encoded source;
        # slicing the str directly would drift on non-ASCII input.
        chunk_text = source_bytes[node.start_byte:node.end_byte].decode("utf-8")
        chunks.append({
            "text": chunk_text,
            "file": file_path,
            "start_line": node.start_point[0],  # 0-based row
            "end_line": node.end_point[0],      # 0-based row
            "type": node.type,
        })
        # For class_definition, continue recursing to capture methods.
        # For function_definition, stop: we want the whole function,
        # not its nested helpers as separate chunks.
        if node.type == "class_definition":
            for child in node.children:
                walk_tree(child, source_bytes, file_path, chunks)
    else:
        for child in node.children:
            walk_tree(child, source_bytes, file_path, chunks)


def extract_chunks(source_code: str, file_path: str) -> list[dict]:
    source_bytes = source_code.encode("utf-8")
    tree = parser.parse(source_bytes)
    chunks: list[dict] = []
    walk_tree(tree.root_node, source_bytes, file_path, chunks)
    return chunks
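
The statement-level split for oversized functions mentioned above can be sketched as follows. To stay dependency-free, this example swaps tree-sitter for Python's standard-library `ast` module, so it handles Python source only. `split_function` and `max_lines` are illustrative names, and a line count stands in for the embedding model's token limit.

```python
import ast
import textwrap


def split_function(source: str, max_lines: int = 40) -> list[str]:
    """Split one function's source into pieces of at most max_lines lines,
    cutting only between top-level statements of the function body.

    The signature (and any decorators) is repeated at the top of every
    piece so each chunk still says which function it belongs to. A single
    statement longer than max_lines is kept whole in its own piece.
    """
    source = textwrap.dedent(source)
    func = ast.parse(source).body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected a single function definition")

    lines = source.splitlines()
    if len(lines) <= max_lines:
        return [source]

    header_end = func.body[0].lineno - 1  # index of first body line
    header = lines[:header_end]

    pieces: list[str] = []
    current: list[str] = []
    prev_end = header_end
    for stmt in func.body:
        # Include comments and blank lines that precede the statement.
        stmt_lines = lines[prev_end:stmt.end_lineno]
        prev_end = stmt.end_lineno
        if current and len(header) + len(current) + len(stmt_lines) > max_lines:
            pieces.append("\n".join(header + current))
            current = []
        current.extend(stmt_lines)
    pieces.append("\n".join(header + current))
    return pieces
```

Repeating the header costs a few tokens per piece, but it means a retrieved fragment never arrives without its function name and parameters.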
Embedding model selection
For a production code assistant, the embedding model choice matters significantly. Three strong options:
| Model | Context window | Strengths | When to use |
|---|---|---|---|
| voyage-code-3 | 32K tokens | Purpose-built for code; top-ranked on code retrieval benchmarks (2025) | Production code assistant, maximum retrieval quality |
| text-embedding-3-large | 8K tokens | Strong general performance; well-supported; large community | Mixed code + documentation retrieval; existing OpenAI integrations |
| nomic-embed-code | 8K tokens | Open-weight; can run locally; no API cost | Air-gapped environments; cost-sensitive deployments; on-prem |
Vector store
For a single-developer or small-team tool: pgvector in a local Postgres instance is often sufficient. For a service handling multiple users: Qdrant is a strong choice — it supports both dense and sparse vectors in a single collection, enabling native hybrid search without maintaining two separate stores.
BM25 index
For pure BM25, Tantivy (a Rust full-text search library modeled on Lucene) or Elasticsearch's BM25 are the standard options. For a simpler deployment, the Python rank_bm25 library is adequate for corpora under ~50,000 chunks.
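
To make the lexical side concrete, here is a minimal pure-Python sketch of Okapi BM25 over code chunks. It illustrates the scoring idea and is not a replacement for the libraries above. The tokenizer, which splits snake_case and camelCase identifiers, is an assumption of this example, as are the class name and the `k1` and `b` defaults. The IDF uses the always-non-negative Lucene-style variant.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase terms, splitting snake_case and camelCase identifiers."""
    terms: list[str] = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        for part in ident.split("_"):
            pieces = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+", part)
            terms.extend(p.lower() for p in pieces)
    return terms


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_terms = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(c.values()) for c in self.doc_terms]
        self.avg_len = (sum(self.doc_len) / max(1, len(docs))) or 1.0
        n = len(docs)
        # Document frequency: number of docs containing each term.
        df = Counter(term for c in self.doc_terms for term in c)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, idx: int) -> float:
        tf = self.doc_terms[idx]
        # Length normalization: long chunks need more matches to score well.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def top_k(self, query: str, k: int = 5) -> list[int]:
        """Indices of the k best-scoring docs, omitting docs with no match."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_terms))]
        hits = [(s, i) for s, i in scored if s > 0]
        return [i for _, i in sorted(hits, key=lambda p: (-p[0], p[1]))[:k]]
```

Splitting identifiers is what lets a query such as `processRefund` match a chunk that defines `process_refund`. A plain whitespace tokenizer would miss that match.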
The prompt injection template
Placement and formatting of retrieved context matter. A template that works well in practice:
You are a coding assistant for this codebase.
## Relevant context from the codebase:
### [payments/gateway.py · lines 42–87]
```python
{chunk_1_text}
```
### [payments/exceptions.py · lines 1–24]
```python
{chunk_2_text}
```
### [payments/models.py · lines 88–112]
```python
{chunk_3_text}
```
## Current task:
{user_request}
Including file path and line numbers in each chunk header gives the model two useful signals: the module structure of the codebase, and the ability to reference specific locations in its response. These headers cost very few tokens but significantly improve the quality of generated code that needs to import from or reference the retrieved files.
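
A small renderer for this template might look like the sketch below. It assumes chunk dictionaries shaped like those produced by the extraction code earlier (`text`, `file`, and 0-based `start_line` and `end_line`). The function name and the `lang` parameter are hypothetical.

```python
def render_prompt(chunks: list[dict], user_request: str, lang: str = "python") -> str:
    """Fill the prompt template with retrieved chunks and the user's task."""
    fence = "`" * 3  # built at runtime so this example needs no nested fences
    parts = [
        "You are a coding assistant for this codebase.",
        "",
        "## Relevant context from the codebase:",
    ]
    for chunk in chunks:
        # tree-sitter rows are 0-based; headers show 1-based line numbers.
        first = chunk["start_line"] + 1
        last = chunk["end_line"] + 1
        parts += [
            "",
            f"### [{chunk['file']} · lines {first}–{last}]",
            f"{fence}{lang}",
            chunk["text"],
            fence,
        ]
    parts += ["", "## Current task:", user_request]
    return "\n".join(parts)
```

Keeping the rendering in one function makes it easy to experiment with placement, for example moving the task above the context, without touching the retrieval code.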
It is tempting to inject 10–15 retrieved chunks to "give the model more information." Resist this. Each additional chunk increases the context size (paying the quadratic cost discussed in Section 4), increases the probability that the model attends to irrelevant material, and reduces the proportion of the context that is highly relevant. In practice, 3–5 high-quality chunks typically outperform 15 lower-quality ones. Invest in retrieval quality, not retrieval quantity.
The Through-Line
We started with a simple observation: AI code assistants give worse suggestions when given more context, and this is not a bug in the assistant — it is a predictable consequence of how transformer attention works. The softmax normalization at the heart of attention means every token competes for a fixed pool of attention weight. As the sequence grows, the signal-to-noise ratio of any particular piece of information falls. The "lost in the middle" position bias compounds this: information that lands in the middle of a long context is particularly vulnerable to being missed.
RAG is the systematic answer. By indexing the codebase offline — chunking at AST-defined boundaries, embedding with code-specialized models, building both dense and lexical indices — and by retrieving only the highest-signal chunks at query time, a well-designed system puts the model in the best possible epistemic position. The context window contains what matters. Attention is not wasted on thousands of lines that have nothing to do with the task.
The components covered here (AST chunking, hybrid BM25 + dense retrieval, RRF fusion, optional cross-encoder reranking, and careful prompt injection) are the building blocks of many production code assistant architectures today. Each one addresses a specific failure mode of the naive approach. Together they turn a context window from a limitation into a precision instrument.
A practical question: when does this actually matter? For a codebase under ~3,000–5,000 lines, context stuffing usually works well enough — you can fit the most relevant files in a 32K context window and the model will find what it needs. Above that threshold, the problems compound rapidly. At 50,000+ lines, naive context stuffing reliably degrades suggestion quality. At 500,000+ lines, it stops working altogether as even the most relevant file is one voice in an overwhelming crowd. RAG is not premature optimization at that scale — it is the only viable approach.
Code RAG retrieves relevant chunks within a session. But when a session ends, the AI's working understanding of the codebase (which files matter, which edge cases were found, what is still missing) is gone, and the next session rediscovers it from scratch. A two-phase benchmark run against the Apache Camel codebase (5,856 files, unfamiliar enterprise Java) demonstrated this concretely: the vanilla agent spent 51 tool calls re-exploring in Phase 2 and produced 0 bytes of output on one task. The same task with cross-session memory (structured notes stored during Phase 1 research, recalled at Phase 2 start) produced a complete 5-file implementation at 58% lower Phase 2 cost. Code RAG and working memory are complementary: RAG retrieves the right code at query time; memory preserves what was learned across the session boundary.
References & Further Reading
- Attention Is All You Need — Vaswani et al., 2017.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023.
- Found in the Middle: Calibrating Positional Attention Bias — He et al., 2024.
- Retrieval-Augmented Code Generation: A Survey — 2025.
- RAG vs Large Context Window: Real Trade-offs for AI Apps — Redis Engineering Blog.
- Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents — Shaped AI.
- RAG for LLM Code Generation using AST-Based Chunking — Vishnudhat Natarajan, Medium.
- Better Retrieval Beats Better Models for Large Codebases — Stéphane Derosiaux.
- How Cursor Actually Indexes Your Codebase — Towards Data Science.
- How GitHub Copilot Works — Quastor Engineering.
- What is Retrieval-Augmented Generation? — GitHub Blog.
- Why Cursor, Claude Code, and Devin Use grep, Not Vectors — MindStudio.
- BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production — Ranjan Kumar.
- Hybrid Search: BM25 and Dense Retrieval Combined — Michael Brenndoerfer.