Why does an AI code assistant give worse suggestions when the file is large?

Transformer attention is computed across all tokens simultaneously. As context grows, each token's attention weight gets divided among more tokens (softmax normalization). Information positioned in the middle of a long context also suffers from a U-shaped position bias: the model attends reliably to the beginning and end but degrades significantly on middle content. More tokens also increase attention compute quadratically, which raises latency and cost.

What is context window stuffing and why is it bad?

Context window stuffing means sending as much text as possible to the LLM hoping the relevant information is somewhere in it. It fails because the model's finite attention budget spreads thinner as context grows, quadratic compute cost makes inference slower and costlier, and the lost-in-the-middle effect means content placed in the middle of a long context is effectively ignored.

How does RAG help AI code assistants?

RAG (Retrieval-Augmented Generation) first indexes the codebase by chunking files into semantically meaningful units and embedding them as vectors. At query time, only the most relevant chunks are retrieved and injected into the prompt. This means the context window contains high-signal content — the right types, interfaces, and functions — rather than large swaths of unrelated code.

What is AST-based chunking and why does code need it?

Abstract Syntax Tree (AST) chunking uses a language parser to split code at natural structural boundaries (function definitions, class bodies, method groups) rather than at arbitrary token counts. Fixed-size chunking routinely cuts through the middle of a function, destroying the logical unit. AST chunking preserves complete, semantically coherent units that embedding models can represent more accurately.

What is the difference between BM25 and dense retrieval for code search?

BM25 is a lexical (keyword) retrieval algorithm. It excels at exact matches, such as finding a function by its precise name or a specific error code. Dense retrieval uses neural embeddings to find semantically similar code even when the query uses different words than the code. Code search needs both: exact identifier lookup (BM25) and conceptual similarity search (dense). Hybrid search runs both and fuses the two ranked result lists into one.

How does Cursor index a codebase for RAG?

Cursor chunks local files, sends chunks to its servers, embeds them using an embedding model, and stores the embeddings in a vector database (Turbopuffer). File paths are obfuscated before transmission. At query time, the cursor position and open tabs create a query, which is embedded and compared against stored chunk embeddings. Top-k matches are retrieved and inserted into the LLM prompt as context.

Will larger context windows make RAG obsolete?

No. Even with 1M-token context windows, stuffing an entire large codebase is impractical — a medium codebase of 500k lines easily exceeds 2M tokens. More importantly, attention quality degrades with length regardless of the window limit. RAG selects the right signal before sending to the model; a large window does not fix the fundamental attention dilution problem.

Deep Dive · AI Engineering

Why AI Code Assistants Waste Your Context Window — and How RAG Fixes It

A large context window does not mean a better context window. The math of attention explains why stuffing your whole codebase into a prompt is quietly destroying suggestion quality — and how retrieval-augmented generation fixes it by sending only the right code.

Date 24 May 2026

Read ~35 min

Demos 3 interactive

Sections 12

Open a large file in your AI code assistant and ask it to refactor a function buried three hundred lines down. Watch it confidently produce something plausible but wrong — using an interface that was deprecated last sprint, calling a helper that doesn't exist in this service, ignoring a constraint in the module-level docstring that it technically "saw." The model didn't forget. The information was technically present in the prompt, but the transformer's attention mechanism never meaningfully focused on it. That's a different kind of failure, and it doesn't get better with a bigger context window.

There's a persistent intuition in this industry that more context is always better. Send the whole file. Send the whole codebase. This intuition breaks in a specific and measurable way. The mechanism is called attention dilution — softmax normalization means that every token in the context competes for a fixed budget of attention weight, and as the sequence grows longer, any given piece of information gets a smaller share of that budget.

This post walks through the transformer attention math to explain exactly why the naive approach fails, then covers how RAG (Retrieval-Augmented Generation) addresses it — by retrieving only the specific code chunks relevant to the current task and injecting those into the context window instead of dumping everything.

Part 01

The Problem with Stuffing

The Naive Approach: Just Send Everything

The first instinct when building a code assistant is to send as much context as possible. Your project has a utility module? Include it. There's a shared type definitions file? Throw that in too. If the model's context window is 128,000 tokens, fill it to the brim — more information has to be better, right?

This is called context window stuffing. Three things go wrong with it, and each gets worse as the codebase grows. The first is attention dilution — the focus of this section. The second is position bias (Section 03). The third is raw cost (Section 04). To understand why these happen, you need a concrete model of how a transformer actually reads a prompt.

What the model actually receives

A transformer does not read a prompt sequentially, the way a human reads a page from left to right. Instead, it processes all tokens simultaneously, and every token attends to every other token in the sequence. The attention mechanism is the machine that computes how much each token should "look at" every other token when forming its representation.

The output of attention for a single token is a weighted average of all the other tokens' value vectors. The weights are computed by comparing the current token's query vector against every other token's key vector. When you add more tokens to the context, you are not adding more information to a receptive mind — you are adding more competitors for a fixed budget of attention weight.

Analogy

Imagine you are in a room full of people, all talking at once. You have exactly 100 percent of your attention to give, and that total does not grow with the number of people. With 5 people in the room, each gets roughly 20% of your focus. With 500, each gets 0.2%. By the time the relevant person says something, their share of your attention has collapsed into the noise. That is what happens to code buried in a long prompt.

Why Attention Dilutes: The Math

The attention mechanism was introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). Its core computation is:

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax( QK^T / √d_k ) · V

Where:

Q — the query matrix (what each token is "asking for")
K — the key matrix (what each token "offers" for comparison)
V — the value matrix (the actual content passed forward if selected)
d_k — the dimension of the key vectors (scales to prevent extreme dot products)
softmax — converts a vector of raw scores into a probability distribution that sums to 1

The notation QK^T means: for each token, compute a dot product between its query vector and every other token's key vector. The dot product is large when two vectors point in the same direction (high relevance between the pair), and near zero when they are orthogonal (unrelated). Multiplying by the transposed key matrix K^T does all N×N such comparisons in a single matrix operation. The result is a matrix of raw relevance scores. Dividing by √d_k prevents those scores from becoming so large that softmax saturates (all weight on one token).
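To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single query. The function names and the toy vectors are illustrative assumptions, not taken from any library or from a real model.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # One raw relevance score per key: dot(query, key) / sqrt(d_k)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy 2-dimensional example (made-up numbers)
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend(query, keys, values)
```

The key aligned with the query receives the largest weight, the orthogonal key a smaller one, and the opposing key the smallest, while the weights always sum to 1.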

The softmax step is the dilution mechanism. Because softmax always outputs a probability distribution — all values sum to exactly 1 — attention weights are a zero-sum resource. When there are N tokens in the context, the average attention weight is 1/N, regardless of what any individual token does. The total budget is fixed at 1.0.

This does not mean every token gets exactly equal attention — the model can still concentrate on a small subset if the dot-product scores separate those tokens sharply from the rest. Softmax is non-linear and can be quite aggressive when there is a large score gap between relevant and irrelevant tokens. But in a real codebase, that gap is rarely clean. Hundreds of unrelated function definitions produce hundreds of tokens with moderately non-zero dot products — they're not completely irrelevant, they just aren't what you need right now. These tokens collectively consume most of the softmax budget. The useful signal — the few functions that actually matter — must compete against this crowd, and as N grows, the signal's share degrades continuously. It isn't a cliff; it's a steady erosion that compounds with each additional file you stuff in.
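The erosion described above can be sketched with a deliberately simplified model: a handful of relevant tokens score `gap` higher than every other token, and all irrelevant tokens share the same score of 0. Both the uniform-noise assumption and the function name are illustrative; real score distributions are messier.

```python
import math

def relevant_share(n_tokens, gap, n_relevant=1):
    """Softmax mass on n_relevant tokens that score `gap` above the rest.

    Assumption: every irrelevant token has raw score 0, so each
    contributes exp(0) = 1 to the softmax denominator.
    """
    signal = n_relevant * math.exp(gap)
    noise = n_tokens - n_relevant
    return signal / (signal + noise)

# Same 10 relevant tokens, same score gap, growing context
shares = [relevant_share(n, gap=5.0, n_relevant=10)
          for n in (1_000, 10_000, 100_000)]
```

With the gap held fixed, the share falls steadily as the context grows, which is the continuous degradation rather than a cliff.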

Softmax sharpening and temperature

The scaling factor √d_k in the formula is there specifically to control how "peaky" the attention distribution is. Without it, dot products in high-dimensional spaces get very large, pushing softmax into saturation — where almost all weight goes to the single highest-scoring token. The scaling keeps the distribution smooth enough for gradients to flow during training. But the same property means that with a very long sequence, the model is doing softmax over hundreds of thousands of values. The distribution becomes nearly uniform across irrelevant tokens, and the useful signal struggles to dominate.

Key Insight

The context window limit is not just a practical engineering constraint — it reflects a genuine quality degradation. The problem is not that the model cannot read long inputs. It is that as context grows, every individual piece of information receives proportionally less attention weight. More input does not mean more comprehension; it means each fact competes harder for finite attentional resources.

Interactive Demo 01

Attention Budget Calculator

Observe how adding more tokens to a context window forces each useful signal to compete harder for attention weight. Adjust context size and the number of relevant chunks to see how the signal-to-noise ratio changes.

Inputs (values shown on the page): context tokens 8,000; relevant tokens in context 400; average attention "boost" 3.0×.

Outputs: baseline weight per token (1 / context tokens); relevant token weight (baseline boosted by the × factor); signal share (total attention on relevant chunks); noise share (attention on irrelevant tokens).

This is a simplified approximation. Real softmax is non-linear — the weight each token gets depends exponentially on its dot-product score relative to all others — so the actual signal share in a real model will vary based on how sharply the relevant tokens outscore the irrelevant ones. The directional prediction holds: more irrelevant tokens in context, lower signal share for the useful ones.
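For readers without the interactive page, the calculator's arithmetic can be approximated as follows. The exact formula used by the demo is not given in the text, so this sketch assumes that irrelevant tokens get unit weight, relevant tokens get `boost` times that, and everything is normalized to sum to 1.

```python
def attention_budget(context_tokens, relevant_tokens, boost):
    """Approximate signal/noise split under a simple boost model (assumed)."""
    baseline_weight = 1.0 / context_tokens
    # Un-normalized mass: relevant tokens count `boost` each, the rest count 1
    total_mass = relevant_tokens * boost + (context_tokens - relevant_tokens)
    relevant_weight = boost / total_mass
    signal_share = relevant_tokens * relevant_weight
    return {
        "baseline_weight": baseline_weight,
        "relevant_weight": relevant_weight,
        "signal_share": signal_share,
        "noise_share": 1.0 - signal_share,
    }

small = attention_budget(8_000, 400, 3.0)
large = attention_budget(32_000, 400, 3.0)
```

Under these assumptions, keeping the same 400 relevant tokens while growing the context from 8,000 to 32,000 tokens cuts the signal share from roughly 14% to under 4%.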

Lost in the Middle: Position Bias

Attention dilution is one problem. A second, independent problem compounds it: position bias. Modern language models do not attend to all positions in their context with equal reliability. They preferentially attend to tokens at the beginning and end of the sequence, and perform significantly worse on information placed in the middle.

This phenomenon was studied in a 2023 paper by Nelson Liu et al. titled Lost in the Middle: How Language Models Use Long Contexts. The researchers tested models on multi-document question answering, varying the position of the document containing the answer. The results were sharp: when the answer document was at position 1 (the very beginning) or the last position, accuracy was high. When it was at position 10 of 20 documents, accuracy dropped by more than 20 percentage points for some models, even though the information was technically within the model's context window.

Analogy

Think of a book report written by someone who only reads the first chapter and the last chapter, and skims the middle. Everything in between is "in the book" — but it functionally does not influence the report. That is what happens to code you inject into the middle of a long prompt. It exists in the context. It is not reliably attended to.

Why does this happen architecturally?

Two mechanisms contribute. The first is RoPE (Rotary Position Embeddings), the positional encoding scheme in most modern open-source language models (LLaMA, Mistral, GPT-NeoX). RoPE encodes position by rotating the query and key vectors by angles proportional to their positions. The dot product between a query at position m and a key at position n then depends on position only through the relative offset (m−n), and the encoding is designed so that this contribution tends to decay as the offset grows. Semantically relevant tokens far from the query position must overcome that penalty to receive attention weight. Tokens at the very start of the sequence have a separate structural advantage: under causal masking they are visible to every later position.

The second mechanism is causal training bias. Language models are trained to predict the next token given all previous tokens. This reward signal pushes models to weight recent tokens heavily — the immediately preceding context is almost always the most relevant signal for next-token prediction during training. The middle of a long context rarely dominated training gradients, so models systematically underweight it. This effect was documented in GPT-3.5 era models well before RoPE became standard — it isn't purely an artifact of positional encoding, it's baked into causal pretraining. Both effects run in the same direction: the middle of a long context is structurally disadvantaged.
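For the first mechanism, the rotation idea can be illustrated with a single 2-D pair. This is a toy sketch, not a model's actual implementation: `theta` and the vectors are made-up values, and real RoPE applies many such rotations at different frequencies across the head dimension. The sketch shows only that the score depends on the relative offset; the long-range decay emerges when many frequencies are combined and is not demonstrated here.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of a query at position m and a key at position n."""
    q_rot = rotate(q, m * theta)
    k_rot = rotate(k, n * theta)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

q, k = (1.0, 0.5), (0.8, -0.2)
near_start = rope_score(q, k, m=3, n=7)      # offset of 4 near the start
far_along = rope_score(q, k, m=103, n=107)   # same offset, 100 positions later
```

The two scores agree up to floating-point error because only the offset between the positions enters the dot product.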

A 2024 paper from the University of Washington, MIT, and Google (Found in the Middle) demonstrated that this bias can be partially corrected by calibrating attention weights at inference time — but this requires modifying the model's internals, something you cannot do when calling an API.

For practical purposes, if you are a user or a developer calling a commercial API, the position bias is a fixed environmental constraint. The implication is direct: you cannot mitigate it by stuffing more context. You can only mitigate it by keeping the prompt short and placing the relevant content near the beginning or the end of the context, which means retrieving it selectively rather than including everything and hoping.

Common Mistake

Many teams inject retrieved chunks at the end of the prompt, after a long system prompt and conversation history, reasoning that "the model will see it just before generating." This lands retrieved content in a position that gets the worst of both worlds: far from the beginning (losing the primacy advantage) and not at the very end (which is reserved for the generation target itself). The safest placement for retrieved code context is immediately before the user's specific question, near the end but not buried in the middle of a long history.

The Quadratic Cost Problem

Even if you were willing to accept degraded attention quality, there is a third reason not to stuff context: the compute cost of attention scales quadratically with sequence length.

To compute the full attention matrix, the model must compare every token's query against every other token's key. If your sequence has N tokens, this requires N × N comparisons — producing an N² attention matrix. Doubling the context length quadruples the compute required for attention. This is not a limitation of current hardware; it is a mathematical property of the full-attention transformer architecture.

Attention Complexity

Time complexity of full self-attention: O(N²·d)

Where N = sequence length, d = model dimension.

Concretely: a 4× increase in context length → 16× increase in attention compute. A 10× increase → 100×. This is why inference latency grows non-linearly as context length grows.

FlashAttention (Dao et al., 2022) reduces the extra memory needed for attention from O(N²) to O(N) by tiling: it never writes the full N×N matrix to GPU DRAM. But the number of floating-point operations is still O(N²), so attention compute, and with it latency and cost at long context lengths, still grows quadratically with sequence length.

In production, this means that a code assistant filling 100,000 tokens of context is not just 10× slower than one filling 10,000 tokens — it is closer to 100× more expensive in attention compute alone (before accounting for the rest of the forward pass). It is also 10× more expensive in the API pricing sense, since most APIs charge per input token. You are paying more to get worse results.

Interactive Demo 02

Context Size Cost Model

See how latency, relative attention compute cost, and token spend change as context size grows. The goal is not to memorize the numbers — it is to build intuition for the non-linear penalty you pay for stuffing context.

Inputs (values shown on the page): context size 8,000 tokens; cost per 1K input tokens $0.003; requests per day 5,000.

Outputs: relative attention compute (vs. an 8K baseline, quadratic); input token cost per request (at the selected price per 1K); daily input cost (all requests combined); monthly projection (30-day estimate).

Costs are input-token only. Output costs and infrastructure overhead add further. The quadratic attention compute ratio is exact; the financial projection assumes linear API pricing per token.
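The arithmetic behind this cost model is simple enough to sketch directly. The function names are illustrative, and the price, request volume, and 8K baseline are the example values shown above rather than any provider's actual pricing.

```python
def attention_compute_ratio(context_tokens, baseline_tokens=8_000):
    """Relative attention compute vs. a baseline context (quadratic in length)."""
    return (context_tokens / baseline_tokens) ** 2

def input_token_cost(context_tokens, price_per_1k, requests_per_day, days=30):
    """Input-token spend, assuming linear per-token API pricing."""
    per_request = context_tokens / 1000 * price_per_1k
    daily = per_request * requests_per_day
    return {"per_request": per_request, "daily": daily, "monthly": daily * days}

baseline_cost = input_token_cost(8_000, 0.003, 5_000)
stuffed_cost = input_token_cost(80_000, 0.003, 5_000)
stuffed_compute = attention_compute_ratio(80_000)
```

At these example values the 8,000-token configuration costs about $0.024 per request, $120 per day, and $3,600 per month in input tokens. Growing the context 10× multiplies the token bill by 10 and the attention compute by 100.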

Part 02

How RAG Fixes It

RAG at a Glance: The Core Idea

Retrieval-Augmented Generation reframes the problem. Instead of asking "how can we give the model the whole codebase?", it asks "how do we figure out which parts of the codebase are relevant to this specific completion request, and send only those?"

The answer has two phases. First, an offline indexing phase where the codebase is processed, divided into chunks, and each chunk is converted into a vector representation (an embedding) that captures its semantic meaning. These vectors are stored in an index optimized for fast similarity search. Second, an online retrieval phase that happens at query time: the developer's current context (cursor position, open file, recent changes) is converted into a query vector, and the most similar chunks from the index are retrieved and injected into the prompt.

The model then receives a context window that is not a random cross-section of the codebase — it is the small set of pieces most likely to be relevant to the task at hand. This keeps the context window small, focused, and full of signal rather than noise.

Offline · Once

1. Parse & Chunk

Parse each file with a language-aware parser. Split at natural boundaries — function definitions, class bodies — rather than arbitrary token counts.

Offline · Once

2. Embed Chunks

Pass each chunk through a code embedding model. Store the resulting vector alongside the chunk text, file path, and line range.

Offline · Once

3. Build Search Index

Store vectors in an ANN (Approximate Nearest Neighbor) index for fast retrieval. Build a BM25 lexical index in parallel.

Online · Per Request

4. Embed the Query

Convert the current cursor context into a query vector using the same embedding model.

Online · Per Request

5. Retrieve Top-k Chunks

Run hybrid search (dense + BM25). Fuse the ranked lists. Optionally rerank with a cross-encoder for final ordering.

Online · Per Request

6. Inject & Generate

Inject the top 3–5 retrieved chunks as context into the LLM prompt, immediately before the specific request. Generate the completion.

Steps 1–3 happen once (or on incremental file changes). Steps 4–6 happen on every completion request. The parts where most implementations go wrong: chunking (using fixed-size splits instead of AST boundaries), retrieval (using only dense search and missing exact identifier queries), and injection order (burying retrieved context in the middle of the prompt). Sections 06–09 cover each.
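As a rough illustration of step 6, the sketch below assembles a prompt with the retrieved chunks placed immediately before the request. The function name, the chunk dictionary fields, and the header format are hypothetical; a real assistant would follow its model's chat or completion template.

```python
def build_prompt(system_prompt, history, retrieved_chunks, request, max_chunks=5):
    """Place retrieved code right before the request, after any history."""
    parts = [system_prompt]
    parts.extend(history)
    for chunk in retrieved_chunks[:max_chunks]:
        # Label each chunk with its origin so the model can cite or locate it
        header = f"# {chunk['path']}:{chunk['start_line']}-{chunk['end_line']}"
        parts.append(f"{header}\n{chunk['text']}")
    parts.append(request)
    return "\n\n".join(parts)

chunks = [
    {"path": "payments/gateway.py", "start_line": 40, "end_line": 72,
     "text": "def process_refund(self, order):\n    ..."},
]
prompt = build_prompt(
    "You are a code assistant.",
    ["user: earlier question", "assistant: earlier answer"],
    chunks,
    "Refactor process_refund to retry on timeouts.",
)
```

Capping the number of chunks keeps the prompt small, and the ordering keeps retrieved code next to the request rather than in the middle of a long history.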

Chunking for Code: Why Fixed-Size Fails

The purpose of chunking is to divide the codebase into units that can be individually embedded and retrieved. The naive approach is fixed-size chunking: split every file every 256 tokens, regardless of where that lands in the code structure. It is easy to implement. It is almost always wrong for code.

Code has structure that text does not. A function is a unit of meaning. A class definition with its methods is a unit of meaning. Splitting in the middle of a function produces two chunks, each of which is semantically incomplete — one has the function signature and the early logic, the other has the return path. When embedded, neither chunk represents the function accurately. When retrieved, neither chunk gives the model what it needs to understand the function's contract.

Fixed-size chunking failure mode for code RAG

Consider a Python function that is 80 lines long. With a 50-token chunk size, it gets split into chunks that might look like:

Chunk A (bad split)

def process_payment(order_id, amount,
                    currency="USD"):
    """Process a payment..."""
    conn = get_db_connection()
    try:
        txn = conn.begin_transaction(

Chunk B (bad split)

            order_id=order_id,
            amount=amount,
            currency=currency
        )
    except DatabaseError as e:
        log_error(e)
        raise PaymentError(str(e))

Split mid-transaction block. Neither chunk makes sense in isolation. The embedding of Chunk A does not represent "a payment processing function" — it represents a truncated fragment.
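The failure is easy to reproduce. The sketch below is a naive fixed-size chunker that uses whitespace-separated words as a crude stand-in for model tokens (an assumption made for simplicity; a real pipeline would count tokens with the embedding model's tokenizer).

```python
def fixed_size_chunks(source, chunk_tokens):
    """Split text every `chunk_tokens` whitespace tokens, ignoring structure."""
    tokens = source.split()
    return [
        " ".join(tokens[i:i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]

SOURCE = '''def process_payment(order_id, amount, currency="USD"):
    conn = get_db_connection()
    txn = conn.begin_transaction(order_id=order_id, amount=amount)
    return txn.commit()
'''

chunks = fixed_size_chunks(SOURCE, chunk_tokens=6)
```

The first chunk ends partway through the `conn = ...` assignment and never reaches the `return`, so no single chunk describes what the function does.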

AST-based chunking

AST-based chunking uses a language parser (tree-sitter is a common choice) to parse each file and extract logical units. Instead of asking "where does the 256th token fall?", it asks "where are the function boundaries, class boundaries, and docstrings in this file?"

Each extracted unit becomes a chunk. A function definition — signature, body, docstring, and any inline comments — stays together. The chunk's metadata stores its file path, start line, end line, and the type of unit (function, class, method, module-level code). This metadata is as important as the chunk text itself: it tells the retrieval system where in the codebase this chunk lives, enabling it to surface related chunks from the same file.

Function chunk (AST)

def process_payment(order_id, amount, currency="USD"):
    """Process a payment for the given order.
    Raises PaymentError on failure."""
    ...
    # complete function body

Class chunk (AST)

class PaymentGateway:
    """Manages payment provider connections.
    Thread-safe singleton."""
    _instance = None
    ...

Import block (AST)

from payments.models import Order, Transaction
from payments.exceptions import PaymentError
from db import get_db_connection

Each chunk is a complete semantic unit. When retrieved, the model gets a coherent piece of the codebase — not a fragment.
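A minimal sketch of the idea for Python files only, using the standard library's `ast` module as a stand-in for tree-sitter (a swapped-in parser chosen so the example runs without extra dependencies). The chunk dictionary layout is an assumption, and nested methods, decorators, and comments between definitions are not handled.

```python
import ast

def ast_chunks(source, path="<memory>"):
    """Split a Python file into top-level function, class, and import chunks."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            kind = "import"
        else:
            kind = "module"
        start, end = node.lineno, node.end_lineno
        # Merge consecutive import statements into a single import block
        if kind == "import" and chunks and chunks[-1]["kind"] == "import":
            chunks[-1]["end_line"] = end
            first = chunks[-1]["start_line"]
            chunks[-1]["text"] = "\n".join(lines[first - 1:end])
            continue
        chunks.append({
            "path": path, "kind": kind,
            "start_line": start, "end_line": end,
            "text": "\n".join(lines[start - 1:end]),
        })
    return chunks

SOURCE = '''from db import get_db_connection
import logging

def process_payment(order_id, amount, currency="USD"):
    """Process a payment for the given order."""
    conn = get_db_connection()
    return conn

class PaymentGateway:
    """Manages payment provider connections."""
    _instance = None
'''

chunks = ast_chunks(SOURCE, path="payments/service.py")
```

Each chunk carries its kind and line range, which is the metadata the retrieval layer needs to surface neighbours from the same file.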

On chunk size

AST chunking can produce variable-size chunks. A 10-line function and a 200-line class body are both valid chunks under this scheme. For very large functions or classes, a secondary split is sometimes applied — either at method boundaries within the class, or using a maximum token limit with the caveat that the split occurs at a statement boundary, never mid-expression. The right upper bound depends on the embedding model's context window (typically 512–8,192 tokens) and the target chunk granularity.

Sliding-window augmentation

One practical addition: after AST chunking, each chunk is sometimes augmented with a small surrounding context — the preceding import block, the class it belongs to, or the file's module-level docstring. This gives the embedding model enough context to produce a vector that reflects not just the chunk's local meaning but its role in the larger structure. The key is that this surrounding context is used only for embedding, not retrieved as part of the chunk text. The retrieved text remains the original chunk.

The overlap trap in code

Sliding window overlap (copying N tokens from one chunk as the start of the next) is useful in prose documents where the narrative flows continuously. In code it often makes things worse: the overlap introduces duplicate logic into separate chunks, making embedding space crowded with near-identical vectors. For code, the recommended approach is to store a "parent context" chunk separately — always inject the enclosing class signature or module-level import block alongside any function chunk, rather than copying the previous function's body into the current chunk. The Continue open-source IDE extension uses this approach in its codebase indexing.

Retrieval Strategies: Dense, Sparse, and Why Code Needs Both

Once the codebase is indexed, retrieval is the step that determines which chunks are surfaced. There are two fundamentally different ways to search a corpus of text, and code needs both simultaneously.

Dense retrieval: semantic code search with embeddings

Dense retrieval converts both the query and each stored chunk into a numeric vector (an embedding), then finds chunks whose vectors are closest to the query vector by cosine similarity. This approach is powerful because it can match meaning even when the exact words differ. A query like "how do we handle rate limit errors?" will surface functions named throttle_on_429 or backoff_retry — code that addresses the same concern using completely different identifiers.

The embedding model used matters a great deal for code. General-purpose text embedding models (trained primarily on natural-language text) perform poorly on code because their training data underrepresents the structural and syntactic conventions of programming languages. Code-specialized models like voyage-code-3 — purpose-built for code retrieval — produce substantially better representations for function bodies, type signatures, and API calls than general models like text-embedding-3-large, which is a strong general-purpose embedding model but wasn't specifically designed around code.
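The similarity search at the heart of dense retrieval can be sketched in a few lines. The three-dimensional vectors below are hand-written placeholders, not output from any embedding model, and a production index would use an approximate nearest-neighbour structure instead of this brute-force scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k chunk ids whose vectors are closest to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical embeddings: the first axis loosely means "rate limiting"
index = {
    "throttle_on_429": [0.9, 0.1, 0.0],
    "backoff_retry": [0.8, 0.3, 0.1],
    "render_invoice": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for "how do we handle rate limit errors?"
results = top_k(query_vec, index, k=2)
```

The two retry-related chunks come back first even though neither identifier shares a word with the query, which is the property that makes dense retrieval useful for conceptual questions.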

Sparse retrieval: BM25 keyword search for exact identifiers

BM25 is a classical information retrieval algorithm that scores documents by how well they match the exact terms in a query. It does not understand meaning; it counts words. But for code, exact keyword matching is often exactly what you want.

If the developer is working on a bug involving PaymentGateway.process_refund, a dense embedding search might return several semantically related functions — but the exact function they need might not score highest on semantic similarity. A BM25 search for the exact string process_refund will find it immediately. Similarly, error codes, configuration key names, and exact API method names are better retrieved lexically than semantically.
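For reference, a compact BM25 scorer fits in a few lines of Python. This is a sketch under stated assumptions: it uses the common Okapi form with `k1=1.5`, `b=0.75`, and a Lucene-style IDF, and a tokenizer that splits identifiers on underscores and punctuation. Real code search engines usually add camelCase splitting and whole-identifier matching on top.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

DOCS = [
    "def process_refund(self, order): return self.gateway.refund(order)",
    "def reverse_charge(txn): txn.undo()",
    "def validate_refund_amount(amount): return amount > 0",
]
scores = bm25_scores("process_refund", DOCS)
```

The exact-name match scores highest, the chunk that merely contains "refund" scores lower, and `reverse_charge` scores zero because it shares no terms with the query.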

BM25 (Lexical)

rank 1 PaymentGateway.process_refund — exact name match

rank 2 RefundProcessor.execute — contains "refund"

rank 3 validate_refund_amount — contains "refund"

miss reverse_charge — semantically related but no keyword match

Dense (Semantic)

rank 1 reverse_charge — high semantic similarity

rank 2 cancel_transaction — semantically related

rank 3 PaymentGateway.process_refund — correct but ranked lower

miss REFUND_TIMEOUT_SEC = 30 — one-line constant; short chunk, sparse embedding

Query: "process_refund implementation" — neither method alone captures all the relevant chunks.

The code search asymmetry

For prose documents (product documentation, knowledge bases), dense retrieval typically outperforms BM25. For code, the opposite is often true on exact-identifier queries. Code has a high density of unique, domain-specific identifiers — function names, class names, constant names — that appear nowhere in the embedding model's training data and thus have poor semantic representations. For these, BM25 is reliably better. The right system runs both and combines the results.

Hybrid Search and Reciprocal Rank Fusion

Running both BM25 and dense retrieval solves the coverage problem — you surface both the exact-match candidates and the semantically similar ones. But now you have two ranked lists and need to combine them into a single ranked list to pass to the model. This is the rank fusion problem.

The challenge is that BM25 scores and cosine similarity scores live in completely different numerical ranges. BM25 produces scores that depend on corpus statistics (document length, term frequency, corpus size). Cosine similarity is bounded in [-1, 1]. You cannot add or average them directly in a meaningful way.

Reciprocal Rank Fusion (RRF)

RRF avoids the normalization problem entirely by ignoring raw scores and working only with ranks. The word "reciprocal" means 1/x — the score assigned to a document is the reciprocal of its rank in each list. A document ranked 1st gets a high score (1/61 with k=60); a document ranked 50th gets a low score (1/110). For each candidate chunk, the combined RRF score sums these reciprocals across all ranked lists:

Reciprocal Rank Fusion Score

RRF_score(d) = Σ_{r ∈ R} 1 / (k + rank_r(d))

Where:

R — the set of ranked lists (e.g., {BM25 list, dense retrieval list})
rank_r(d) — the position of document d in ranked list r (1-indexed)
k — a smoothing constant (typically 60) that prevents a single top rank from dominating. The value 60 was empirically validated as robust across many retrieval tasks in the original paper by Cormack, Clarke & Buettcher (2009). Increasing k makes the formula more conservative — it rewards consistent mid-rank appearances over a single strong rank.
If a document does not appear in a list, its contribution from that list is 0.

The effect of the formula: a document that ranks 1st in the BM25 list and 1st in the dense list will have an RRF score of 1/(60+1) + 1/(60+1) ≈ 0.033. A document that ranks 1st in one list but 100th in the other will score 1/(60+1) + 1/(60+100) ≈ 0.023. Consistently high-ranked documents across multiple sources float to the top. Documents that are strong in only one source get discounted. This is the behavior you want: candidates that both BM25 and semantic search agree on are the most reliably relevant.
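The formula translates almost directly into code. The sketch below fuses any number of ranked lists of chunk ids. The function name is illustrative, and the example lists reuse the BM25 and dense rankings from the refund example earlier in this post.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion over ranked lists of document ids."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] += 1.0 / (k + rank)
    # Documents missing from a list simply contribute nothing for that list
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = [
    "PaymentGateway.process_refund",
    "RefundProcessor.execute",
    "validate_refund_amount",
]
dense_ranking = [
    "reverse_charge",
    "cancel_transaction",
    "PaymentGateway.process_refund",
]
fused = rrf([bm25_ranking, dense_ranking])
```

`PaymentGateway.process_refund` is the only chunk present in both lists, so it ends up first with a score of 1/61 + 1/63, ahead of `reverse_charge`, which is ranked first in only one list.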

Interactive Demo 03

Reciprocal Rank Fusion Visualizer

See how RRF combines a BM25 result list and a semantic search result list into a single merged ranking. Notice how chunks that appear in both lists get promoted, while chunks that dominate only one method get discounted. Adjust the k slider: higher k makes the formula less aggressive about rewarding top-ranked results, smoothing the scoring distribution; lower k makes a #1 rank in either list more decisive.


Real hybrid search uses 10–50 candidates per list before fusion. This demo uses 6 per list for clarity.

When not to use RRF for code retrieval

RRF treats all ranked lists as equally authoritative. In practice you might know that for a particular query type — say, looking up a config constant by exact name — BM25 should dominate. Some teams implement weighted linear combination of normalized scores (min-max or z-score) instead of RRF, using a lightweight query classifier to set weights. BM25-heavy for exact identifier lookup; dense-heavy for conceptual queries like "how does this service handle retries." This adds real complexity and you should only go there if you have enough query traffic to measure the improvement. RRF with k=60 is a surprisingly robust default.
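A sketch of the weighted alternative, assuming min-max normalization and a single `bm25_weight` chosen by some upstream query classifier (not shown). The scores in the example are invented for illustration.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} mapping into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(bm25, dense, bm25_weight):
    """Rank documents by a weighted sum of normalized BM25 and dense scores."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    fused = {
        doc: bm25_weight * bm25_n.get(doc, 0.0)
             + (1 - bm25_weight) * dense_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(dense_n)
    }
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical raw scores for one query
bm25 = {"REFUND_TIMEOUT_SEC": 12.0, "process_refund": 4.0, "reverse_charge": 0.5}
dense = {"reverse_charge": 0.82, "process_refund": 0.74, "REFUND_TIMEOUT_SEC": 0.31}

exact_lookup = weighted_fusion(bm25, dense, bm25_weight=0.8)
conceptual = weighted_fusion(bm25, dense, bm25_weight=0.2)
```

With a BM25-heavy weight the exact constant wins; with a dense-heavy weight the semantically closest function wins. Min-max normalization is sensitive to outliers, which is one reason rank-based fusion tends to be the safer default.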

Reranking: The Final Sorting Pass

After hybrid search and RRF, you have a list of, say, 20 candidate chunks. You want to inject only the top 3–5 into the prompt. The question is whether the RRF ranking is good enough to trust for this final cut, or whether a second, more expensive sorting pass is worth running.

This is where cross-encoder rerankers come in. Dense embedding search is a bi-encoder approach: it encodes the query and each document independently, then compares the resulting vectors. This is fast because the document embeddings are precomputed. BM25 is not a neural encoder at all, but it shares the same limitation, since it scores each document from a prebuilt term index. In both cases the query and document never interact inside a model: nothing can see the query while deciding which aspects of the document to emphasize.

A cross-encoder takes both the query and a candidate chunk as a single concatenated input and produces a relevance score. Because both texts go through the model at the same time, the model can attend to relationships between the query and the document that a bi-encoder cannot. The accuracy improvement is often significant — but the cost is that you cannot precompute: the cross-encoder must run on every query-candidate pair at inference time.

The practical solution is a two-stage architecture: use the fast first-stage pipeline (dense + BM25 + RRF) to get the top 20 candidates, then run a cross-encoder on only those 20 to get a final ranking. The cross-encoder sees a small, manageable candidate set and produces a high-quality relevance ordering. The top 5 from this final ranking go into the prompt.
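
The second stage can be sketched as a function that takes any pairwise scorer. The `toy_pair_scorer` below is only a placeholder so the example runs without a model. In a real pipeline, `score_pair` would wrap a cross-encoder's prediction call. All names here are hypothetical.

```python
from typing import Callable, Sequence

def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_pair_scorer(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    words = query.lower().split()
    return sum(w in chunk.lower() for w in words) / max(len(words), 1)

# Pretend these came out of the first-stage pipeline, in RRF order.
candidates = [
    "def retry_request(url): ...",
    "def parse_config(path): ...",
    "def request_with_backoff(url, retry=3): ...",
]
top = rerank("retry backoff", candidates, toy_pair_scorer, top_n=2)
```

The point of the shape is that the expensive scorer only ever sees the short candidate list, never the whole index.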

Cross-encoder context window limits

Cross-encoders are themselves transformer models with context window limits. A reranker fed a 2,000-token code chunk and a 500-token query is processing a 2,500-token combined input. General-purpose reranker models like ms-marco-MiniLM-L-12-v2 support 512 subword tokens for the query and chunk combined, which is often enough for a single short function but not for large class bodies or files. For retrieval pipelines that surface large chunks, use a reranker with a larger window: Cohere Rerank 3 supports 4,096 tokens, and Voyage AI's rerank-2 supports 16K. If the combined chunk+query still exceeds the limit, truncate the chunk from the bottom, not the top: the function signature and docstring are more informative for reranking than the implementation tail.
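
A minimal sketch of bottom-first truncation follows. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption. A real implementation would count with the reranker's own tokenizer and leave room for special tokens.

```python
def truncate_for_reranker(query, chunk, max_tokens=512,
                          count_tokens=lambda text: len(text.split())):
    """Keep whole lines from the top of the chunk until query + chunk fit."""
    budget = max_tokens - count_tokens(query)
    kept, used = [], 0
    for line in chunk.splitlines():
        cost = count_tokens(line)
        if used + cost > budget:
            break  # everything from here down is dropped
        kept.append(line)
        used += cost
    return "\n".join(kept)

chunk = "\n".join([
    "def charge(card, amount):",
    "    # Charge the card through the gateway.",
    "    validate(card)",
    "    return gateway.submit(card, amount)",
])
# A tiny budget forces truncation; the signature and comment survive.
short = truncate_for_reranker("charge card", chunk, max_tokens=13)
```

Cutting on line boundaries keeps the surviving text syntactically readable, at the cost of sometimes leaving a few tokens of budget unused.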

Is reranking necessary for code?

For general-purpose document retrieval, cross-encoder reranking consistently improves recall@5 by 10–20% over bi-encoder retrieval alone. For code, the benefit depends on how well your embedding model was trained on code. If you are using a strong code-specific embedding model with good AST chunking, the hybrid bi-encoder retrieval alone may be good enough for most queries. Reranking becomes most valuable when queries are ambiguous or when the codebase has many semantically similar functions that need fine-grained disambiguation. It adds latency (typically 50–200ms), so benchmark before committing.

Part 03

In Production

How Cursor Does It: A Reference Architecture

Cursor is the most architecturally transparent commercial code assistant. Its codebase indexing implementation has been publicly described in enough detail to serve as a concrete reference for the pipeline above.

Indexing

When you open a project in Cursor, it begins indexing the codebase in the background. Files are chunked locally on your machine. Chunks are then sent to Cursor's servers, where they are passed through an embedding model — either OpenAI's embedding API or a custom-trained model, depending on the feature context. The resulting vectors are stored in Turbopuffer, Cursor's vector store of choice. Metadata — file path, line ranges, and a hash of the chunk content — is stored alongside each vector. File paths are obfuscated client-side before any data leaves your machine.

Cursor caches embeddings by chunk hash. The second time you index the same codebase (or if most files haven't changed), it skips recomputing embeddings for unchanged chunks, making incremental re-indexing fast.
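
The caching idea can be sketched in a few lines. This is not Cursor's implementation: the SHA-256 key, the in-memory dict, and `fake_embed` are all assumptions chosen to keep the example self-contained.

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by a hash of the chunk text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._by_hash = {}
        self.misses = 0  # number of times the embedding function actually ran

    def embed(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key not in self._by_hash:
            self.misses += 1
            self._by_hash[key] = self.embed_fn(chunk_text)
        return self._by_hash[key]

def fake_embed(text):
    """Placeholder for a real embedding call."""
    return [float(len(text)), float(text.count("\n"))]

cache = EmbeddingCache(fake_embed)
first_pass = [cache.embed(c) for c in ["def a(): pass", "def b(): pass"]]
# Second indexing run: one chunk unchanged, one edited.
second_pass = [cache.embed(c) for c in ["def a(): pass", "def b(): return 1"]]
print(cache.misses)
```

Only the edited chunk triggers a new embedding call on the second pass. Keying on content rather than file path also means a moved or renamed file costs nothing to re-index.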

Query signal construction

The query is not just "what the developer typed." Cursor monitors the active cursor position and constructs a composite signal from several sources: the current file's surrounding code at the cursor position, any open editor tabs (which it weights as likely-related files), and recent edit history. This composite signal is embedded into a query vector.

Retrieval and injection

The query vector is sent to Turbopuffer, which performs ANN search, returning the top-k most similar chunk vectors. Cursor's client receives the result with obfuscated file paths and line ranges, then reads the actual code from the local filesystem. The retrieved chunks are injected into the prompt sent to the LLM. The model never directly touches the vector database — it only sees the retrieved text.

What @Codebase does

When you type @Codebase in Cursor's chat, you're explicitly triggering the full RAG pipeline: embed the query, search the codebase index, retrieve top chunks, inject into prompt. Without @Codebase, Cursor uses a lighter heuristic — open tabs, recent edits, file imports — to construct context. The @Codebase symbol is the manual override that triggers a full retrieval pass over the whole indexed codebase rather than just what's visible in your editor. Two other retrieval sources extend the same pipeline beyond local code: @Docs (searches indexed documentation) and @Web (live web search).

One important architectural note: the embedding model used to index the codebase is separate from the generative model used to produce completions. Cursor uses a lightweight, fast embedding model for indexing (optimized for latency and throughput over millions of chunks) and a larger, slower generative model for the actual completion. When building a similar system, these two components have independent optimization concerns — do not assume the same model serves both roles.

GitHub Copilot's context window and retrieval approach

GitHub Copilot's context construction is less publicly documented but follows a similar pattern. For inline completion, it uses the current file content around the cursor (the prefix and suffix of the current file) plus a Jaccard similarity heuristic to find other open tabs that share significant token overlap with the current file. The @workspace symbol in VS Code triggers a more thorough indexing-based search of the workspace, analogous to Cursor's @Codebase.
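
As an illustration of the open-tab heuristic, the sketch below ranks tabs by Jaccard similarity over identifier-like tokens. The tokenization rule and whole-file comparison are assumptions, since the text above does not specify how Copilot tokenizes or windows the files.

```python
import re

def tokens(text):
    """Set of identifier-like tokens in a piece of source text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_open_tabs(current_file, open_tabs):
    """Return (similarity, path) pairs, most similar tab first."""
    cur = tokens(current_file)
    scored = [(jaccard(cur, tokens(text)), path)
              for path, text in open_tabs.items()]
    return sorted(scored, reverse=True)

current = "def charge(card, amount): return gateway.submit(card, amount)"
open_tabs = {
    "gateway.py": "def submit(card, amount): return api.post(card, amount)",
    "README.md": "Install with pip and run the tests",
}
ranked = rank_open_tabs(current, open_tabs)
```

The heuristic needs no index and no model call, which is why it suits a latency-sensitive completion path.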

The key distinction: Copilot's default inline completion mode is a fast, low-latency path that does not run full vector retrieval on every keystroke. Full retrieval is reserved for explicit chat interactions. This is a deliberate latency-accuracy tradeoff — inline completion needs to respond in under 100ms to feel responsive; full RAG retrieval adds 200–500ms of overhead.

Tradeoffs and Limits of Code RAG

RAG breaks in specific, predictable ways. If you're building a production system, you'll hit these eventually. Here's what to expect.

Scenario	RAG Behavior	Mitigation
Cross-file dependency reasoning	Each retrieved chunk is a fragment. The model may not understand how three retrieved functions compose at the call site.	Include file path + line range metadata; retrieve parent class or module-level imports alongside function bodies.
Newly created files not yet indexed	Files added after indexing are invisible to retrieval until the index is rebuilt.	Incremental indexing on file-save events. Maintain a "pending index" queue that is flushed every N seconds.
Query is too vague	A query like "fix the bug" produces a query vector close to everything and nothing. Retrieved chunks will be generic.	Prompt the user to specify the symbol or file. Use the cursor position + surrounding error message as the primary query signal.
Minified or generated code	Minified JS, protobuf generated code, and lock files produce chunks with very low semantic density. They pollute the index and retrieve irrelevant noise.	Maintain a .gitignore-style ignore list for the RAG indexer. Exclude `node_modules`, `*.pb.go`, build directories.
Very large monorepos	Indexing millions of files takes significant time and storage. Retrieval recall degrades on very large indices unless the ANN index structure is tuned.	Scope the index to the subdirectory the developer is currently working in, or use per-service sub-indices with routing logic.
Schema / type changes	If a type changes but the embedding was computed from the old version, retrieved chunks may give the model an outdated type signature.	Invalidate embeddings on file write. Use chunk content hash to detect staleness and trigger recompute.
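
For the minified and generated code row, an indexer-side ignore filter might look like the following. The specific directory names and patterns are examples only, not a recommended list.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Example entries only; tune these per repository.
IGNORE_DIRS = {"node_modules", "build", "dist"}
IGNORE_FILES = ["*.pb.go", "*.min.js", "*.lock", "package-lock.json"]

def should_index(path: str) -> bool:
    """Return False for paths the RAG indexer should skip."""
    parts = PurePosixPath(path).parts
    # Skip anything that lives under an ignored directory at any depth.
    if any(part in IGNORE_DIRS for part in parts[:-1]):
        return False
    # Skip files whose name matches a generated or minified pattern.
    return not any(fnmatchcase(parts[-1], pattern) for pattern in IGNORE_FILES)

paths = [
    "src/payments/gateway.py",
    "web/node_modules/react/index.js",
    "api/v1/service.pb.go",
    "yarn.lock",
]
indexable = [p for p in paths if should_index(p)]
```

Filtering before chunking keeps low-density files out of the index entirely, so they never compete for a retrieval slot.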

Does a larger context window make RAG for code obsolete?

As context windows grow to 1M and beyond — Llama 4 Scout hit 10M tokens in 2025, Gemini 1.5 Pro supported 1M — people keep asking whether RAG becomes irrelevant at some point. Worth addressing directly.

The practical answer is no, though the reasoning matters. A 200,000-line Python codebase easily exceeds 2 million tokens. Most production monorepos are far larger. Context windows are growing, but so are codebases, and they're not converging. More importantly, the attention quality degradation described in Sections 02 and 03 doesn't disappear with a larger nominal window — a 2M-token context window doesn't deliver 2M tokens of equal-quality attention. Those long-context models achieve their range through techniques like sparse attention patterns and NTK-aware RoPE scaling, which help with extrapolation but don't eliminate the position bias at extremely long ranges. And practically: a 1M-token prompt is expensive and slow even on state-of-the-art hardware. For interactive code assistance that needs to respond within a second, stuffing the full codebase is off the table regardless of window size.

Large context windows and RAG do different jobs. RAG decides what deserves to be in the context window. The context window determines how much you can fit once you've been selective. A well-tuned system retrieves the right 5,000 tokens from a 10M-token codebase and puts them in a 128K window with room left for conversation history and tool outputs. That's the correct framing.

Building Your Own Code RAG Pipeline

If you are building a coding assistant, an internal developer tool, or a code-aware agent, here is a reference stack that covers the key decisions.

Parsing and chunking

Use tree-sitter with the appropriate grammar for each language. The tree-sitter-python, tree-sitter-typescript, and tree-sitter-go grammars are mature and battle-tested. Extract function definitions (function_definition node type in Python) and class definitions as primary chunk types. For functions longer than your embedding model's context limit, split at the statement level within the function body.

# AST chunk extraction with tree-sitter (py-tree-sitter 0.22 or newer;
# older releases construct Language and Parser differently)
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

TARGET_TYPES = {"function_definition", "class_definition"}

def walk_tree(node, source_bytes: bytes, file_path: str, chunks: list):
    """Recursively walk the AST and collect classes, their methods, and
    functions. Helpers nested inside a function stay in that function's chunk."""
    if node.type in TARGET_TYPES:
        # tree-sitter reports byte offsets, so slice the encoded source.
        # Slicing the str would drift as soon as the file has non-ASCII text.
        chunk_text = source_bytes[node.start_byte:node.end_byte].decode("utf-8")
        chunks.append({
            "text": chunk_text,
            "file": file_path,
            "start_line": node.start_point[0] + 1,  # rows are 0-based
            "end_line": node.end_point[0] + 1,
            "type": node.type,
        })
        # For class_definition, continue recursing to capture methods.
        # For function_definition, stop: we want the whole function,
        # not its nested helpers as separate chunks.
        if node.type == "class_definition":
            for child in node.children:
                walk_tree(child, source_bytes, file_path, chunks)
    else:
        for child in node.children:
            walk_tree(child, source_bytes, file_path, chunks)

def extract_chunks(source_code: str, file_path: str) -> list[dict]:
    source_bytes = source_code.encode("utf-8")
    tree = parser.parse(source_bytes)
    chunks = []
    walk_tree(tree.root_node, source_bytes, file_path, chunks)
    return chunks
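
The chunker above does not cover the statement-level split for oversized functions. One way to sketch it is shown below, using Python's standard-library `ast` module instead of tree-sitter so the example runs with no dependencies. The line budget stands in for a token budget, and repeating the signature at the top of every piece is a design assumption, not something the pipeline requires.

```python
import ast

def split_function(source: str, max_lines: int = 20) -> list[str]:
    """Split the first function in `source` into statement-aligned pieces.

    Each piece repeats the function signature so it stays interpretable
    on its own. Assumes the body starts on the line after the signature.
    """
    func = next(n for n in ast.parse(source).body
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)))
    lines = source.splitlines()
    header = lines[func.lineno - 1 : func.body[0].lineno - 1]
    pieces, current = [], []
    for stmt in func.body:
        stmt_lines = lines[stmt.lineno - 1 : stmt.end_lineno]
        # Start a new piece rather than cut a statement in half.
        if current and len(current) + len(stmt_lines) > max_lines:
            pieces.append("\n".join(header + current))
            current = []
        current.extend(stmt_lines)
    pieces.append("\n".join(header + current))
    return pieces

SOURCE = '''def process(order):
    validate(order)
    total = 0
    for item in order.items:
        total += item.price
    charge(order.card, total)
    return total
'''
pieces = split_function(SOURCE, max_lines=3)
```

A single statement longer than the budget (a very large loop, for example) still becomes one oversized piece here. Handling that case means recursing into the statement's own body.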

Embedding model selection

For a production code assistant, the embedding model choice matters significantly. Three strong options:

Model	Context window	Strengths	When to use
`voyage-code-3`	32K tokens	Purpose-built for code; top-ranked on code retrieval benchmarks (2025)	Production code assistant, maximum retrieval quality
`text-embedding-3-large`	8K tokens	Strong general performance; well-supported; large community	Mixed code + documentation retrieval; existing OpenAI integrations
`nomic-embed-code`	8K tokens	Open-weight; can run locally; no API cost	Air-gapped environments; cost-sensitive deployments; on-prem

Vector store

For a single-developer or small-team tool: pgvector in a local Postgres instance is often sufficient. For a service handling multiple users: Qdrant is a strong choice — it supports both dense and sparse vectors in a single collection, enabling native hybrid search without maintaining two separate stores.

BM25 index

For pure BM25, Tantivy (a Rust-based full-text search library modeled on Lucene) or Elasticsearch's BM25 are the standard options. For a simpler deployment, the Python rank_bm25 library is adequate for corpora under ~50,000 chunks.
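
To make the scoring concrete, here is a small pure-Python BM25 sketch with a code-aware tokenizer that splits snake_case and camelCase identifiers. It uses the Lucene-style IDF variant and default-looking k1 and b values, both assumptions. It is meant for illustration, not as a replacement for any of the libraries above.

```python
import math
import re
from collections import Counter

# Matches camelCase parts ("User"), lowercase runs ("get"), and acronyms ("HTTP").
TOKEN = re.compile(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])")

def code_tokens(text):
    """Split identifiers into lowercase sub-tokens."""
    return [t.lower() for t in TOKEN.findall(text)]

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [code_tokens(d) for d in docs]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        df = Counter(term for counts in self.tf for term in counts)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        doc_len = len(self.doc_tokens[i])
        total = 0.0
        for term in code_tokens(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_n=3):
        order = sorted(range(len(self.doc_tokens)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:top_n]

docs = [
    "def getUserById(user_id): return db.users.find(user_id)",
    "def parse_config(path): return yaml.safe_load(open(path))",
    "MAX_RETRIES = 5",
]
index = BM25(docs)
```

Because the tokenizer splits on case and underscores, a query written as `get_user_by_id` still matches a function named `getUserById`, which plain whitespace tokenization would miss.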

The prompt injection template

Placement and formatting of retrieved context matters. A template that works well in practice:

You are a coding assistant for this codebase.

## Relevant context from the codebase:

### [payments/gateway.py · lines 42–87]
```python
{chunk_1_text}
```

### [payments/exceptions.py · lines 1–24]
```python
{chunk_2_text}
```

### [payments/models.py · lines 88–112]
```python
{chunk_3_text}
```

## Current task:
{user_request}

Including file path and line numbers in each chunk header gives the model two useful signals: the module structure of the codebase, and the ability to reference specific locations in its response. These headers cost very few tokens but significantly improve the quality of generated code that needs to import from or reference the retrieved files.
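
Filling the template from retrieved chunks is mechanical. The sketch below assumes each chunk is a dict with the `text`, `file`, `start_line`, and `end_line` keys produced by the chunker earlier. `render_prompt` and the `FENCE` constant are hypothetical names.

```python
FENCE = "`" * 3  # built at runtime to avoid nesting literal fences

def render_prompt(chunks, user_request, language="python"):
    """Assemble the prompt: one labeled, fenced section per retrieved chunk."""
    parts = [
        "You are a coding assistant for this codebase.",
        "",
        "## Relevant context from the codebase:",
        "",
    ]
    for chunk in chunks:
        header = (f"### [{chunk['file']} · "
                  f"lines {chunk['start_line']}–{chunk['end_line']}]")
        parts += [header, f"{FENCE}{language}", chunk["text"], FENCE, ""]
    parts += ["## Current task:", user_request]
    return "\n".join(parts)

prompt = render_prompt(
    [{"file": "payments/gateway.py", "start_line": 42, "end_line": 87,
      "text": "def charge(card, amount): ..."}],
    "Add retry logic to charge()",
)
print(prompt)
```

Putting the task last keeps the user's request at the end of the context, where the position bias discussed earlier works in its favor.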

Do not retrieve more than you need

It is tempting to inject 10–15 retrieved chunks to "give the model more information." Resist this. Each additional chunk increases the context size (paying the quadratic cost discussed in Section 4), increases the probability that the model attends to irrelevant material, and reduces the proportion of the context that is highly relevant. In practice, 3–5 high-quality chunks typically outperform 15 lower-quality ones. Invest in retrieval quality, not retrieval quantity.

The Through-Line

The surprising thing about attention dilution is that it isn't a bug you can patch. It's a structural property of softmax normalization — the total attention weight sums to 1.0 regardless of sequence length, so every token you add is competing with every other for a share of that budget. More context doesn't mean more understanding; it means each fact gets a smaller slice. The lost-in-the-middle position bias makes it worse: code injected into the middle of a long prompt is structurally disadvantaged by both RoPE's distance decay and the recency bias that causal pretraining instills. Knowing this changes how you think about the whole problem.
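
A toy calculation makes the budget argument tangible. It assumes one relevant token whose attention logit sits a fixed gap above every other token, and that all the other tokens share the same logit. Real attention is far less uniform, so treat this as intuition, not a model of any particular network.

```python
import math

def relevant_weight(n_tokens, logit_gap=4.0):
    """Softmax weight on one relevant token among n_tokens.

    The relevant token's logit is `logit_gap` above the n_tokens - 1
    distractors, which are assumed to all have the same logit.
    """
    boost = math.exp(logit_gap)
    return boost / (boost + (n_tokens - 1))

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> weight {relevant_weight(n):.5f}")
```

Even with a healthy logit advantage, the relevant token's share falls roughly tenfold for every tenfold increase in context length, because the denominator grows with every token added.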

RAG doesn't solve attention dilution — it sidesteps it. Instead of sending everything and hoping the model finds what's relevant, it figures out what's relevant first and sends only that. The context window ends up containing what actually matters for the task: the right type definitions, the right helper functions, the right error handling patterns. The model has a real shot at using them.

In practice: below roughly 3,000–5,000 lines, context stuffing usually works well enough. You can fit the most relevant files in a 32K window and the model finds what it needs. Above that, the problems stack up fast. At 50,000+ lines, naive stuffing reliably hurts. At 500,000+ lines, it breaks down completely — the most relevant file is just one voice among hundreds. At that scale, AST chunking, hybrid BM25 + dense retrieval, RRF fusion, and careful prompt injection aren't premature optimization. They're the baseline.

RAG alone does not solve the session boundary problem

Code RAG retrieves relevant chunks within a session. But when a session ends, the AI's working understanding of the codebase — which files matter, which edge cases were found, what is still missing — is gone. The next session re-discovers from scratch. A two-phase benchmark run against the Apache Camel codebase (5,856 files, unfamiliar enterprise Java) demonstrated this concretely: the vanilla agent spent 51 tool calls re-exploring in Phase 2 and produced 0 bytes on one task. The same task with cross-session memory (structured notes stored during Phase 1 research, recalled at Phase 2 start) produced a complete 5-file implementation at −58% Phase 2 cost. Code RAG and working memory are complementary: RAG retrieves the right code at query time; memory preserves what was learned across the session boundary.

References & Further Reading

Foundational Papers

Attention Is All You Need — Vaswani et al., 2017. NeurIPS. The original transformer paper introducing scaled dot-product attention.
Lost in the Middle: How Language Models Use Long Contexts — Liu et al., 2023. Stanford / Berkeley. Empirical study of U-shaped attention bias and the 30% accuracy drop at mid-context positions.
Found in the Middle: Calibrating Positional Attention Bias — He et al., 2024. UW / MIT / Google. Proposed calibration method that partially corrects RoPE position bias at inference time.
Retrieval-Augmented Code Generation: A Survey — 2025. Comprehensive survey of RAG approaches specifically for code generation and repository-level tasks.

RAG & Retrieval

RAG vs Large Context Window: Real Trade-offs for AI Apps — Redis Engineering Blog. Practical comparison of when RAG outperforms long-context stuffing in production systems.
Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents — Shaped AI. Argues that quality of context selection, not quantity, is the leverage point for agent performance.
RAG for LLM Code Generation using AST-Based Chunking — Vishnudhat Natarajan, Medium. Practical walkthrough of tree-sitter based chunking for Python codebases.
Better Retrieval Beats Better Models for Large Codebases — Stéphane Derosiaux. AST chunking and hierarchical indexing outperform model scaling when retrieval is the bottleneck.

Code Assistants & Architecture

How Cursor Actually Indexes Your Codebase — Towards Data Science. Detailed reverse-engineering of Cursor's RAG indexing architecture, Turbopuffer usage, and privacy model.
How GitHub Copilot Works — Quastor Engineering. Technical breakdown of Copilot's context assembly, Jaccard-similarity open-tab heuristic, and prompt construction.
What is Retrieval-Augmented Generation? — GitHub Blog. GitHub's official explanation of RAG applied to code and documentation retrieval.
Why Cursor, Claude Code, and Devin Use grep, Not Vectors — MindStudio. Argues the case for structured search (grep, AST navigation) over pure semantic retrieval for code agents. A useful counterpoint.

Hybrid Search & Ranking

BM25 vs Dense Retrieval for RAG: What Actually Breaks in Production — Ranjan Kumar. Empirical breakdown of when BM25 outperforms dense retrieval and vice versa, with production failure examples.
Hybrid Search: BM25 and Dense Retrieval Combined — Michael Brenndoerfer. Interactive explainer of RRF and weighted score combination for hybrid search.
