Building Vectr — Part 1 of 3
Why grep fails when you don't know the keywords — and how AST chunking, hybrid BM25+vector search, and a symbol graph fix it.
You get dropped into an unfamiliar codebase. Not a toy project — real production code, 8,000 files, three years of accumulated complexity and clever abstractions. Your job is to fix a bug in the request validation pipeline. What does an AI code editor do next?
This post is about a problem I kept running into, a tax I kept paying, and the indexing system I built to eliminate it. It covers the technical decisions behind Vectr's search layer: why naive chunking produces bad embeddings, how tree-sitter solves the code-parsing problem, what BM25 does that vector search can't, and why you need a symbol graph for questions that text search cannot answer at all.
The Re-discovery Tax
If you're a human engineer navigating an unfamiliar codebase, here's what you probably do: you ask someone who knows it, or you grep for the error message, or you open the entry point and follow imports until you find the thing. Your brain does semantic compression the whole way — building a model of the system, discarding noise, following intuitions about where complexity tends to live. By the time you've read 20 files, you have a rough map that persists across days and sessions.
An AI code editor has the same tools — read files, run shell commands, grep — but completely different economics. Every Read call costs tokens. Every Bash call for grep costs a turn. Unlike a human who can skim-read at 1,000 words per minute and discard irrelevant content almost for free, an AI editor pays full price for every character it reads: it sits in the context window whether or not it was useful. Read the wrong 500-line file and you've burned context that could have held the answer.
The result, on unfamiliar codebases, is what I started calling the re-discovery tax: a cluster of navigation calls at the start of every session, before any actual implementation begins, spent on figuring out where things are. And because AI editors have no persistent memory between sessions, they pay this tax again and again — every session, on the same codebase.
In benchmarks I ran against real open-source codebases (more detail in Part 3), the re-discovery tax on CPython internals ranged from 6 to 23 tool calls per task before the first file write. Some sessions spent more turns navigating than implementing.
The re-discovery tax is paid every session, not once. A human engineer's mental map of a codebase accumulates and compounds. An AI editor's map is fully rebuilt from scratch at the start of each session. The economic gap widens as the codebase grows.
Why grep Fails at the Boundary of Your Knowledge
Before explaining what I built, I want to be precise about where grep breaks down — because "just use grep" is the natural reaction, and it's not obviously wrong until you try to use it systematically on unfamiliar code.
grep is a brilliant tool for confirming hypotheses you already have. If you know what you're looking for, it's nearly perfect. The problem is the case that isn't really an edge case: you don't know what you're looking for.
Say you're trying to understand how a Django application validates incoming JSON payloads before they hit the ORM layer. You might grep for validate. You'll get 200 results across 40 files — field validators, form validators, configuration validators, test fixtures. None of them are obviously the thing you want. You grep for json.loads. You get 30 results. You grep for request.data. That gets you closer, maybe. But you spent four greps and 15 minutes before you found the right file.
The deeper problem: grep requires you to already have a mental model of the codebase's naming conventions. An AI editor running on an unfamiliar codebase doesn't know whether payload validation is called validate_payload, check_request, parse_input, or _pre_process. It can guess, run four greps, and maybe find it. Or it can read files top-down until it finds something relevant. Both strategies are expensive in tokens and turns.
Think of keyword search as asking for directions by street name in a city you've never visited. "Where is Maple Street?" gets a precise answer. But "where is the street with the good coffee shop near the park?" — keyword search has nothing to offer. You need a different kind of index: one that understands what places are for, not just what they're called.
Semantic search inverts this. It maps your query and every code chunk into the same high-dimensional vector space, then finds the chunks closest to your query by meaning, regardless of whether they share any words. "JWT validation logic" finds verify_token even if none of the query's words appear in the function body.
That's the core idea. But making it work on real code — not toy examples — requires solving several problems that aren't obvious from the outside. The rest of this post is about those problems and the design choices that resolve them.
The Chunking Problem: Why Line Windows Break on Code
Prose text has a natural unit of meaning: the paragraph. You can split a Wikipedia article into 200-word chunks, embed each one, and get a reasonable search system. Code doesn't work this way.
The standard naive approach for code indexing is the same line-window strategy borrowed from document search: take a sliding window of N lines with M lines of overlap, create a chunk, embed it, move the window. A common default might be 150-line windows with 50 lines of overlap. Simple, language-agnostic, works on any file format.
The problem is what happens at the window boundaries. Consider this function:
def process_workspace_changes(
    path: Path, db: Database, *, force: bool = False
) -> list[ChangeResult]:
    """Process all pending changes in a workspace, optionally forcing re-indexing."""
    pending = db.get_pending_changes(path)
    if not pending and not force:
        return []
    results = []
    for change in pending:
        if change.kind == ChangeKind.DELETED:
            db.remove_chunks_for_file(change.file)
            results.append(ChangeResult(file=change.file, status="removed"))
        elif change.kind in (ChangeKind.MODIFIED, ChangeKind.CREATED):
            chunks = chunk_file(change.file, db.language_for(change.file))
            db.upsert_chunks(chunks)
            results.append(ChangeResult(
                file=change.file, status="indexed", chunk_count=len(chunks)
            ))
    db.mark_changes_processed(path)
    return results
If a 150-line window happens to cut through this function — say, the signature and first few lines fall in chunk N and the body falls in chunk N+1 — neither chunk is independently meaningful. The chunk with just the body is missing the parameter names and return type. The chunk with just the signature has no implementation context. The embedding of a half-function is significantly worse than the embedding of the complete thing — it cannot capture what the function actually does, because it doesn't have the full picture.
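To make the boundary problem concrete, here is a minimal sketch of the line-window strategy described above, using the 150/50 defaults from the text. The helper names (`line_windows`, `windows_containing`) and the 400-line file are hypothetical, chosen only for illustration.

```python
def line_windows(lines: list[str], size: int = 150, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) line ranges, end exclusive, for a sliding window."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    windows = []
    start = 0
    while start < len(lines):
        end = min(start + size, len(lines))
        windows.append((start, end))
        if end == len(lines):
            break
        start += step
    return windows


def windows_containing(windows, first_line: int, last_line: int):
    """Windows that hold the whole span [first_line, last_line] (0-based, inclusive)."""
    return [(s, e) for s, e in windows if s <= first_line and last_line < e]


# A hypothetical 400-line file.
file_lines = [f"line {i}" for i in range(400)]
windows = line_windows(file_lines)

# A 20-line function fits inside one window because it is shorter than the overlap.
short_function = windows_containing(windows, 140, 159)
# A 60-line function starting at line 195 is not whole in any window.
long_function = windows_containing(windows, 195, 254)
```

The overlap only rescues functions shorter than the overlap itself. Anything longer can be split, and even a function that survives intact shares its chunk with over a hundred unrelated lines.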
The fix is obvious in retrospect: split at semantic boundaries. Functions should be complete units. Classes should contain their methods, or each method should be its own chunk with the class header prepended for context. The fundamental question — "what is the smallest independently meaningful unit of code in this language?" — has a clear answer: it's the function or method. But extracting those boundaries programmatically requires actually parsing the code, not pattern-matching on indentation or blank lines.
An embedding model compresses everything in its context into a single fixed-size vector. A complete function gives the model everything it needs to capture the function's purpose, parameters, return behavior, and side effects in that vector. A half-function forces the model to compress an ambiguous fragment whose meaning depends on context it doesn't have. The resulting vector is a blurred average of possible interpretations — it matches nothing precisely.
Parsing Code with tree-sitter
tree-sitter is a parser library that produces concrete syntax trees for source code — richer than what people sometimes loosely call an AST, but the key property is the same: every construct in the language has a named node with exact byte boundaries in the source. Unlike a regex-based approach that pattern-matches common function signatures, tree-sitter actually parses the grammar of each language and handles edge cases correctly: nested functions, decorators on multiple lines, multiline function signatures, arrow functions in JavaScript, generic bounds in Rust, anonymous classes in Java.
For each supported language, a tree-sitter grammar defines the language's syntactic structure. At indexing time, Vectr runs a language-appropriate query against each file's syntax tree to find all function and class definitions. For Python, the query looks like this:
(function_definition
  name: (identifier) @name
  parameters: (parameters) @params
  body: (block) @body) @function

(class_definition
  name: (identifier) @name
  superclasses: (argument_list)? @bases
  body: (block) @body) @class
This query matches any function or class definition anywhere in the file — at the module level, inside a class, inside another function — and captures the name, parameters, and body as named nodes with precise byte-range positions in the source. You can then slice the original source file at those byte positions to extract complete, syntactically valid chunks.
The practical result: instead of chunks that contain partial functions and random intermediate lines, every chunk is a syntactically complete, independently meaningful unit of code. For classes, Vectr attaches the full class signature — including the base class list captured by @bases — as a header to each method chunk. So the chunk for WorkspaceLock.acquire() includes not just the class name but its inheritance context, which matters for search: a method of AuthenticatedView(LoginRequiredMixin, View) has a meaningfully different semantic context than a method of a plain View.
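The same idea can be sketched without tree-sitter. The example below swaps in Python's standard-library `ast` module, which only parses Python, so it is a stand-in for the technique and not Vectr's implementation. The `Chunk` type and `chunk_python_source` function are hypothetical names, and the sketch assumes the class header fits on one line.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    name: str        # qualified name, e.g. "WorkspaceLock.acquire"
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str        # class header (if any) followed by the definition source


def chunk_python_source(source: str) -> list[Chunk]:
    """One chunk per function or method; methods get their class header prepended."""
    lines = source.splitlines()
    chunks: list[Chunk] = []

    def visit(node: ast.AST, prefix: str, header: Optional[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Simplification: take only the first line of the class statement.
                class_header = lines[child.lineno - 1].strip()
                visit(child, f"{prefix}{child.name}.", class_header)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Decorators sit above child.lineno, so start at the earliest one.
                first = min([child.lineno] + [d.lineno for d in child.decorator_list])
                body = "\n".join(lines[first - 1 : child.end_lineno])
                text = f"{header}\n{body}" if header else body
                chunks.append(Chunk(f"{prefix}{child.name}", first, child.end_lineno, text))
                # Nested functions are left inside their parent's chunk.
            else:
                # Descend into if/try/with blocks that may hold definitions.
                visit(child, prefix, header)

    visit(ast.parse(source), "", None)
    return chunks


example = '''\
class WorkspaceLock(BaseLock):
    def acquire(self, timeout: float = 5.0) -> bool:
        return self._try_lock(timeout)

    def release(self) -> None:
        self._unlock()


def helper():
    return 1
'''
chunks = chunk_python_source(example)
```

Each method chunk starts with `class WorkspaceLock(BaseLock):`, so the inheritance context travels with the method into the embedding.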
What about languages without tree-sitter grammars?
tree-sitter has grammars for most major programming languages: Python, JavaScript, TypeScript, Java, Go, Rust, C, C++, Ruby, PHP, and many more. For languages where grammars aren't available — rare proprietary DSLs, some configuration formats that don't parse as a programming language — Vectr falls back to a line-window approach with larger windows and more overlap. The quality is lower, but coverage is maintained. In practice, the fallback is invoked for fewer than 2% of files in any codebase I've indexed.
AST-aware chunking breaks down for functions that are genuinely enormous — 500+ lines, not uncommon in legacy code or generated parsers. A single function that large will produce an embedding that tries to capture too many things at once and captures none of them well. Vectr handles this by further splitting large functions at their major control-flow boundaries (loop bodies, large conditional branches) when the function exceeds a configurable line threshold (default: 200 lines). The resulting sub-chunks each include the function signature as a header to preserve context. Their embedding quality is better than one giant embedding, though still lower than a naturally small function — the model is at the edge of what it can compress into 768 dimensions.
This chunking strategy — complete syntactic units, class headers attached to method chunks, fallback for unparseable languages — is the single biggest quality improvement in Vectr's indexing pipeline compared to a naive line-window baseline. Everything downstream (embedding quality, search relevance, working memory precision) depends on the chunks being meaningful units.
Code-Specific Embeddings Running Locally
Not all embedding models are equally good at code. Models trained primarily on prose text — blog posts, books, Wikipedia — have learned representations of natural language semantics. They encode meaning in terms of topics, concepts, and linguistic relationships. Code has different regularities: symbol names, type signatures, control flow patterns, API call chains.
A general-purpose embedding model will encode def authenticate_user(token: str) -> Optional[User] and "user login validation" somewhat close together in vector space — but not as close as a model trained on millions of similar functions, which has learned the specific patterns that code uses to express authentication logic. The difference in retrieval quality on code search benchmarks is significant: code-aware models routinely outperform general-purpose models by 10–20% on tasks like "find the function that handles X."
Vectr uses Snowflake/snowflake-arctic-embed-m-v1.5, a 110-million-parameter model that produces 768-dimensional embedding vectors and runs in under 100ms per batch on a modern laptop CPU. It is optimized for retrieval tasks across both natural language and code, and its size keeps local inference practical without sacrificing too much quality.
Two practical constraints drove this choice. First, cost: a tool that fires 20–50 search calls per session would accumulate non-trivial API costs quickly — especially for power users running it all day. Local inference is free at query time after the one-time model download. Second, and more importantly: a lot of codebases cannot be sent to third-party APIs. Internal tools, proprietary algorithms, customer data models, security infrastructure — many organizations have policies (or contractual obligations) that prohibit sending source code to external services. Local inference removes this constraint entirely, making Vectr usable in environments where a cloud API is not.
The tradeoff is a one-time setup cost: the model weighs roughly 440MB and needs to be downloaded on first run. This is a real friction point — some users hit it and bounce. It's a known issue. The alternative (mandating API access) would exclude a large fraction of the most valuable use cases.
One detail worth noting about how the embedding is structured: the query and each code chunk are embedded in asymmetric modes. Queries use a "Represent this query for searching relevant code:" prefix, while code chunks are embedded with a "Represent this code snippet:" prefix. This is a property of arctic-embed-m's training regime: the model is a single encoder (not two separate models), but it was trained with different input prefixes for query-side and document-side inputs so that the resulting vectors occupy compatible but distinct regions of the same embedding space. Using the wrong prefix reduces the cosine similarity between semantically related query-chunk pairs: the vectors for "user authentication" and verify_token end up further apart in embedding space than they should be, which measurably degrades retrieval quality.
arctic-embed-m is a single encoder trained with different input prefixes for queries vs. documents. Embedding a query without its prefix, or a chunk with the query prefix, produces vectors that are technically in the same 768-dimensional space but oriented incorrectly relative to each other — the cosine similarities you compute will not reflect true semantic similarity. Always check the model card for the exact prefix strings. Getting this wrong costs 5–15% in retrieval quality, which is invisible until you run a structured benchmark.
Hybrid Search: Why BM25 and Vector Search Need Each Other
Vector search finds chunks that are semantically similar to your query. It handles natural language queries ("JWT validation"), concept queries ("connection pooling logic"), and paraphrases well. But it has a well-known weakness: it handles exact symbol names poorly.
If you search for _handle_workspace_lock_conflict — an exact function name — a vector search might not rank it first. The embedding of that function name is just one point in a crowded neighborhood of similar-looking function names in the vector space. They all compete, and the margin between them is small. Meanwhile, BM25 will find it immediately — it's a keyword search algorithm, and exact string matches get the highest possible score.
The inverse is also true: BM25 cannot find "retry logic with exponential backoff" if the function is called _schedule_attempt_with_delay and its docstring says nothing about backoff. Zero keyword overlap means zero BM25 score. Vector search finds it because the semantic cluster it belongs to is close to the query in embedding space.
The right system uses both. Every query in Vectr runs both a vector search and a BM25 search in parallel, then combines the two ranked lists using a weighted formula. The weight assigned to each approach depends on a characterization of the codebase and query context:
| Situation | BM25 weight | Vector weight | Rationale |
|---|---|---|---|
| Large unfamiliar codebase, first session | Low (0.2) | High (0.8) | You don't know the naming conventions yet. Semantic matching is more likely to find the right function than an exact-name guess. |
| Small codebase you know well | High (0.7) | Low (0.3) | You probably know the symbol names. BM25 is faster and more precise on exact lookups. Semantic fuzziness adds noise. |
| Explicit symbol name in query | High (0.8) | Low (0.2) | If the query contains what looks like a symbol name (camelCase, snake_case, PascalCase), weight BM25 heavily — you probably mean that exact symbol. |
| Natural language concept query | Low (0.2) | High (0.8) | No exact symbol to match. Vector search handles concept queries well. BM25 will mostly return irrelevant exact matches on common words. |
BM25 scores a query Q against a document D as a sum over query terms:
score(D, Q) = Σᵢ IDF(qᵢ) · [ tf(qᵢ, D) · (k₁ + 1) ] / [ tf(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) ]
Where tf(qᵢ, D) is the count of term qᵢ in document D (term frequency). IDF(qᵢ) is the inverse document frequency — how rare the term is across all N documents in the corpus:
IDF(qᵢ) = log( (N − nᵢ + 0.5) / (nᵢ + 0.5) )
where nᵢ is the number of documents containing qᵢ. This is the Robertson–Sparck Jones variant; some implementations add +1 inside the log to prevent negative IDF values on terms that appear in more than half the corpus. |D| is the length of the document in tokens. avgdl is the average document length across the corpus. k₁ (typically 1.5) controls term-frequency saturation: doubling tf does not double the score, which prevents a document that repeats a query term 100 times from dominating. b (typically 0.75) controls length normalization: a long document with one occurrence of a term gets penalized relative to a short document with the same occurrence.
The key insight: BM25 rewards short, dense documents with exact query terms. A tight, well-named 20-line function that contains the exact symbol you searched for will score higher than a 200-line file that mentions the term incidentally. This is exactly what you want for code search.
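The formula is short enough to write out directly. This is a minimal sketch of the scoring function above with the same k₁ and b defaults. It is not the rank-bm25 code, and it assumes documents arrive already tokenized. The `bm25_scores` name and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every tokenized document against the query with the formula above."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n_i: number of documents that contain each term at least once
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            n_i = doc_freq[term]
            idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["def", "acquire_lock", "path", "return", "lock"],   # short, focused function
    ["def", "process", "acquire_lock"] + ["x"] * 50,      # long body, one mention
    ["def", "parse", "config"],
    ["def", "render", "template"],
    ["def", "load", "settings"],
]
scores = bm25_scores(["acquire_lock"], docs)
```

Both of the first two documents contain the term exactly once, but the short one scores higher because of the length normalization controlled by b.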
This combination isn't a magic formula. It's an approximation that works better than either approach alone in practice. The benchmark on the Apache Camel codebase (58,000+ Java files) showed a 73% reduction in Read+Bash navigation calls compared to the baseline AI editor with no index. A purely semantic search would have missed exact class names that had already been established as part of the task context. A purely keyword search would have missed the conceptual queries that came next. The specific weights in the table above are the values actually used in Vectr's implementation — they are a starting point, not a theoretically derived optimum. I tuned them against the benchmark dataset and they haven't been revisited since.
The codebase characterization heuristic — large + unfamiliar = semantic-heavy, small + familiar = BM25-heavy — is computed from file count, total token count, and how recently the index was last queried. It's a proxy for "how likely is it that the user knows the naming conventions?" A better signal would come from explicit feedback (did the user click through on the result?), but that requires session tracking that I haven't built yet.
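A rough sketch of the merge step, using the weights from the table. Several details here are assumptions and not confirmed behavior: min-max normalization, the regex used to spot symbol-like tokens, the priority order between table rows, and all function names.

```python
import re

# Assumed heuristic for "looks like a symbol name": snake_case, camelCase, PascalCase.
SYMBOL_RE = re.compile(
    r"\b(?:"
    r"_*[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+"   # snake_case, optional leading underscore
    r"|[a-z]+(?:[A-Z][a-z0-9]*)+"         # camelCase
    r"|(?:[A-Z][a-z0-9]+){2,}"            # PascalCase
    r")\b"
)


def pick_weights(query: str, large_unfamiliar: bool) -> tuple[float, float]:
    """Return (bm25_weight, vector_weight) following the table above."""
    if SYMBOL_RE.search(query):
        return 0.8, 0.2          # explicit symbol name in query
    if large_unfamiliar:
        return 0.2, 0.8          # large unfamiliar codebase
    return 0.7, 0.3              # small codebase you know well


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two lists are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(bm25: dict[str, float], vector: dict[str, float],
         w_bm25: float, w_vec: float, top_k: int = 5) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; a chunk missing from one list gets 0 there."""
    nb, nv = normalize(bm25), normalize(vector)
    combined = {
        cid: w_bm25 * nb.get(cid, 0.0) + w_vec * nv.get(cid, 0.0)
        for cid in nb.keys() | nv.keys()
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical raw scores keyed by chunk id.
bm25_hits = {"a": 12.0, "b": 3.0}
vector_hits = {"b": 0.82, "c": 0.80, "a": 0.40}

symbol_query = fuse(bm25_hits, vector_hits, *pick_weights("_handle_workspace_lock_conflict", True))
concept_query = fuse(bm25_hits, vector_hits, *pick_weights("retry logic with exponential backoff", True))
```

With the same two candidate lists, the symbol-style query ranks the BM25 favorite first and the concept query ranks the vector favorite first, which is the behavior the table is meant to produce.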
The Symbol Graph: What Text Search Cannot Answer
Semantic search and BM25 handle "find me the code for this concept" well. But there's a different navigation pattern that neither handles: "find me everything that calls this function."
Call graphs are a fundamentally different kind of knowledge. They describe relationships between code entities — which function calls which, which class imports which module, which HTTP route maps to which handler. You can't answer "who calls process_workspace_changes?" with a text search, because the callers don't contain that function's body — they just contain a call to it by name, somewhere in their own bodies.
Vectr builds a symbol graph during indexing. For each file, tree-sitter extracts four kinds of information:
What the symbol graph captures
| Graph element | What it captures | Example query enabled |
|---|---|---|
| Definitions | Every function, class, method, and module-level constant defined in the file, with name and line number | vectr_locate("WorkspaceLock") → resolver.rs:214 |
| Call edges | Every call site, mapping callee name to the calling function's context | vectr_trace("acquire_lock") → callers + callees |
| Import edges | Every import statement, mapping the imported symbol to its likely source module | Resolve where from workspace import LockManager comes from |
| HTTP routes | Flask/FastAPI @router.get(), Express app.post(), Spring @GetMapping, extracted as named symbols | vectr_locate("POST /api/workspace") → handler function |
The resulting graph enables exact lookups that are categorically different from text search. vectr_locate("WorkspaceLock") returns a file path and line number in under 10ms — no embedding, no ranking, pure symbol table lookup. vectr_trace("acquire_lock") returns all callers and all callees in one round-trip. These are not search results — they are graph traversals, and they produce exact answers rather than relevance rankings.
HTTP route indexing deserves a note: treating GET /api/workspace/{id}/status as a first-class named symbol means an AI editor can navigate directly from an API endpoint to its handler — a common navigation need that would otherwise require grepping for the route string across all route definition files.
These are not competing approaches — they answer different questions. "Find code that does X" is a search problem. "Find who calls Y" or "find where Z is defined" is a graph traversal problem. A good code navigation system needs both. Relying only on text search for definition lookups is like looking up a phone number by describing the person rather than looking them up by name.
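To show why callers are a lookup and not a search, here is a small sketch that builds definitions and call edges for Python sources. It uses the standard-library `ast` module in place of tree-sitter and matches calls by bare name with no scope or import resolution, so it is an illustration of the data structure and not Vectr's extractor. `build_symbol_graph` and the two sample files are hypothetical.

```python
import ast
from collections import defaultdict


def build_symbol_graph(files: dict[str, str]):
    """Return (definitions, callers, callees) for a mapping of path -> source."""
    definitions: dict[str, tuple[str, int]] = {}      # name -> (file, line)
    callers: dict[str, set[str]] = defaultdict(set)   # callee -> calling functions
    callees: dict[str, set[str]] = defaultdict(set)   # function -> names it calls
    for path, source in files.items():
        for func in ast.walk(ast.parse(source)):
            if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            definitions[func.name] = (path, func.lineno)
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    target = node.func
                    # foo(...) gives "foo"; obj.foo(...) also gives "foo"
                    if isinstance(target, ast.Name):
                        name = target.id
                    else:
                        name = getattr(target, "attr", None)
                    if name:
                        callers[name].add(func.name)
                        callees[func.name].add(name)
    return definitions, callers, callees


files = {
    "workspace.py": (
        "def acquire_lock(path):\n"
        "    return open_lock_file(path)\n"
        "\n"
        "def open_lock_file(path):\n"
        "    return path\n"
    ),
    "indexer.py": (
        "from workspace import acquire_lock\n"
        "\n"
        "def run_index(path):\n"
        "    lock = acquire_lock(path)\n"
        "    return lock\n"
    ),
}
definitions, callers, callees = build_symbol_graph(files)
```

Once the edges exist, "who calls acquire_lock?" is a dictionary read: `callers["acquire_lock"]`. No ranking is involved, so the answer is either present or absent.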
Six Fallback Strategies in vectr_locate
In practice, users don't always know the exact symbol name. Maybe you know it as WorkspaceLockManager but the actual class is WorkspaceLock. Maybe you know the module but not the fully qualified name. vectr_locate runs six fallback strategies in sequence, stopping at the first match:
1. Exact match: Direct lookup in the symbol table. Sub-millisecond. If this succeeds, the other strategies are not run. Confidence: highest.
2. Suffix match: Lock matches WorkspaceLock, AcquireLock, LockManager. Useful when you remember a class's core name but not its module-scoped prefix. Returns all matches if multiple exist.
3. Same-module priority: If a caller file is provided (optional context), search definitions within the same module first. Captures the common pattern where you follow an import to a neighboring file.
4. Unique name: If there is exactly one symbol across the entire codebase whose name contains your query string, return it. Useful for rare internal symbols that have no exact-match competition.
5. Import chain follow: Follow import statements from a given file to find where the name likely comes from. Useful for resolving third-party library re-exports and module aliases.
6. Fuzzy (Levenshtein ≤ 2): Levenshtein edit distance ≤ 2 across all symbol names. Catches typos and near-misses. Lowest confidence; the caller should verify before acting on a fuzzy result.
Each strategy produces a LocateResult with a resolution_strategy field, so the caller knows how confident the match is. An exact match means you can act on the result immediately. A fuzzy match with edit distance 2 means you should verify before relying on it. This distinction matters in practice — a silent wrong navigation is worse than no navigation at all.
An AI editor that acts on a fuzzy match without knowing it's fuzzy will navigate to the wrong function and either read irrelevant code (wasted tokens) or, worse, make a change to the wrong function (introduced bug). Surfacing the confidence level is not a nice-to-have — it's a correctness requirement.
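The fallback chain can be sketched as a sequence of early returns. The version below covers four of the six strategies (exact, suffix, unique name, fuzzy) and leaves out same-module priority and import-chain following, which need file context. `LocateResult` and `resolution_strategy` come from the description above; the strategy strings, the `locate` signature, and the symbol table are assumptions.

```python
from dataclasses import dataclass


@dataclass
class LocateResult:
    name: str
    file: str
    line: int
    resolution_strategy: str  # tells the caller how far to trust the match


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def locate(query: str, symbols: dict[str, tuple[str, int]]) -> list[LocateResult]:
    """Try each strategy in order and stop at the first one that matches."""

    def results(names, strategy):
        return [LocateResult(n, *symbols[n], strategy) for n in sorted(names)]

    if query in symbols:
        return results([query], "exact")
    suffix = [n for n in symbols if n.endswith(query)]
    if suffix:
        return results(suffix, "suffix")
    containing = [n for n in symbols if query in n]
    if len(containing) == 1:
        return results(containing, "unique_name")
    fuzzy = [n for n in symbols if levenshtein(query, n) <= 2]
    return results(fuzzy, "fuzzy")


# Hypothetical symbol table: name -> (file, line).
symbols = {
    "WorkspaceLock": ("resolver.rs", 214),
    "acquire_lock": ("workspace.py", 89),
    "ConfigLoader": ("config.py", 12),
}
typo = locate("WorkspaceLokc", symbols)
```

The caller can branch on `typo[0].resolution_strategy`: an "exact" result can be acted on, while a "fuzzy" one should be confirmed by reading the definition first.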
mtime Cache and Incremental Re-indexing
The first time you run vectr start on a large codebase, indexing takes time. CPython's 4,000+ Python and C files: about 8 minutes on a modern laptop. Django's ~1,800 Python files: about 2 minutes. Apache Camel's 58,000+ Java files: closer to 45 minutes.
You obviously can't re-index the entire codebase every session. The solution is incremental re-indexing with an mtime cache.
During initial indexing, Vectr writes a file at ~/.cache/vectr/{hash}/index_cache.json that stores the modification timestamp of every indexed file. On subsequent vectr start calls, Vectr checks each file's current mtime against the cached value. Only files that have changed since the last index run are re-indexed. On a typical active development session where you've modified 5–10 files, subsequent re-indexing takes under 5 seconds.
The {hash} in the cache path is a short SHA-256 hash of the absolute workspace root path — just enough characters to make two different paths produce different cache directories with overwhelming probability. Each workspace gets its own isolated cache directory, so you can run Vectr on multiple projects simultaneously without cross-contamination.
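Deriving the cache directory is a few lines. This sketch assumes a 12-character prefix of the hex digest, since the exact length is not stated above, and `cache_dir_for` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def cache_dir_for(workspace_root: str, hash_chars: int = 12) -> Path:
    """Map a workspace root to its own cache directory under ~/.cache/vectr/."""
    root = Path(workspace_root).resolve()  # absolute, so relative spellings agree
    digest = hashlib.sha256(str(root).encode("utf-8")).hexdigest()[:hash_chars]
    return Path.home() / ".cache" / "vectr" / digest


project_a = cache_dir_for("/tmp/project_a")
project_b = cache_dir_for("/tmp/project_b")
```

Resolving the path first means `./repo` and its absolute form land in the same cache directory.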
Handling deletions
The tricky part of incremental re-indexing is deletion. If a file is removed, it has no mtime to check against — it simply isn't there. Vectr handles this by also storing the complete set of indexed file paths in the cache. At startup, it diffs this set against the current file tree and removes all chunks belonging to deleted files before re-indexing any modified ones. The order matters: process deletions first, then updates, then new files. This prevents a renamed file from leaving orphaned chunks in the index.
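The startup diff reduces to set operations on two path-to-mtime maps. A minimal sketch under the assumption that the cache is a flat mapping of relative path to mtime; `scan_mtimes` and `plan_reindex` are hypothetical names, and the real cache file layout may differ.

```python
import os


def scan_mtimes(root: str) -> dict[str, float]:
    """Walk the workspace and record each file's modification time."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            mtimes[os.path.relpath(full, root)] = os.stat(full).st_mtime
    return mtimes


def plan_reindex(cached: dict[str, float], current: dict[str, float]):
    """Return (deleted, modified, created), in the order they should be processed."""
    deleted = sorted(cached.keys() - current.keys())
    modified = sorted(p for p in cached.keys() & current.keys() if cached[p] != current[p])
    created = sorted(current.keys() - cached.keys())
    return deleted, modified, created


# A rename shows up as one deletion plus one creation.
cached = {"a.py": 1.0, "b.py": 2.0, "old_name.py": 3.0}
current = {"a.py": 1.0, "b.py": 2.5, "new_name.py": 3.0}
deleted, modified, created = plan_reindex(cached, current)
```

Handling `deleted` first removes the chunks for `old_name.py` before `new_name.py` is indexed, so the renamed file never appears twice in the index.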
The watchdog listener
During an active session, Vectr runs a watchdog listener on the workspace root. When a file is saved, the listener queues it for re-indexing in the background. You don't notice this happening — the next vectr_search call after a file save will return results that reflect the change. This closes the loop between editing and searching: you refactor a function, save, and subsequent searches return the updated version.
The re-indexing threshold is conservative: any write event to a source file not excluded by .vectrignore triggers re-indexing of that file's chunks, even for a one-line change. In practice, most editors issue multiple rapid write events per save (the editor writes, then a formatter rewrites, then the LSP may write again). The watchdog listener debounces these at 300ms — only the last write in a burst of events within 300ms actually triggers re-indexing. Without debouncing, a single save in a project using aggressive auto-formatting would trigger 3–5 redundant re-index operations. You'd rather spend 200ms re-indexing one file once than return stale results on a query that matters.
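The debounce logic itself is independent of the file watcher. Below is a sketch that takes timestamps as arguments so it can be reasoned about without real clocks. In a live listener the caller would pass something like `time.monotonic()`. The `Debouncer` class is a hypothetical illustration, not the watchdog integration.

```python
class Debouncer:
    """Collapse bursts of write events into one re-index per file."""

    def __init__(self, quiet_seconds: float = 0.3):
        self.quiet_seconds = quiet_seconds
        self._last_event: dict[str, float] = {}

    def record(self, path: str, now: float) -> None:
        """Remember the most recent write to this path, replacing earlier ones."""
        self._last_event[path] = now

    def due(self, now: float) -> list[str]:
        """Return paths that have been quiet long enough, and forget them."""
        ready = sorted(
            p for p, t in self._last_event.items() if now - t >= self.quiet_seconds
        )
        for p in ready:
            del self._last_event[p]
        return ready


debouncer = Debouncer()
# Editor write, formatter rewrite, then a third write, all within 250ms.
debouncer.record("resolver.py", 0.00)
debouncer.record("resolver.py", 0.10)
debouncer.record("resolver.py", 0.25)
too_early = debouncer.due(0.40)   # only 150ms since the last write
settled = debouncer.due(0.60)     # 350ms of quiet
```

Three write events produce a single re-index, and it runs only after the burst has finished.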
.vectrignore: Keeping the Index Clean
Not everything in a repository should be indexed. Generated files, binary assets, vendor directories, build outputs, test fixtures — these consume index space and add noise without contributing useful search results. A search for "authentication handler" should not return results from node_modules/ or minified vendor JavaScript.
Vectr reads a .vectrignore file from the workspace root using glob patterns. The syntax follows .gitignore conventions — trailing slash for directories, * for single-level wildcard, ** for recursive match (implemented via Python's pathlib.Path.match()) — but Vectr does not implement the full gitignore specification: the ! negation prefix is not supported. If you already have a well-tuned .gitignore, you can copy most of it directly. A typical .vectrignore for a mixed-stack project:
vendor/
node_modules/
dist/
build/
*.pb.go # generated protobuf Go files
*.min.js # minified JavaScript
__pycache__/
.venv/
coverage/
*.snap # Jest snapshots
migrations/ # Django database migrations
There's also a --exclude flag on vectr init for quick one-off exclusions without creating a file.
The practical importance of this: a codebase with node_modules/ in its tree will typically contain 5–20x more code from installed packages than from the project itself. Every search result page would be dominated by library code you didn't write and don't want to navigate. Excluding vendor directories is the single most impactful configuration change most users can make, and it should be done before the initial index run to avoid having to rebuild from scratch.
If you run vectr init on a project with node_modules/ before setting up .vectrignore, you'll index those directories and they'll bloat the vector store significantly. The fix is to delete the cache at ~/.cache/vectr/{hash}/ and re-run vectr init with the ignore file in place. It's not catastrophic — it's just wasteful. The startup warning Vectr emits when it detects large vendor-looking directories without a .vectrignore exists precisely to prevent this.
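For intuition about how such patterns filter paths, here is a deliberately simplified matcher. It handles only the two forms used in the example file (trailing-slash directory names and single-level filename globs), strips `#` comments, and does not implement `**`. It uses `fnmatch` for the glob step, which is a substitution and not a claim about Vectr's own matching code. `parse_ignore` and `is_ignored` are hypothetical names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_ignore(text: str) -> list[str]:
    """Read patterns, dropping blank lines and anything after a '#'."""
    patterns = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            patterns.append(line)
    return patterns


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """True if any pattern excludes this workspace-relative path."""
    path = PurePosixPath(rel_path)
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: matches that directory name at any depth.
            if pattern.rstrip("/") in path.parts[:-1]:
                return True
        elif fnmatchcase(path.name, pattern):
            # File pattern: glob against the file name only.
            return True
    return False


patterns = parse_ignore(
    "node_modules/\n"
    "*.min.js      # minified JavaScript\n"
    "\n"
    "# comment-only line\n"
    "migrations/\n"
)
```

Running the filter before chunking is what keeps excluded files out of the vector store, so the patterns cost nothing at query time.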
What Actually Happens When You Call vectr_search
Let's make the full pipeline concrete. When an AI code editor calls vectr_search("workspace lock acquisition and release"), here is the complete sequence of operations:
1. Query string is embedded using arctic-embed-m with query prefix
→ 768-dimensional float vector, ~15ms on CPU
2. Vector similarity search against ChromaDB store
→ Top-20 chunks by cosine similarity, with scores
3. Same query runs through BM25 index (rank-bm25, in-memory)
→ Top-20 chunks by BM25 score, with scores
4. Two ranked lists are merged
→ Weight BM25/vector based on codebase characterization
→ Normalized scores combined; top-5 results selected
5. Symbol names in the query are detected (camelCase, snake_case, PascalCase)
→ If found: also run vectr_locate as a side channel
→ Merge symbol lookup results into final output if relevant
6. Final top-N chunks returned with:
file path, start line, end line, matched text, search method
(default N=5, configurable via --top-k flag on vectr start)
The result for the query above might look like:
[1] resolver.rs:214 — WorkspaceLock::acquire()
Acquires the workspace-scoped lock. Blocks if another process holds it.
Stores PID and timestamp in lock file at workspace_root/.vectr_lock.
[2] resolver.rs:267 — WorkspaceLock::release()
Releases the workspace-scoped lock. Validates that the current process
holds the lock before releasing (returns Err if not held).
[3] workspace.py:89 — _acquire_workspace_lock(path)
Context manager: acquires, yields, releases on exit.
Instead of reading 15 files to find these three functions, the AI editor reads one search result. Navigation time drops from 15 calls to 1. Across a full task session, this compounds: the re-discovery tax on CPython internals dropped from 6–23 tool calls per task to 1–3 in the benchmarks I ran with the index active.
Design Decisions I'd Make Differently
Every system has decisions that looked reasonable at design time and only reveal their cost in production. Here are the three I'd revisit.
The Python 3.14 requirement
Vectr currently requires Python 3.14. The core reason is that the codebase uses match/case structural pattern matching extensively, along with some asyncio event loop patterns that technically work on 3.10+ but behave differently across versions in ways we hit in practice. In retrospect, 3.11 would probably work with a few hours of refactoring. The 3.14 requirement has been the single biggest adoption friction: a significant fraction of engineers who encounter it close the tab rather than update their Python version. This is worth fixing.
ChromaDB as the vector store
ChromaDB, a vector store that handles embedding persistence and similarity search, works fine, but it's heavyweight for what we actually need. The full HNSW index with persistence, the Python client layer, and the inter-process communication overhead add about 200ms to startup. That is ChromaDB's share alone; total Vectr startup is ~280ms including mtime diffing and watchdog initialization. For a tool invoked at the start of every session, that's noticeable. For v2, I'd consider a lighter in-process option: possibly a purpose-built flat index with exact brute-force search for codebases under 100k chunks, falling back to HNSW only for larger ones.
The BM25 implementation
The rank-bm25 library is pure Python and fast enough for codebases under 50,000 chunks. Beyond that, it starts to show latency — the in-memory index grows large and query time climbs past what's acceptable for interactive search. The right long-term solution is probably integrating BM25 scoring directly into the vector store query pipeline, or using a tokenized inverted index instead of the full in-memory approach. For current use cases, where most codebases are under 20K chunks, it's fine. But it's a known scale ceiling.
The indexing layer is the foundation, not the product. What it enables is an AI code editor that can navigate a large unfamiliar codebase as efficiently as a human engineer who has worked in it for months — finding the right functions in one or two calls instead of fifteen, without any accumulated session-over-session experience.
But "without any accumulated session-over-session experience" is the key clause. The index tells you where things are in the codebase. It doesn't tell you why things are the way they are: the non-obvious invariants, the patterns that emerge from reading 50 files, the bugs that were fixed by changing two lines in a place that looks completely unrelated. That kind of knowledge isn't in any single function. It's constructed from connections across many functions, and it doesn't survive a session boundary.
That's the problem Part 2 addresses. It covers the working memory layer: a note store where an AI editor can save findings in structured, tagged form — "the lock acquisition logic is at resolver.rs:214, and it acquires an exclusive file lock using fcntl.flock, not a threading primitive" — and retrieve them in under 50ms at the start of any future session. The form matters: it's not a pointer to the file, it's the synthesized understanding itself. When /compact runs and replaces the conversation with a summary, exact signatures and line numbers evaporate — but notes don't. The indexer tells you where to look. The working memory layer tells you what you already know about what you found.
- The Probabilistic Relevance Framework: BM25 and Beyond
- snowflake-arctic-embed-m-v1.5 model card
- cAST: Enhancing Code RAG with Structural Chunking via AST