Building Vectr — Part 3 of 3
What the benchmark numbers actually mean — and the methodology choices that decide whether a headline number is honest or a lie you told yourself.
A benchmark number without context is worse than no number at all. No number leaves you uncertain. A number without context leaves you confidently wrong — which is a strictly worse place to be.
The most dangerous move when you evaluate a developer tool is to run a benchmark and post the headline figure with no account of what it measures, what it ignores, and where the methodology quietly assumes things that won't hold for the person reading it. I've watched this fail in both directions. A tool advertises "+40% productivity" on a task so narrow the tool basically can't lose. Another reports "+3% cost reduction" while its own overhead is folded into the denominator, hiding the per-task savings that were the whole point.
This is the third and final post in the Building Vectr series. Part 1 built the semantic index — finding things in a codebase by meaning instead of keyword. Part 2 built the working memory that survives /compact and session boundaries. Now comes the part where I have to find out if any of it was worth building. I'll walk through exactly what Vectr was measured on, the calls I made about methodology, where those calls are shaky, and — the part most write-ups skip — what the numbers genuinely do not let you conclude.
The Re-Discovery Tax
The hypothesis is simple. An AI code editor working on an unfamiliar codebase spends a real fraction of its time and token budget on navigation — finding where things are — rather than on implementation. Cut the navigation cost and you cut total session cost and time without touching output quality.
So the thing to measure is the re-discovery cost per implementation task: how much does the agent spend figuring out where things live before it starts writing code?
That is not the same as total session cost. Total session cost is everything — navigation, reasoning, writing, testing, verification. If your tool kills navigation overhead but adds its own overhead in search and recall calls, you have to measure both sides or you don't know whether the net is positive. This is the most common way benchmarks for memory tools mislead: they measure one side and report it as if it were the whole ledger.
The metric I settled on: Read + Bash calls before the first file write. I call it the re-discovery tax. It is not perfect. A sharp agent might read only a couple of large architectural files and then implement fast; a weaker one might read many small files and still be lost. But as a first-order proxy for "how much did you spend orienting versus doing the work," it captures the thing I care about, and it has the virtue of being unambiguous to count from a session transcript.
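Counting this from a transcript is mechanical. Here is a minimal sketch, assuming the transcript has already been reduced to an ordered list of tool names and that both Write and Edit count as "a file write" (the post does not pin down either detail). The function name and the sample session are hypothetical.

```python
from typing import Iterable

# Assumption: these tool names mirror the standard tools listed in this post.
NAVIGATION_TOOLS = {"Read", "Bash"}
WRITE_TOOLS = {"Write", "Edit"}


def rediscovery_tax(tool_calls: Iterable[str]) -> int:
    """Count Read + Bash calls that occur before the first file write."""
    count = 0
    for name in tool_calls:
        if name in WRITE_TOOLS:
            break  # orientation ends once the agent starts writing
        if name in NAVIGATION_TOOLS:
            count += 1
    return count


# Hypothetical session: three navigation calls, then the first edit.
session = ["Read", "Bash", "vectr_recall", "Read", "Edit", "Read", "Bash"]
print(rediscovery_tax(session))
```

Calls after the first write are ignored on purpose, and a session that never writes is charged for every navigation call it made.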
Secondary metrics: total session cost in dollars (from API pricing), wall time, and total turn count. These triangulate the primary metric — if re-discovery drops but cost doesn't, something is off.
Hire a plumber for an unfamiliar old house. The first hour is rarely plumbing. It's finding the shutoff valve, tracing which pipe feeds which fixture, working out why the previous owner ran a line through the wrong wall. The re-discovery tax is that first hour. A plumber who has the building's plans in hand skips most of it and starts cutting pipe. Vectr's bet is that a good index plus stored findings is the building's plans — and the benchmark is asking how much of that first hour it actually removes.
The Benchmark Design: CPython and Six Tasks
Why CPython
I picked CPython for four reasons. It's a real production codebase, not a toy: more than 4,000 files spanning C, Python, and a little assembly. It's public, so there's no IP problem in using it as benchmark material. It's genuinely unfamiliar to most people: even seasoned Python engineers don't spelunk through CPython internals daily, so the "unfamiliar codebase" condition is satisfied for free. And the tasks need an understanding of non-obvious system behavior rather than pattern-matching on the obvious, which is exactly where navigation overhead runs highest.
I also ran earlier benchmarks on Apache Camel (Java, 5,856 files) to check the results weren't a CPython artifact. Camel came out directionally similar and actually a bit stronger — probably because it's larger and its naming conventions are less guessable than CPython's reasonably disciplined ones.
The six tasks
Six tasks chosen to span different kinds of implementation work — two bug investigations and four feature additions, with three of them designed to test cross-session recall:
| Task | Type | What it involves |
|---|---|---|
| debug_gc_finalizer | Bug investigation | Finding a non-obvious race condition in Python's GC finalizer ordering. |
| feature_dict_pop_last | Feature addition | Adding dict.pop_last(), which requires understanding CPython's dict internals. |
| cross_session_set_cartesian | Feature + test | Implementing set cartesian product across multiple module interactions. |
| debug_descriptor_priority | Bug investigation | Descriptor protocol priority ordering; requires knowing the Python data model. |
| cross_session_bytes_find_all | Feature addition | Adding bytes.find_all() to the CPython bytes object implementation. |
| cross_session_list_rotate | Feature addition | Adding list.rotate(), which requires understanding list internals in listobject.c. |
The cross_session_* tasks exist specifically to test cross-session memory: the research session runs once and stores notes, then each implementation session starts fresh and calls vectr_recall. That separation is the whole point, and it's the part of the design that most naive benchmarks get wrong — which is the next section.
Two agents: vanilla vs vectr
Each task runs with two agents on the same codebase, same version, same starting state, same task prompt:
Vanilla agent:
- Claude (claude-sonnet-4-6)
- Standard file tools only: Read, Bash, Write, Edit
- No vectr; navigates with grep and blind reads
Vectr agent:
- Same model, same standard tools
- Plus vectr's 13 MCP tools
- Workflow: status → recall → search → implement → remember
Both run under claude -p (non-interactive mode) with a tool-call ceiling so a confused session can't run away and poison the numbers. The vectr agent's intended loop is: call vectr_status() first, recall research notes if any exist, run vectr_search for initial navigation, implement, store key findings with vectr_remember, finish.
The Two-Phase Trap That Sinks Naive Benchmarks
Here is the piece worth slowing down on, because it's where most benchmarks for a memory tool go wrong before they collect a single number.
Vectr has a research phase and implementation phases. The research phase is one shared session: an agent explores the codebase, stores findings with vectr_remember, and exits. The implementation phases are six separate sessions, each calling vectr_recall at the start.
These two phases pull in opposite directions, and that's the whole subtlety.
The research phase costs more with vectr than vanilla. The agent does extra work — calling vectr_search to navigate, yes, but also vectr_remember to write down what it finds. More output tokens, more turns, higher cost. You're paying to build the map.
The implementation phases cost less with vectr. The agent calls vectr_recall once at the start, gets the relevant findings immediately, and skips the navigation the vanilla agent has to redo from scratch every time. You're spending the map.
Sum the two into a single "total sprint cost" and you get a number that blends those opposite dynamics into mush. The total can read positive — vectr costs more — even when the implementation phases show clear savings, because the one-time research overhead drowns them. That's exactly what happened in our runs.
| Phase | Vanilla | Vectr | Delta |
|---|---|---|---|
| Research (1 session, paid once) | $1.36 | $2.63 | +94% |
| Impl (6 sessions, each repeating) | $2.50 | $1.97 | −21% |
| Total sprint | $3.86 | $4.60 | +19% |
"Vectr costs +19%." That's the headline if you report the bottom row. It's also misleading, and here's the arithmetic that shows why. The research overhead is +$1.27, paid once. The implementation savings are −$0.53 per six-task sprint (about $0.088 per task), and they recur on every sprint that reuses those notes. The research investment doesn't get re-paid; the savings keep arriving. So the break-even is the point where the recurring per-task saving has clawed back the one-time overhead: $1.27 ÷ $0.088 ≈ 14 tasks. Past roughly fourteen or fifteen implementation tasks on the same notes, total cost with vectr drops below vanilla and keeps dropping.
For a memory tool, the research phase is an amortized investment: cost front-loaded, benefit running as long as the notes stay relevant. A benchmark that measures a single sprint — six tasks, well short of the ~14-task break-even — systematically overstates the tool's cost for any team that keeps working in the same codebase area. The fix isn't a better number. It's reporting the two phases separately and letting the reader place their own usage on the amortization curve.
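To place your own usage on that curve, the arithmetic fits in a few lines. This is a sketch under one strong assumption: the per-task saving stays constant for as long as the notes are reused, which the benchmark did not test beyond one sprint. The function names are hypothetical; the dollar figures are the ones from the phase table.

```python
import math


def break_even_tasks(research_overhead: float, saving_per_task: float) -> float:
    """Implementation tasks needed to repay a one-time research overhead."""
    if saving_per_task <= 0:
        return math.inf  # no recurring saving, so the overhead is never repaid
    return research_overhead / saving_per_task


research_overhead = 2.63 - 1.36        # extra research cost with vectr, paid once
saving_per_task = (2.50 - 1.97) / 6    # implementation saving per task


def net_cost_delta(n_tasks: int) -> float:
    """Vectr cost minus vanilla cost after n_tasks reusing the same notes."""
    return research_overhead - n_tasks * saving_per_task


tasks = break_even_tasks(research_overhead, saving_per_task)
print(f"break-even after about {tasks:.1f} implementation tasks")
print(f"net delta after one six-task sprint: {net_cost_delta(6):+.2f} USD")
```

Swapping in your own overhead and per-task saving moves the cross-over point. A negative or zero saving returns infinity, which is the honest answer for a codebase where the tool does not help.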
The Implementation Phase Numbers
The implementation sessions are where "is vectr helping?" has a clean answer, because they isolate the spend-the-map side of the ledger from the build-the-map side. Across all six tasks combined:
| Metric | Vanilla | Vectr | Delta |
|---|---|---|---|
| Cost | $2.50 | $1.97 | −21% |
| Wall time | 17.6 min | 13.5 min | −24% |
| Turns | 123 | 94 | −24% |
| Read + Bash calls | 102 | 62 | −39% |
The Read+Bash reduction is the most informative line. Forty fewer file reads and grep calls across six tasks. That's where the time and cost actually come from — not the model writing shorter answers, but skipping navigation. Cost and time follow the tool-call count because tool calls are what drive both: each Read brings back tokens you pay for, each one costs a round trip of wall time. When navigation drops 39%, cost dropping 21% and time dropping 24% is the consequence, not a coincidence.
Now the per-task breakdown — the re-discovery tax (Read + Bash before the first write) for each task, vanilla versus vectr:
| Task | Vanilla | Vectr | Delta |
|---|---|---|---|
| debug_gc_finalizer | 16 | 6 | −62% |
| feature_dict_pop_last | 13 | 3 | −77% |
| cross_session_set_cartesian | 23 | 9 | −61% |
| debug_descriptor_priority | 6 | 6 | 0% |
| cross_session_bytes_find_all | 13 | 2 | −85% |
| cross_session_list_rotate | 21 | 16 | −24% |
Five of six tasks show meaningful reduction. One, debug_descriptor_priority, shows nothing. That zero is more interesting than any of the wins, and it gets its own section.
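The per-task deltas can be recomputed from the raw counts, which is a cheap way to check a table like this one. A minimal sketch: the counts are copied from the table, while the helper name and the sort order are my own choices.

```python
# (vanilla, vectr) re-discovery counts copied from the per-task table.
counts = {
    "debug_gc_finalizer": (16, 6),
    "feature_dict_pop_last": (13, 3),
    "cross_session_set_cartesian": (23, 9),
    "debug_descriptor_priority": (6, 6),
    "cross_session_bytes_find_all": (13, 2),
    "cross_session_list_rotate": (21, 16),
}


def percent_change(vanilla: int, vectr: int) -> float:
    """Relative change in re-discovery calls; negative means fewer calls."""
    return (vectr - vanilla) / vanilla * 100


# Print tasks from largest reduction to smallest to make the spread visible.
for task, (vanilla, vectr) in sorted(
    counts.items(), key=lambda item: percent_change(*item[1])
):
    print(f"{task:30s} {vanilla:3d} -> {vectr:3d}  {percent_change(vanilla, vectr):+.1f}%")

improved = [task for task, (vanilla, vectr) in counts.items() if vectr < vanilla]
print(f"{len(improved)} of {len(counts)} tasks improved")
```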
The 0% Case: What Vectr Doesn't Help With
debug_descriptor_priority is a descriptor protocol bug — working out why __get__ fires at an unexpected priority relative to __getattribute__. The fix is a three-line change in a very specific part of the descriptor lookup chain. Vanilla's re-discovery was 6 calls. Vectr's was also 6. No improvement.
Why? My read: the descriptor protocol is one of the most thoroughly documented corners of Python's data model. Every senior Python engineer knows roughly where it lives in CPython source. The model's training knowledge is strong enough that it lands on the right file with high confidence on the first or second Read — not because it knows this codebase, but because it knows Python deeply. There's nothing for a search tool to discover that the model didn't already carry in.
There's a general principle here. Vectr's navigation advantage is largest when three things hold at once, and smallest when their opposites do:
| Largest advantage | Smallest advantage |
|---|---|
| Codebase is large and unfamiliar | Well-known codebase the model has seen |
| Naming isn't inferrable from domain knowledge | Obvious, conventional naming |
| Code isn't in a well-known framework or stdlib | Deep training coverage of the exact area |
debug_descriptor_priority sits hard at the right-hand column. The 0% is expected and honest. A tool that improved every task regardless of these conditions would be a red flag, not a feature — it would mean the benchmark wasn't measuring anything real.
The cleanest predictor is the vanilla re-discovery count: how many reads did the baseline agent need before its first write? If vanilla is low (≤5), the codebase was already easy to navigate and the search tools add little. If vanilla is high (≥12), vectr's improvement tends to scale with how unfamiliar the code is. Here, debug_descriptor_priority had vanilla = 6, just above the low band; cross_session_set_cartesian had vanilla = 23, well into the high one. Look at the baseline before you predict the win.
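As a rule of thumb this can be written down directly. The sketch below uses the two thresholds from the paragraph above. The middle band and its "uncertain" label are my assumption, since the post only describes the two ends, and the function name is hypothetical.

```python
def expected_benefit(vanilla_rediscovery: int) -> str:
    """Rough guess at how much a search tool can help, from the baseline count."""
    if vanilla_rediscovery <= 5:
        return "little"     # baseline already navigates cheaply
    if vanilla_rediscovery >= 12:
        return "large"      # baseline spends heavily on orientation
    return "uncertain"      # assumed middle band, not defined in the post


for task, vanilla in [("debug_descriptor_priority", 6),
                      ("cross_session_set_cartesian", 23)]:
    print(task, expected_benefit(vanilla))
```

It is a heuristic, not a guarantee: cross_session_list_rotate had a high baseline of 21 and still improved by only 24%, for the recall-precision reasons covered later.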
The B9 Bug and Why Earlier Runs Lied
The benchmarks didn't all start from a correct implementation, and that matters for reading any number from this series.
In runs B3 through B6, vectr_recall was broken: it used SQL LIKE substring matching instead of semantic search over the vector store. So even though the research phase had stored detailed notes about all six task areas, the implementation sessions' recall calls came back empty or near-empty. Those sessions paid the overhead of MCP setup and recall calls and got none of the benefit. Predictably, they showed vectr costing more than vanilla on implementation — which looked like a product failure but was a single broken function.
B7 was a full re-run after B9 fixed recall. The difference is stark.
| Metric | Pre-B9 (broken recall) | Post-B9 (B7) |
|---|---|---|
| Recall firing with results | 0 / 6 impl tasks | 4 / 6 impl tasks |
| Impl session cost vs vanilla | +8% to +25% | −21% |
| R+B calls vs vanilla | −5% to +15% | −39% |
The two tasks where recall still didn't fire have different explanations, and neither is a bug:
- debug_descriptor_priority: vanilla re-discovery was only 6, so recall would have helped minimally even if it had fired. Nothing to recover.
- cross_session_list_rotate: the research session stored notes, but the impl session's recall query didn't match them well enough. The notes were broad (general list internals); the task was specific (list.rotate() needed listobject.c's memory layout in detail). That's a recall precision problem, not a bug: the notes existed but weren't sharp enough.
Which is its own honest finding: vectr helps most when the research notes closely match the implementation tasks. Research that's too broad — storing general codebase knowledge instead of task-specific findings — dilutes the recall benefit. The granularity of what you write down is itself a variable that decides whether the tool pays off.
For four benchmark runs, the data said "vectr makes implementation more expensive." That conclusion was wrong, and it was caused by one function using substring matching where it needed vector similarity. The lesson for anyone benchmarking their own tool: a bad number is a hypothesis about your code as much as a verdict on your idea. Before you trust a disappointing result, confirm the path it depends on actually works end to end.
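The failure mode is easy to reproduce in miniature. The sketch below is not vectr's storage layer: the table schema, the note text, and the three-dimensional vectors are all invented for illustration, and the vectors stand in for what a real embedding model would produce. It shows why a paraphrased query returns nothing under LIKE but can still rank the right note under cosine similarity.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO notes (body) VALUES (?)",
    ("Finalizers run in a fixed order during garbage collection",),
)

query = "gc destructor ordering"

# Substring recall: the whole query must appear verbatim inside a note.
like_hits = conn.execute(
    "SELECT body FROM notes WHERE body LIKE ?", (f"%{query}%",)
).fetchall()
print("LIKE hits:", like_hits)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hand-written toy vectors standing in for real embeddings (assumption).
query_vec = [0.8, 0.2, 0.4]
gc_note_vec = [0.9, 0.1, 0.3]        # the finalizer note above
unrelated_note_vec = [0.1, 0.9, 0.0]  # some note about a different area

print("related:", cosine(query_vec, gc_note_vec))
print("unrelated:", cosine(query_vec, unrelated_note_vec))
```

A check like this, run against the real recall path with a query you know should match, is the kind of end-to-end confirmation that would have caught the problem before four runs were spent on it.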
The Tool Usage Breakdown
Across the six implementation sessions, the vectr agent's tool usage tells you exactly how the savings happen:
| Tool | Calls | Primary use |
|---|---|---|
| vectr_recall | 4 | Session start: retrieve research notes. |
| vectr_status | 5 | Session start: check if notes exist. |
| vectr_search | 1 | Mid-session: find a specific function not in notes. |
| standard Read | 62 | File reads after navigation. |
| standard Bash (grep) | 0 | Replaced entirely by vectr_locate. |
The pattern is clean: vectr tools replace Bash (grep down to zero) and substantially cut Read. The dominant use is status + recall at session start, not mid-session search — which is exactly the intended design. Recall the research notes first, implement from them, only reach for vectr_search for the specifics the notes didn't cover.
The zero Bash calls deserve a closer look. Vanilla implementation sessions leaned on grep heavily for exact symbol finding — grepping function names, class definitions, imports. The vectr sessions used vectr_locate for the same job. The difference: vectr_locate returns the definition's location without reading the file, while grep is a Bash call that also drags back surrounding context you then pay to process. Same question, cheaper answer.
What A Controlled Benchmark Can't Measure
Every controlled benchmark simplifies in ways real usage doesn't. Here's where I think this methodology is weakest — stated plainly, because a limitations section that buries the real gaps is just marketing with footnotes.
Fixed codebase version
Both agents work on the same frozen CPython. In real life the codebase moves under you, and stale notes turn into a liability — a note written against an older version may describe behavior that has since changed. The [STALE] marker helps, but it only fires when a file is renamed or deleted, not when its contents change. The right question for a stale note isn't "does the file still exist?" but "does the note still accurately describe the code?" — and that's much harder to detect automatically. In practice you have to build the habit of verifying a note against current code before trusting it for precise work.
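One way to narrow that gap is to fingerprint the file when a note is written and compare later. The sketch below is an assumption about how such a check could look, not a description of how vectr's [STALE] marker works. The `Note` class and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of the file's current bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


@dataclass
class Note:
    text: str
    path: Path
    content_hash: str  # fingerprint of the file when the note was written


def make_note(text: str, path: Path) -> Note:
    """Create a note and record what the file looked like at that moment."""
    return Note(text=text, path=path, content_hash=fingerprint(path))


def note_status(note: Note) -> str:
    """Classify a note against the file as it exists now."""
    if not note.path.exists():
        return "missing"    # renamed or deleted
    if fingerprint(note.path) != note.content_hash:
        return "changed"    # contents differ, so re-verify before trusting
    return "unchanged"
```

"changed" only means the file moved on, not that the note is wrong. A whitespace edit trips it just as readily as a behavioral change, so it flags notes for re-verification rather than answering whether the note still describes the code.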
Two agents, no human in the loop
This compares agent against agent. It does not measure human productivity. An engineer who's lived in a codebase for six months already carries the equivalent of research notes in their head; vectr's most direct benefit is the human-plus-AI workflow where the AI does exploration that would otherwise burn many turns. "Vectr makes developers X% more productive" is a stronger claim than this benchmark can support, and I'm not going to make it.
Six tasks in one codebase area
All six tasks live in CPython's core implementation, which shares key files and modules. So the research notes from one task are partially reusable for others — not by design, but because the code regions overlap. That inflates the apparent benefit of note reuse. A more rigorous version would use tasks in completely non-overlapping regions, to measure recall precision independently of accidental cross-task reuse.
No quality measurement
We measured cost, time, and tool calls. We did not measure correctness in any deep sense. Did both agents produce correct implementations? Did the vectr agent ever ship a bug because it trusted a stale recall instead of reading current code? In our sessions both agents passed the relevant CPython tests — but "passed tests" is a coarse correctness measure for a bug-fix task. I'd want task-specific human review before claiming quality held.
A faster, cheaper session that produces wrong output is strictly worse than a slow, expensive one that produces correct output. Never read a cost or time reduction as a quality improvement without separately checking correctness. These benchmarks checked test passage, not human review — that's a real methodological gap, and pretending otherwise would undercut everything else here.
Why The Real Benchmark Is Adoption
The most honest thing I can say about controlled benchmarks for a tool like this: they're useful for debugging and for showing the tool doesn't make things worse, but they're not how you find out whether it actually helps in production.
The CPython and Camel runs were designed by me, run by me, with agents I configured. The research phase was written by me — so the quality of those notes reflects my sense of what's worth writing down. An independent user, on a codebase they actually work in, writing notes about things they actually need, will produce different notes, different recall patterns, different tasks. Their numbers won't be mine.
The right benchmark for a developer tool is many users, many codebases, over time, with self-reported productivity. I can't manufacture that. What I can do is make vectr cheap enough to try that real users run it on their own code and find out for themselves. The published numbers aren't the product — they're evidence the approach isn't broken. Adoption is what produces the signal that matters.
Two things make that experiment near-free to run:
Local model — no API cost, no data leaving the machine
The embedding model runs locally. The barrier to "try it on your proprietary codebase" is essentially zero, because you're not sending your code anywhere. For most teams that's the difference between "maybe later" and "I'll run it tonight."
MCP protocol — no per-editor plugin
One vectr server works with Claude Code, Cursor, VS Code + Copilot, Windsurf, Continue, and Cline because they all speak MCP. Integration cost is two lines of JSON config, not a plugin per editor.
So the bet is asymmetric. If vectr fits your situation — large, unfamiliar, unconventionally named code — the savings are real. If it doesn't — small, well-known, deep training coverage — you've lost the five minutes it took to install. That's a reasonable experiment, and it's a far better source of truth than any number I can publish.
Reading The Benchmark Numbers Honestly
Let me be direct about what I think the numbers show and don't.
What the numbers show:
- On 5 of 6 CPython tasks, re-discovery dropped meaningfully (−24% to −85%)
- Across all 6 impl sessions: cost −21%, time −24%, R+B −39%
- The B9 semantic-recall fix was critical; without it, vectr added cost
- The research vs impl distinction matters; +19% total is the wrong headline
What the numbers don't show:
- Whether this generalizes to your codebase and task type
- Whether quality holds (test passage only, not human review)
- Long-term stale-note dynamics over weeks of real work
- Human productivity: we measured agents, not human+AI
And to make the boundary concrete rather than abstract:
Where vectr adds little:
- Well-known frameworks with deep training data (stdlib, popular internals)
- Small codebases (under ~500 files) where grep is fast
- Short tasks (under ~30 tool calls) where setup overhead dominates
Where vectr helps most:
- First week in an unfamiliar codebase with non-obvious naming
- Multi-session work on a large proprietary system where /compact fires
- Cross-cutting tasks spanning three subsystems before you touch one
What I'd Measure Next
If I were extending this, four additions would matter more than another CPython run.
Quality scoring by human review
Test passage is necessary, not sufficient. I'd want at least two engineers independently assessing each output for correctness, idiomatic quality, and edge-case handling — the things a test suite quietly lets through.
Stale-note tracking
After four weeks of normal development, how many notes have gone stale? Does the [STALE] marker catch the important ones, given it only fires on path changes? How much does relying on a stale note degrade output? This is the gap between the benchmark's frozen codebase and real life, and it's the one I'd close first.
Independent user runs
Give vectr to five engineers who've never used it, on codebases they pick, with tasks they set — and collect their before/after observations without telegraphing the result I expect. That's the closest thing to the real benchmark that I could actually run.
Recall at scale
Our research sessions stored 15–20 notes per task area. Real long-term use might accumulate 200–500 notes over months. Does recall precision hold at that scale? Does the tag-and-priority system do enough to keep recall relevant when the haystack is an order of magnitude larger? And cross_session_list_rotate showed the failure mode worth characterizing precisely: what note granularity is needed for what task type?
Conclusion — Three Posts, Three Layers
Three posts, three layers of the same system, and now they close.
Part 1 — the indexer. AST-aware chunking, code-specific local embeddings, hybrid search, a symbol graph with five fallback strategies. The goal: find any function or concept in a 10,000-file codebase in under 20ms, in a single call.
Part 2 — working memory. Notes that survive /compact and session boundaries. Semantic recall that retrieves by concept, not substring. The correct framing — recall-cost avoidance, not token release — and what it took to get there after building the wrong thing first.
Part 3 — measurement. The research vs implementation distinction that makes the +19% total sprint cost misleading. The five of six tasks where re-discovery fell sharply, the one where it didn't, and why that one is more informative than the five that worked. The limitations that decide whether any of this applies to you.
The honest version of the value proposition: on large, unfamiliar codebases, vectr cuts the re-discovery tax an AI code editor pays on every implementation session. The research investment compounds across tasks. The implementation savings are real enough that somewhere past the dozen-or-so-task mark — fourteen on the CPython numbers — the math turns positive and stays that way.
Whether that's your situation — large and unfamiliar enough, enough tasks in the same area for the investment to pay off, the five-minute install worth the experiment — is the one thing I genuinely can't tell you from here. The benchmark shows the tool isn't broken. Adoption shows whether it works for you. That's the only honest place to end a series about measuring your own work.
References & Further Reading
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  The reference point for evaluating coding agents on real repositories. Useful contrast: SWE-bench measures task resolution, not navigation overhead, which is the metric this post argues is the missing one for memory tools.
- Lost in the Middle: How Language Models Use Long Contexts
  Why piling navigation context into the window hurts rather than helps: the positional degradation that makes reducing the re-discovery tax worth doing in the first place.
- Model Context Protocol Specification
  The open standard behind the "one server, every editor" claim. Why vectr ships as an MCP server rather than per-editor plugins.
- MemGPT: Towards LLMs as Operating Systems
  The virtual-context framing that informed vectr's working-memory design and the thresholds for prompting note-saving before context degrades.
- Vectr: Semantic Code Search + Working Memory for AI Editors
  Setup instructions, documentation for all 13 MCP tools, and the CLAUDE.md template referenced throughout this series.
- Building Vectr, Part 1: Why grep Fails When You Don't Know the Keywords
  The indexing layer the implementation savings rest on: AST chunking, hybrid BM25+vector search, symbol graphs.
- Building Vectr, Part 2: What /compact Destroys and How to Survive It
  The working-memory layer the research phase depends on, including the B9 recall bug whose fix prompted the re-run behind this post's numbers.