Building Vectr — Part 3 of 3

What the benchmark numbers actually mean — and the methodology choices that decide whether a headline number is honest or a lie you told yourself.


A benchmark number without context is worse than no number at all. No number leaves you uncertain. A number without context leaves you confidently wrong — which is a strictly worse place to be.

The most dangerous move when you evaluate a developer tool is to run a benchmark and post the headline figure with no account of what it measures, what it ignores, and where the methodology quietly assumes things that won't hold for the person reading it. I've watched this fail in both directions. A tool advertises "+40% productivity" on a task so narrow the tool basically can't lose. Another reports "+3% cost reduction" while its own overhead is folded into the denominator, hiding the per-task savings that were the whole point.

This is the third and final post in the Building Vectr series. Part 1 built the semantic index — finding things in a codebase by meaning instead of keyword. Part 2 built the working memory that survives /compact and session boundaries. Now comes the part where I have to find out if any of it was worth building. I'll walk through exactly what Vectr was measured on, the calls I made about methodology, where those calls are shaky, and — the part most write-ups skip — what the numbers genuinely do not let you conclude.

Part 1
What We Were Trying To Measure
01

The Re-Discovery Tax

The hypothesis is simple. An AI code editor working on an unfamiliar codebase spends a real fraction of its time and token budget on navigation — finding where things are — rather than on implementation. Cut the navigation cost and you cut total session cost and time without touching output quality.

So the thing to measure is the re-discovery cost per implementation task: how much does the agent spend figuring out where things live before it starts writing code?

That is not the same as total session cost. Total session cost is everything — navigation, reasoning, writing, testing, verification. If your tool kills navigation overhead but adds its own overhead in search and recall calls, you have to measure both sides or you don't know whether the net is positive. This is the most common way benchmarks for memory tools mislead: they measure one side and report it as if it were the whole ledger.

The metric I settled on: Read + Bash calls before the first file write. I call it the re-discovery tax. It is not perfect. A sharp agent might read only a couple of large architectural files and then implement fast; a weaker one might read many small files and still be lost. But as a first-order proxy for "how much did you spend orienting versus doing the work," it captures the thing I care about, and it has the virtue of being unambiguous to count from a session transcript.
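To make the counting rule concrete, here is a minimal sketch of how the re-discovery tax could be computed from a transcript. It assumes the transcript has already been reduced to an ordered list of tool names, and that both Write and Edit count as "the first file write"; the function and constant names are illustrative, not part of vectr.

```python
from typing import Iterable

# Assumption: these are the tool names as they appear in the transcript.
NAVIGATION_TOOLS = {"Read", "Bash"}
WRITE_TOOLS = {"Write", "Edit"}


def rediscovery_tax(tool_calls: Iterable[str]) -> int:
    """Count Read + Bash calls that occur before the first file write."""
    count = 0
    for name in tool_calls:
        if name in WRITE_TOOLS:
            break  # orientation ends at the first write
        if name in NAVIGATION_TOOLS:
            count += 1
    return count


# Hypothetical session: three navigation calls, one search, then an edit.
session = ["Read", "Bash", "Read", "vectr_search", "Edit", "Read", "Bash"]
print(rediscovery_tax(session))  # 3
```

Calls to vectr's own tools are deliberately not counted here, which is why they need the separate cost and turn metrics to keep the ledger honest. A session that never writes a file returns its full Read + Bash count.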

Secondary metrics: total session cost in dollars (from API pricing), wall time, and total turn count. These triangulate the primary metric — if re-discovery drops but cost doesn't, something is off.

Analogy — The plumber's first hour

Hire a plumber for an unfamiliar old house. The first hour is rarely plumbing. It's finding the shutoff valve, tracing which pipe feeds which fixture, working out why the previous owner ran a line through the wrong wall. The re-discovery tax is that first hour. A plumber who has the building's plans in hand skips most of it and starts cutting pipe. Vectr's bet is that a good index plus stored findings is the building's plans — and the benchmark is asking how much of that first hour it actually removes.

02

The Benchmark Design: CPython and Six Tasks

Why CPython

I picked CPython for four reasons. It's a real production codebase, not a toy: more than 4,000 files spanning C, Python, and a little assembly. It's public, so there's no IP problem in using it as benchmark material. It's genuinely unfamiliar to most people: even seasoned Python engineers don't spelunk through CPython internals daily, so the "unfamiliar codebase" condition is satisfied for free. And the tasks need an understanding of non-obvious system behavior rather than pattern-matching on the obvious, which is exactly where navigation overhead runs highest.

I also ran earlier benchmarks on Apache Camel (Java, 5,856 files) to check the results weren't a CPython artifact. Camel came out directionally similar and actually a bit stronger — probably because it's larger and its naming conventions are less guessable than CPython's reasonably disciplined ones.

The six tasks

Six tasks chosen to span different kinds of implementation work — two bug investigations and four feature additions, with three of them designed to test cross-session recall:

| Task | Type | What it involves |
| --- | --- | --- |
| debug_gc_finalizer | Bug investigation | Finding a non-obvious race condition in Python's GC finalizer ordering. |
| feature_dict_pop_last | Feature addition | Adding dict.pop_last() — requires understanding CPython's dict internals. |
| cross_session_set_cartesian | Feature + test | Implementing set cartesian product across multiple module interactions. |
| debug_descriptor_priority | Bug investigation | Descriptor protocol priority ordering; requires knowing the Python data model. |
| cross_session_bytes_find_all | Feature addition | Adding bytes.find_all() to the CPython bytes object implementation. |
| cross_session_list_rotate | Feature addition | Adding list.rotate() — requires understanding list internals in listobject.c. |

The cross_session_* tasks exist specifically to test cross-session memory: the research session runs once and stores notes, then each implementation session starts fresh and calls vectr_recall. That separation is the whole point, and it's the part of the design that most naive benchmarks get wrong — which is the next section.

Two agents: vanilla vs vectr

Each task runs with two agents on the same codebase, same version, same starting state, same task prompt:

Vanilla
  • Claude claude-sonnet-4-6
  • Standard file tools only: Read, Bash, Write, Edit
  • No vectr — navigates with grep and blind reads
Vectr
  • Same model, same standard tools
  • Plus vectr's 13 MCP tools
  • Workflow: status → recall → search → implement → remember

Both run under claude -p (non-interactive mode) with a tool-call ceiling so a confused session can't run away and poison the numbers. The vectr agent's intended loop is: call vectr_status() first, recall research notes if any exist, run vectr_search for initial navigation, implement, store key findings with vectr_remember, finish.

03

The Two-Phase Trap That Sinks Naive Benchmarks

Here is the piece worth slowing down on, because it's where most benchmarks for a memory tool go wrong before they collect a single number.

Vectr has a research phase and implementation phases. The research phase is one shared session: an agent explores the codebase, stores findings with vectr_remember, and exits. The implementation phases are six separate sessions, each calling vectr_recall at the start.

These two phases pull in opposite directions, and that's the whole subtlety.

The research phase costs more with vectr than vanilla. The agent does extra work — calling vectr_search to navigate, yes, but also vectr_remember to write down what it finds. More output tokens, more turns, higher cost. You're paying to build the map.

The implementation phases cost less with vectr. The agent calls vectr_recall once at the start, gets the relevant findings immediately, and skips the navigation the vanilla agent has to redo from scratch every time. You're spending the map.

Sum the two into a single "total sprint cost" and you get a number that blends those opposite dynamics into mush. The total can read positive — vectr costs more — even when the implementation phases show clear savings, because the one-time research overhead drowns them. That's exactly what happened in our runs.

| Phase | Vanilla | Vectr | Delta |
| --- | --- | --- | --- |
| Research (1 session, paid once) | $1.36 | $2.63 | +94% |
| Impl (6 sessions, each repeating) | $2.50 | $1.97 | −21% |
| Total sprint | $3.86 | $4.60 | +19% |

"Vectr costs +19%." That's the headline if you report the bottom row. It's also misleading, and here's the arithmetic that shows why. The research overhead is +$1.27, paid once. The implementation savings are −$0.53 per six-task sprint — about $0.088 per task — and they recur on every sprint that reuses those notes. The research investment doesn't get re-paid; the savings keep arriving. So the break-even is the point where the recurring per-task saving has clawed back the one-time overhead: $1.27 ÷ $0.088 ≈ 14 tasks. Past roughly fourteen or fifteen implementation tasks on the same notes, total cost with vectr drops below vanilla and keeps dropping. (Don't take my word for the cross-over — the calculator further down lets you set your own numbers and watch where it lands.)

A one-sprint benchmark measures the wrong thing

For a memory tool, the research phase is an amortized investment: cost front-loaded, benefit running as long as the notes stay relevant. A benchmark that measures a single sprint — six tasks, well short of the ~14-task break-even — systematically overstates the tool's cost for any team that keeps working in the same codebase area. The fix isn't a better number. It's reporting the two phases separately and letting the reader place their own usage on the amortization curve.
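The break-even arithmetic is short enough to check by hand, but a small sketch makes it easy to rerun with other figures. The inputs are the dollar amounts from the phase table above; the variable names are mine, and the model assumes the per-task saving stays constant as more tasks reuse the notes.

```python
import math

# Figures from the phase table (USD).
research_vanilla, research_vectr = 1.36, 2.63
impl_vanilla, impl_vectr = 2.50, 1.97
tasks_per_sprint = 6

# One-time cost of building the notes, relative to a vanilla research session.
research_overhead = research_vectr - research_vanilla
# Recurring saving on every implementation task that reuses the notes.
saving_per_task = (impl_vanilla - impl_vectr) / tasks_per_sprint

break_even = research_overhead / saving_per_task
first_cheaper_task = math.floor(break_even) + 1

print(f"overhead ${research_overhead:.2f}, saving ${saving_per_task:.3f} per task")
print(f"break-even after {break_even:.1f} tasks; vectr is cheaper from task {first_cheaper_task}")
```

On these inputs the cross-over lands between the fourteenth and fifteenth task, which is why a six-task sprint always shows vectr in the red.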

Part 2
The Numbers That Answer The Question
04

The Implementation Phase Numbers

The implementation sessions are where "is vectr helping?" has a clean answer, because they isolate the spend-the-map side of the ledger from the build-the-map side. Across all six tasks combined:

| Metric | Vanilla | Vectr | Delta |
| --- | --- | --- | --- |
| Cost | $2.50 | $1.97 | −21% |
| Wall time | 17.6 min | 13.5 min | −24% |
| Turns | 123 | 94 | −24% |
| Read + Bash calls | 102 | 62 | −39% |

The Read+Bash reduction is the most informative line. Forty fewer file reads and grep calls across six tasks. That's where the time and cost actually come from — not the model writing shorter answers, but skipping navigation. Cost and time follow the tool-call count because tool calls are what drive both: each Read brings back tokens you pay for, each one costs a round trip of wall time. When navigation drops 39%, cost dropping 21% and time dropping 24% is the consequence, not a coincidence.

Now the per-task breakdown — the re-discovery tax (Read + Bash before the first write) for each task, vanilla versus vectr:

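| Task | Vanilla | Vectr | Delta |
| --- | --- | --- | --- |
| debug_gc_finalizer | 16 | 6 | −62% |
| feature_dict_pop_last | 13 | 3 | −77% |
| cross_session_set_cartesian | 23 | 9 | −61% |
| debug_descriptor_priority | 6 | 6 | 0% |
| cross_session_bytes_find_all | 13 | 2 | −85% |
| cross_session_list_rotate | 21 | 16 | −24% |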
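The per-task deltas can be recomputed directly from the call counts, which is a cheap sanity check on any table like this. The sketch below uses only the numbers in the table; the helper name is illustrative.

```python
# Re-discovery tax per task: (vanilla, vectr) Read + Bash calls before first write.
rediscovery = {
    "debug_gc_finalizer": (16, 6),
    "feature_dict_pop_last": (13, 3),
    "cross_session_set_cartesian": (23, 9),
    "debug_descriptor_priority": (6, 6),
    "cross_session_bytes_find_all": (13, 2),
    "cross_session_list_rotate": (21, 16),
}


def pct_change(before: int, after: int) -> float:
    """Relative change from the vanilla count to the vectr count, in percent."""
    return (after - before) / before * 100


for task, (vanilla, vectr) in rediscovery.items():
    print(f"{task:30s} {vanilla:3d} -> {vectr:3d}  {pct_change(vanilla, vectr):+.0f}%")

# Tasks where the vectr agent needed strictly fewer navigation calls.
improved = [task for task, (vanilla, vectr) in rediscovery.items() if vectr < vanilla]
print(f"{len(improved)} of {len(rediscovery)} tasks improved")
```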

Five of six tasks show meaningful reduction. One — debug_descriptor_priority — shows nothing. That zero is more interesting than any of the wins, and it gets its own section. First, the demo lets you see the spread.

Interactive Demo
Per-Task Re-Discovery, Side by Side
See where the navigation savings come from and where they don't. Each task shows the re-discovery tax — file reads and greps before the first write — for the vanilla agent against the vectr agent, drawn to scale so the one task with no improvement is as visible as the four big wins.
[Bar chart: re-discovery tax per task, vanilla vs. vectr.]
Bars scale to the largest value (23, cross_session_set_cartesian vanilla). The flat pair on debug_descriptor_priority is the honest result — a task where the model's own knowledge already navigates efficiently.
05

The 0% Case: What Vectr Doesn't Help With

debug_descriptor_priority is a descriptor protocol bug — working out why __get__ fires at an unexpected priority relative to __getattribute__. The fix is a three-line change in a very specific part of the descriptor lookup chain. Vanilla's re-discovery was 6 calls. Vectr's was also 6. No improvement.

Why? My read: the descriptor protocol is one of the most thoroughly documented corners of Python's data model. Every senior Python engineer knows roughly where it lives in CPython source. The model's training knowledge is strong enough that it lands on the right file with high confidence on the first or second Read — not because it knows this codebase, but because it knows Python deeply. There's nothing for a search tool to discover that the model didn't already carry in.

There's a general principle here. Vectr's navigation advantage is largest when three things hold at once:

Vectr wins big
  • Codebase is large and unfamiliar
  • Naming isn't inferrable from domain knowledge
  • Code isn't in a well-known framework or stdlib
Vectr adds little
  • Well-known codebase the model has seen
  • Obvious, conventional naming
  • Deep training coverage of the exact area

debug_descriptor_priority sits squarely in the "Vectr adds little" column. The 0% is expected and honest. A tool that improved every task regardless of these conditions would be a red flag, not a feature: it would mean the benchmark wasn't measuring anything real.

Vanilla's re-discovery count predicts where vectr helps

The cleanest predictor is the vanilla re-discovery count: how many Read + Bash calls did the baseline agent need before its first write? If vanilla is low (≤5), the codebase was already easy to navigate and the search tools add little. If vanilla is high (≥12), vectr's improvement tends to scale with how unfamiliar the code is. Here, debug_descriptor_priority had vanilla = 6, just above the low band, and showed no improvement; cross_session_set_cartesian had vanilla = 23 and dropped 61%. Look at the baseline before you predict the win.
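As a rough illustration, the two thresholds can be turned into a first-pass filter. The cut-offs (≤5 and ≥12) come from the paragraph above; the middle "uncertain" band and the function name are my own assumptions, since the text does not say what to expect between them.

```python
LOW_MAX = 5    # at or below this, the baseline already navigates cheaply
HIGH_MIN = 12  # at or above this, there is real navigation cost to remove


def expected_benefit(vanilla_calls: int) -> str:
    """Bucket a task by its vanilla re-discovery count."""
    if vanilla_calls <= LOW_MAX:
        return "little"
    if vanilla_calls >= HIGH_MIN:
        return "likely"
    return "uncertain"  # assumption: the text defines no band between 5 and 12


# Vanilla re-discovery counts from the per-task table.
vanilla_counts = {
    "debug_gc_finalizer": 16,
    "feature_dict_pop_last": 13,
    "cross_session_set_cartesian": 23,
    "debug_descriptor_priority": 6,
    "cross_session_bytes_find_all": 13,
    "cross_session_list_rotate": 21,
}

for task, calls in vanilla_counts.items():
    print(f"{task:30s} vanilla={calls:2d}  benefit: {expected_benefit(calls)}")
```

The filter flags cross_session_list_rotate as "likely" even though it only improved by 24%, so a high baseline says there is room for a win, not that the notes will deliver it.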

Interactive Demo
When Does The Research Cost Pay Itself Back?
Find your own break-even point. Set how many implementation tasks you run against the same research notes, and adjust the two costs that drive the amortization: the one-time research overhead and the per-task implementation saving. The calculator shows total cost both ways and the task count where vectr pulls ahead.
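[Interactive calculator. Inputs and their defaults: implementation tasks reusing the notes (8), one-time research overhead ($1.27), saving per task ($0.088). Outputs: vanilla total (N tasks at baseline cost), vectr total (research once + N reduced tasks), and the break-even task count.]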
Baseline per-task cost is set to $0.417 (vanilla impl $2.50 ÷ 6 tasks). Defaults reproduce the CPython run: $1.27 research overhead, $0.088 saved per task (−$0.53 ÷ 6), which puts break-even just under 15 tasks. The model is linear — it assumes notes stay relevant across all N tasks, the optimistic end; stale notes push the real break-even later, not earlier.
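For readers without the interactive page, the same linear model fits in a few lines. This is a sketch of the model as described in the caption, not the page's actual script: the defaults are the CPython figures quoted there, and the function name is hypothetical.

```python
def sprint_totals(
    n_tasks: int,
    research_overhead: float = 1.27,   # one-time extra research cost, USD
    saving_per_task: float = 0.088,    # implementation saving per task, USD
    baseline_per_task: float = 0.417,  # vanilla implementation cost per task, USD
) -> tuple[float, float]:
    """Return (vanilla_total, vectr_total) for n_tasks reusing one set of notes."""
    vanilla_total = n_tasks * baseline_per_task
    vectr_total = research_overhead + n_tasks * (baseline_per_task - saving_per_task)
    return vanilla_total, vectr_total


for n in (8, 14, 15, 30):
    vanilla_total, vectr_total = sprint_totals(n)
    cheaper = "vectr" if vectr_total < vanilla_total else "vanilla"
    print(f"N={n:2d}  vanilla ${vanilla_total:.2f}  vectr ${vectr_total:.2f}  cheaper: {cheaper}")
```

With the defaults, vanilla is still cheaper at 14 tasks and vectr is cheaper at 15. Lowering `saving_per_task` to model notes that go stale pushes that cross-over later.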
06

The B9 Bug and Why Earlier Runs Lied

The benchmarks didn't all start from a correct implementation, and that matters for reading any number from this series.

In runs B3 through B6, vectr_recall was broken: it used SQL LIKE substring matching instead of semantic search over the vector store. So even though the research phase had stored detailed notes about all six task areas, the implementation sessions' recall calls came back empty or near-empty. Those sessions paid the overhead of MCP setup and recall calls and got none of the benefit. Predictably, they showed vectr costing more than vanilla on implementation — which looked like a product failure but was a single broken function.
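The failure mode is easy to reproduce in miniature. The sketch below stores two notes in an in-memory SQLite table and runs a LIKE query with a natural-language question, then ranks the same notes by a similarity score. The schema, notes, and query are hypothetical, and the token-overlap score is only a crude stand-in for vector similarity: real semantic recall would compare embeddings, which this example does not do.

```python
import re
import sqlite3

# Hypothetical research notes and recall query.
notes = [
    "listobject.c: list_resize over-allocates; ob_item holds the element pointers",
    "dictobject.c: popitem removes the most recently inserted entry",
]
query = "how does list resize allocate memory"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany("INSERT INTO notes VALUES (?)", [(n,) for n in notes])

# Substring matching: the whole query must appear verbatim inside a note.
like_hits = conn.execute(
    "SELECT body FROM notes WHERE body LIKE '%' || ? || '%'", (query,)
).fetchall()
conn.close()


def tokens(text: str) -> set[str]:
    """Lowercase alphabetic tokens; splits identifiers like list_resize."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(question: str, note: str) -> float:
    """Fraction of query tokens found in the note (stand-in for similarity)."""
    q = tokens(question)
    return len(q & tokens(note)) / len(q)


best = max(notes, key=lambda note: overlap(query, note))
print("LIKE hits:", len(like_hits))
print("best by similarity:", best)
```

The LIKE query returns nothing because no note contains the question as a literal substring, while any ranking by similarity still surfaces the list note. That is the shape of the pre-fix runs: notes present, recall empty.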

B7 was a full re-run after the B9 fix restored semantic recall. The difference is stark.

| Metric | Pre-B9 (broken recall) | Post-B9 (B7) |
| --- | --- | --- |
| Recall firing with results | 0 / 6 impl tasks | 4 / 6 impl tasks |
| Impl session cost vs vanilla | +8% to +25% | −21% |
| R+B calls vs vanilla | −5% to +15% | −39% |

The two tasks where recall still didn't fire have different explanations, and neither is a bug:

  • debug_descriptor_priority: vanilla re-discovery was only 6, so recall would have helped minimally even if it had fired. Nothing to recover.
  • cross_session_list_rotate: the research session stored notes, but the impl session's recall query didn't match them well enough. The notes were broad (general list internals); the task was specific (list.rotate() needed listobject.c's memory layout in detail). That's a recall precision problem, not a bug: the notes existed but weren't sharp enough.

Which is its own honest finding: vectr helps most when the research notes closely match the implementation tasks. Research that's too broad — storing general codebase knowledge instead of task-specific findings — dilutes the recall benefit. The granularity of what you write down is itself a variable that decides whether the tool pays off.

A bug can masquerade as a product verdict

For four benchmark runs, the data said "vectr makes implementation more expensive." That conclusion was wrong, and it was caused by one function using substring matching where it needed vector similarity. The lesson for anyone benchmarking their own tool: a bad number is a hypothesis about your code as much as a verdict on your idea. Before you trust a disappointing result, confirm the path it depends on actually works end to end.

07

The Tool Usage Breakdown

Across the six implementation sessions, the vectr agent's tool usage tells you exactly how the savings happen:

| Tool | Calls | Primary use |
| --- | --- | --- |
| vectr_recall | 4 | Session start — retrieve research notes. |
| vectr_status | 5 | Session start — check if notes exist. |
| vectr_search | 1 | Mid-session — find a specific function not in notes. |
| standard Read | 62 | File reads after navigation. |
| standard Bash (grep) | 0 | Replaced entirely by vectr_locate. |

The pattern is clean: vectr tools replace Bash (grep down to zero) and substantially cut Read. The dominant use is status + recall at session start, not mid-session search — which is exactly the intended design. Recall the research notes first, implement from them, only reach for vectr_search for the specifics the notes didn't cover.

The zero Bash calls deserve a closer look. Vanilla implementation sessions leaned on grep heavily for exact symbol finding — grepping function names, class definitions, imports. The vectr sessions used vectr_locate for the same job. The difference: vectr_locate returns the definition's location without reading the file, while grep is a Bash call that also drags back surrounding context you then pay to process. Same question, cheaper answer.

Part 3
The Limits Of What This Proves
08

What A Controlled Benchmark Can't Measure

Every controlled benchmark simplifies in ways real usage doesn't. Here's where I think this methodology is weakest — stated plainly, because a limitations section that buries the real gaps is just marketing with footnotes.

Fixed codebase version

Both agents work on the same frozen CPython. In real life the codebase moves under you, and stale notes turn into a liability — a note written against an older version may describe behavior that has since changed. The [STALE] marker helps, but it only fires when a file is renamed or deleted, not when its contents change. The right question for a stale note isn't "does the file still exist?" but "does the note still accurately describe the code?" — and that's much harder to detect automatically. In practice you have to build the habit of verifying a note against current code before trusting it for precise work.
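The gap between "the file still exists" and "the file still says what it said" can be shown with a content fingerprint. This is a hypothetical check, not how vectr's [STALE] marker works: it assumes a hash of the file is stored alongside each note when the note is written. A changed hash does not prove the note is wrong, only that it needs re-verifying against the current code.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """SHA-256 of the file's bytes at the moment a note is written."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def note_status(path: Path, stored: str) -> str:
    if not path.exists():
        return "stale"     # renamed or deleted: what a path-based check sees
    if fingerprint(path) != stored:
        return "reverify"  # same path, different contents: invisible to a path check
    return "fresh"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "listobject.c"
    src.write_text("static int list_resize(void);\n")
    stored = fingerprint(src)  # recorded when the note is saved
    status_unchanged = note_status(src, stored)

    src.write_text("static int list_resize(int hint);\n")  # contents change
    status_edited = note_status(src, stored)

    src.unlink()  # file removed
    status_deleted = note_status(src, stored)

print(status_unchanged, status_edited, status_deleted)  # fresh reverify stale
```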

Two agents, no human in the loop

This compares agent against agent. It does not measure human productivity. An engineer who's lived in a codebase for six months already carries the equivalent of research notes in their head; vectr's most direct benefit is the human-plus-AI workflow where the AI does exploration that would otherwise burn many turns. "Vectr makes developers X% more productive" is a stronger claim than this benchmark can support, and I'm not going to make it.

Six tasks in one codebase area

All six tasks live in CPython's core implementation, which shares key files and modules. So the research notes from one task are partially reusable for others — not by design, but because the code regions overlap. That inflates the apparent benefit of note reuse. A more rigorous version would use tasks in completely non-overlapping regions, to measure recall precision independently of accidental cross-task reuse.

No quality measurement

We measured cost, time, and tool calls. We did not measure correctness in any deep sense. Did both agents produce correct implementations? Did the vectr agent ever ship a bug because it trusted a stale recall instead of reading current code? In our sessions both agents passed the relevant CPython tests — but "passed tests" is a coarse correctness measure for a bug-fix task. I'd want task-specific human review before claiming quality held.

Cheaper and faster but wrong is worse than slow but right

A faster, cheaper session that produces wrong output is strictly worse than a slow, expensive one that produces correct output. Never read a cost or time reduction as a quality improvement without separately checking correctness. These benchmarks checked test passage, not human review — that's a real methodological gap, and pretending otherwise would undercut everything else here.

09

Why The Real Benchmark Is Adoption

The most honest thing I can say about controlled benchmarks for a tool like this: they're useful for debugging and for showing the tool doesn't make things worse, but they're not how you find out whether it actually helps in production.

The CPython and Camel runs were designed by me, run by me, with agents I configured. The research phase was written by me — so the quality of those notes reflects my sense of what's worth writing down. An independent user, on a codebase they actually work in, writing notes about things they actually need, will produce different notes, different recall patterns, different tasks. Their numbers won't be mine.

The right benchmark for a developer tool is many users, many codebases, over time, with self-reported productivity. I can't manufacture that. What I can do is make vectr cheap enough to try that real users run it on their own code and find out for themselves. The published numbers aren't the product — they're evidence the approach isn't broken. Adoption is what produces the signal that matters.

Two things make that experiment near-free to run:

Local model — no API cost, no data leaving the machine

The embedding model runs locally. The barrier to "try it on your proprietary codebase" is essentially zero, because you're not sending your code anywhere. For most teams that's the difference between "maybe later" and "I'll run it tonight."

MCP protocol — no per-editor plugin

One vectr server works with Claude Code, Cursor, VS Code + Copilot, Windsurf, Continue, and Cline because they all speak MCP. Integration cost is two lines of JSON config, not a plugin per editor.

So the bet is asymmetric. If vectr fits your situation — large, unfamiliar, unconventionally named code — the savings are real. If it doesn't — small, well-known, deep training coverage — you've lost the five minutes it took to install. That's a reasonable experiment, and it's a far better source of truth than any number I can publish.

10

Reading The Benchmark Numbers Honestly

Let me be direct about what I think the numbers show and don't.

What they show
  • On 5 of 6 CPython tasks, re-discovery dropped meaningfully (−24% to −85%)
  • Across all 6 impl sessions: cost −21%, time −24%, R+B −39%
  • The B9 semantic-recall fix was critical; without it, vectr added cost
  • The research vs impl distinction matters; +19% total is the wrong headline
What they don't show
  • Whether this generalizes to your codebase and task type
  • Whether quality holds (test passage only, not human review)
  • Long-term stale-note dynamics over weeks of real work
  • Human productivity — we measured agents, not human+AI

And to make the boundary concrete rather than abstract:

Expect least help
  • Well-known frameworks with deep training data (stdlib, popular internals)
  • Small codebases (under ~500 files) where grep is fast
  • Short tasks (under ~30 tool calls) where setup overhead dominates
Expect most help
  • First week in an unfamiliar codebase with non-obvious naming
  • Multi-session work on a large proprietary system where /compact fires
  • Cross-cutting tasks spanning three subsystems before you touch one
11

What I'd Measure Next

If I were extending this, four additions would matter more than another CPython run.

Quality scoring by human review

Test passage is necessary, not sufficient. I'd want at least two engineers independently assessing each output for correctness, idiomatic quality, and edge-case handling — the things a test suite quietly lets through.

Stale-note tracking

After four weeks of normal development, how many notes have gone stale? Does the [STALE] marker catch the important ones, given it only fires on path changes? How much does relying on a stale note degrade output? This is the gap between the benchmark's frozen codebase and real life, and it's the one I'd close first.

Independent user runs

Give vectr to five engineers who've never used it, on codebases they pick, with tasks they set — and collect their before/after observations without telegraphing the result I expect. That's the closest thing to the real benchmark that I could actually run.

Recall at scale

Our research sessions stored 15–20 notes per task area. Real long-term use might accumulate 200–500 notes over months. Does recall precision hold at that scale? Does the tag-and-priority system do enough to keep recall relevant when the haystack is an order of magnitude larger? And cross_session_list_rotate showed the failure mode worth characterizing precisely: what note granularity is needed for what task type?

Conclusion — Three Posts, Three Layers

Three posts, three layers of the same system, and now they close.

Part 1 — the indexer. AST-aware chunking, code-specific local embeddings, hybrid search, a symbol graph with five fallback strategies. The goal: find any function or concept in a 10,000-file codebase in under 20ms, in a single call.

Part 2 — working memory. Notes that survive /compact and session boundaries. Semantic recall that retrieves by concept, not substring. The correct framing — recall-cost avoidance, not token release — and what it took to get there after building the wrong thing first.

Part 3 — measurement. The research vs implementation distinction that makes the +19% total sprint cost misleading. The five of six tasks where re-discovery fell sharply, the one where it didn't, and why that one is more informative than the five that worked. The limitations that decide whether any of this applies to you.

The honest version of the value proposition: on large, unfamiliar codebases, vectr cuts the re-discovery tax an AI code editor pays on every implementation session. The research investment compounds across tasks. The implementation savings are real enough that somewhere past the dozen-or-so-task mark — fourteen on the CPython numbers — the math turns positive and stays that way.

Whether that's your situation — large and unfamiliar enough, enough tasks in the same area for the investment to pay off, the five-minute install worth the experiment — is the one thing I genuinely can't tell you from here. The benchmark shows the tool isn't broken. Adoption shows whether it works for you. That's the only honest place to end a series about measuring your own work.


References & Further Reading

Protocol & Tooling
  • Model Context Protocol Specification
    Anthropic · 2024
    The open standard behind the "one server, every editor" claim. Why vectr ships as an MCP server rather than per-editor plugins.
  • MemGPT: Towards LLMs as Operating Systems
    Packer et al. · arXiv:2310.08560
    The virtual-context framing that informed vectr's working-memory design and the thresholds for prompting note-saving before context degrades.