Why is the +19% total sprint cost a misleading headline for a memory tool?

Because it sums two costs with opposite dynamics. The research phase runs once and costs more with the tool (+94%) because the agent does extra work storing findings. The implementation phases cost less (−21%) and repeat every sprint. Adding them gives +19% for one sprint, but the research overhead is a one-time investment that amortizes. With a $1.27 one-time overhead and about $0.088 saved per task, break-even lands near 14 tasks reusing the same research notes; past that, total cost falls below the baseline.

What is the re-discovery tax in an AI coding session?

The re-discovery tax is the number of Read and Bash (grep) calls an AI agent makes before its first file write — the work spent figuring out where things are before doing the actual implementation. It is a first-order measure of navigation overhead. On unfamiliar codebases this can dominate a session; reducing it is the core thing a semantic index plus working memory aims to do.

Why did one of the six CPython tasks show zero improvement?

The debug_descriptor_priority task covers Python's descriptor protocol, one of the most extensively documented parts of the data model. The model's training knowledge is strong enough to navigate to the right file on the first or second read, so semantic search adds little over the model's own navigation. The advantage of a search tool is largest precisely where training coverage is weakest — large, unfamiliar, unconventionally named code.

What was the B9 bug and why did it make earlier benchmarks look bad?

Before the B9 fix, recall used SQL LIKE substring matching instead of semantic search over the vector store. Research notes existed, but implementation sessions got empty recall results — paying the tool overhead with none of the benefit. Earlier runs therefore showed the tool costing more. After the fix, recall fired with results on 4 of 6 tasks and implementation cost dropped 21% versus the baseline.

What can a controlled two-agent benchmark not measure?

It cannot measure human productivity (it compares agent vs agent), output quality beyond test passage, long-term behavior of stale notes as a codebase changes, or whether results generalize to your specific codebase. It runs on a frozen codebase version with tasks that share code regions, which inflates note reuse. These are real limitations, not footnotes.

Where does a semantic code search and memory tool help most and least?

It helps most on large, unfamiliar, proprietary codebases with non-obvious naming, and on multi-session work where /compact fires regularly. It helps least on small codebases under ~500 files where grep is fast, on well-known frameworks with deep training coverage, and on short tasks under ~30 tool calls where setup overhead dominates.

Why publish benchmark numbers if they cannot prove the tool helps in production?

Controlled benchmarks are good for debugging and for showing the tool does not make things worse — they are evidence the approach is not broken. They are not the same as proof it helps your team. The right signal comes from many users on many codebases over time. A local model with no API cost and an MCP server that works across editors makes that experiment cheap to run yourself.

Building Vectr · Series Part 3 of 3

What the benchmark numbers actually mean — and the methodology choices that decide whether a headline number is honest or a lie you told yourself.

Date 14 June 2026

A benchmark number without context is worse than no number at all. No number leaves you uncertain. A number without context leaves you confidently wrong — which is a strictly worse place to be.

The most dangerous move when you evaluate a developer tool is to run a benchmark and post the headline figure with no account of what it measures, what it ignores, and where the methodology quietly assumes things that won't hold for the person reading it. I've watched this fail in both directions. A tool advertises "+40% productivity" on a task so narrow the tool basically can't lose. Another reports "+3% cost reduction" while its own overhead is folded into the denominator, hiding the per-task savings that were the whole point.

This is the third and final post in the Building Vectr series. Part 1 built the semantic index — finding things in a codebase by meaning instead of keyword. Part 2 built the working memory that survives /compact and session boundaries. Now comes the part where I have to find out if any of it was worth building. I'll walk through exactly what Vectr was measured on, the calls I made about methodology, where those calls are shaky, and — the part most write-ups skip — what the numbers genuinely do not let you conclude.

Part 1

What We Were Trying To Measure

The Re-Discovery Tax

The hypothesis is simple. An AI code editor working on an unfamiliar codebase spends a real fraction of its time and token budget on navigation — finding where things are — rather than on implementation. Cut the navigation cost and you cut total session cost and time without touching output quality.

So the thing to measure is the re-discovery cost per implementation task: how much does the agent spend figuring out where things live before it starts writing code?

That is not the same as total session cost. Total session cost is everything — navigation, reasoning, writing, testing, verification. If your tool kills navigation overhead but adds its own overhead in search and recall calls, you have to measure both sides or you don't know whether the net is positive. This is the most common way benchmarks for memory tools mislead: they measure one side and report it as if it were the whole ledger.

The metric I settled on: Read + Bash calls before the first file write. I call it the re-discovery tax. It is not perfect. A sharp agent might read only a couple of large architectural files and then implement fast; a weaker one might read many small files and still be lost. But as a first-order proxy for "how much did you spend orienting versus doing the work," it captures the thing I care about, and it has the virtue of being unambiguous to count from a session transcript.
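As a minimal sketch of how that count could be taken, assume each session transcript has already been reduced to an ordered list of tool names. That list format is an assumption for illustration, as is treating `Write` and `Edit` as the calls that mark the first file write; a real transcript needs its own parsing step.

```python
from typing import Iterable

NAVIGATION_TOOLS = {"Read", "Bash"}   # orientation work
WRITE_TOOLS = {"Write", "Edit"}       # assumed to mark the first file write


def rediscovery_tax(tool_calls: Iterable[str]) -> int:
    """Count Read + Bash calls that happen before the first file write."""
    count = 0
    for name in tool_calls:
        if name in WRITE_TOOLS:
            break                      # implementation has started; stop counting
        if name in NAVIGATION_TOOLS:
            count += 1
    return count


# Hypothetical session: navigation, one recall, then the first edit.
session = ["Read", "Bash", "vectr_recall", "Read", "Edit", "Read", "Bash"]
print(rediscovery_tax(session))
```

Note that this counts every Bash call as navigation. If a session uses Bash for something other than grep before its first write, the count overstates the tax, which is one more reason to treat the metric as a first-order proxy.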

Secondary metrics: total session cost in dollars (from API pricing), wall time, and total turn count. These triangulate the primary metric — if re-discovery drops but cost doesn't, something is off.

Analogy — The plumber's first hour

Hire a plumber for an unfamiliar old house. The first hour is rarely plumbing. It's finding the shutoff valve, tracing which pipe feeds which fixture, working out why the previous owner ran a line through the wrong wall. The re-discovery tax is that first hour. A plumber who has the building's plans in hand skips most of it and starts cutting pipe. Vectr's bet is that a good index plus stored findings is the building's plans — and the benchmark is asking how much of that first hour it actually removes.

The Benchmark Design: CPython and Six Tasks

Why CPython

I picked CPython for four reasons. It's a real production codebase, not a toy: upward of 4,000 files spanning C, Python, and a little assembly. It's public, so there's no IP problem in using it as benchmark material. It's genuinely unfamiliar to most people: even seasoned Python engineers don't spelunk through CPython internals daily, so the "unfamiliar codebase" condition is satisfied for free. And the tasks need an understanding of non-obvious system behavior rather than pattern-matching on the obvious, which is exactly where navigation overhead runs highest.

I also ran earlier benchmarks on Apache Camel (Java, 5,856 files) to check the results weren't a CPython artifact. Camel came out directionally similar and actually a bit stronger — probably because it's larger and its naming conventions are less guessable than CPython's reasonably disciplined ones.

The six tasks

Six tasks chosen to span different kinds of implementation work — two bug investigations and four feature additions, with three of them designed to test cross-session recall:

Task	Type	What it involves
`debug_gc_finalizer`	Bug investigation	Finding a non-obvious race condition in Python's GC finalizer ordering.
`feature_dict_pop_last`	Feature addition	Adding `dict.pop_last()` — requires understanding CPython's dict internals.
`cross_session_set_cartesian`	Feature + test	Implementing set cartesian product across multiple module interactions.
`debug_descriptor_priority`	Bug investigation	Descriptor protocol priority ordering; requires knowing the Python data model.
`cross_session_bytes_find_all`	Feature addition	Adding `bytes.find_all()` to the CPython bytes object implementation.
`cross_session_list_rotate`	Feature addition	Adding `list.rotate()` — requires understanding list internals in `listobject.c`.

The cross_session_* tasks exist specifically to test cross-session memory: the research session runs once and stores notes, then each implementation session starts fresh and calls vectr_recall. That separation is the whole point, and it's the part of the design that most naive benchmarks get wrong — which is the next section.

Two agents: vanilla vs vectr

Each task runs with two agents on the same codebase, same version, same starting state, same task prompt:

Vanilla

Claude claude-sonnet-4-6
Standard file tools only: Read, Bash, Write, Edit
No vectr — navigates with grep and blind reads

Vectr

Same model, same standard tools
Plus vectr's 13 MCP tools
Workflow: status → recall → search → implement → remember

Both run under claude -p (non-interactive mode) with a tool-call ceiling so a confused session can't run away and poison the numbers. The vectr agent's intended loop is: call vectr_status() first, recall research notes if any exist, run vectr_search for initial navigation, implement, store key findings with vectr_remember, finish.

The Two-Phase Trap That Sinks Naive Benchmarks

Here is the piece worth slowing down on, because it's where most benchmarks for a memory tool go wrong before they collect a single number.

Vectr has a research phase and implementation phases. The research phase is one shared session: an agent explores the codebase, stores findings with vectr_remember, and exits. The implementation phases are six separate sessions, each calling vectr_recall at the start.

These two phases pull in opposite directions, and that's the whole subtlety.

The research phase costs more with vectr than vanilla. The agent does extra work — calling vectr_search to navigate, yes, but also vectr_remember to write down what it finds. More output tokens, more turns, higher cost. You're paying to build the map.

The implementation phases cost less with vectr. The agent calls vectr_recall once at the start, gets the relevant findings immediately, and skips the navigation the vanilla agent has to redo from scratch every time. You're spending the map.

Sum the two into a single "total sprint cost" and you get a number that blends those opposite dynamics into mush. The total can read positive — vectr costs more — even when the implementation phases show clear savings, because the one-time research overhead drowns them. That's exactly what happened in our runs.

Phase	Vanilla	Vectr	Delta
Research (1 session, paid once)	$1.36	$2.63	+94%
Impl (6 sessions, each repeating)	$2.50	$1.97	−21%
Total sprint	$3.86	$4.60	+19%

"Vectr costs +19%." That's the headline if you report the bottom row. It's also misleading, and here's the arithmetic that shows why. The research overhead is $1.27 ($2.63 − $1.36), paid once. The implementation saving is $0.53 per six-task sprint ($2.50 − $1.97), about $0.088 per task, and it recurs on every sprint that reuses those notes. The research investment doesn't get re-paid; the savings keep arriving. So the break-even is the point where the recurring per-task saving has clawed back the one-time overhead: $1.27 ÷ $0.088 ≈ 14.4 tasks. Past roughly fourteen or fifteen implementation tasks on the same notes, total cost with vectr drops below vanilla and keeps dropping. (Don't take my word for the cross-over: the calculator further down lets you set your own numbers and watch where it lands.)

A one-sprint benchmark measures the wrong thing

For a memory tool, the research phase is an amortized investment: cost front-loaded, benefit running as long as the notes stay relevant. A benchmark that measures a single sprint — six tasks, well short of the ~14-task break-even — systematically overstates the tool's cost for any team that keeps working in the same codebase area. The fix isn't a better number. It's reporting the two phases separately and letting the reader place their own usage on the amortization curve.
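The per-phase arithmetic is easy to reproduce. The sketch below recomputes the blended and per-phase deltas from the rounded dollar figures in the phase table; the variable names are mine, and because the inputs are rounded, a result can land slightly off a figure that was computed from unrounded costs.

```python
# Phase costs in dollars, copied from the phase table.
research = {"vanilla": 1.36, "vectr": 2.63}   # one session, paid once
impl = {"vanilla": 2.50, "vectr": 1.97}       # six sessions, repeats each sprint
TASKS_PER_SPRINT = 6


def pct_delta(base: float, new: float) -> float:
    """Relative change of new against base, in percent."""
    return (new - base) / base * 100


research_overhead = research["vectr"] - research["vanilla"]             # one-time
saving_per_task = (impl["vanilla"] - impl["vectr"]) / TASKS_PER_SPRINT  # recurring
break_even_tasks = research_overhead / saving_per_task

total_vanilla = research["vanilla"] + impl["vanilla"]
total_vectr = research["vectr"] + impl["vectr"]

print(f"impl delta:      {pct_delta(impl['vanilla'], impl['vectr']):+.0f}%")
print(f"total delta:     {pct_delta(total_vanilla, total_vectr):+.0f}%")
print(f"overhead:        ${research_overhead:.2f}")
print(f"saving per task: ${saving_per_task:.3f}")
print(f"break-even:      {break_even_tasks:.1f} tasks")
```

The two deltas have opposite signs, and only the blended one reaches the headline. Keeping `research_overhead` and `saving_per_task` as separate values is what lets a reader plug in their own task count.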

Part 2

The Numbers That Answer The Question

The Implementation Phase Numbers

The implementation sessions are where "is vectr helping?" has a clean answer, because they isolate the spend-the-map side of the ledger from the build-the-map side. Across all six tasks combined:

Metric	Vanilla	Vectr	Delta
Cost	$2.50	$1.97	−21%
Wall time	17.6 min	13.5 min	−24%
Turns	123	94	−24%
Read + Bash calls	102	62	−39%

The Read+Bash reduction is the most informative line. Forty fewer file reads and grep calls across six tasks. That's where the time and cost actually come from — not the model writing shorter answers, but skipping navigation. Cost and time follow the tool-call count because tool calls are what drive both: each Read brings back tokens you pay for, each one costs a round trip of wall time. When navigation drops 39%, cost dropping 21% and time dropping 24% is the consequence, not a coincidence.

Now the per-task breakdown — the re-discovery tax (Read + Bash before the first write) for each task, vanilla versus vectr:

Task	Vanilla	Vectr	Delta
`debug_gc_finalizer`	16	6	−62%
`feature_dict_pop_last`	13	3	−77%
`cross_session_set_cartesian`	23	9	−61%
`debug_descriptor_priority`	6	6	0%
`cross_session_bytes_find_all`	13	2	−85%
`cross_session_list_rotate`	21	16	−24%

Five of six tasks show meaningful reduction. One — debug_descriptor_priority — shows nothing. That zero is more interesting than any of the wins, and it gets its own section. First, the demo lets you see the spread.

Interactive Demo

Per-Task Re-Discovery, Side by Side

See where the navigation savings come from and where they don't. Each task shows the re-discovery tax (file reads and greps before the first write) for the vanilla agent against the vectr agent, drawn to scale so the one task with no improvement is as visible as the four big wins.

[Bar chart: re-discovery tax per task, vanilla vs vectr. Bars scale to the largest value (23, cross_session_set_cartesian vanilla).]

The flat pair on debug_descriptor_priority is the honest result: a task where the model's own knowledge already navigates efficiently.

The 0% Case: What Vectr Doesn't Help With

debug_descriptor_priority is a descriptor protocol bug — working out why __get__ fires at an unexpected priority relative to __getattribute__. The fix is a three-line change in a very specific part of the descriptor lookup chain. Vanilla's re-discovery was 6 calls. Vectr's was also 6. No improvement.

Why? My read: the descriptor protocol is one of the most thoroughly documented corners of Python's data model. Every senior Python engineer knows roughly where it lives in CPython source. The model's training knowledge is strong enough that it lands on the right file with high confidence on the first or second Read — not because it knows this codebase, but because it knows Python deeply. There's nothing for a search tool to discover that the model didn't already carry in.

There's a general principle here. Vectr's navigation advantage is largest when three things hold at once:

Vectr wins big

Codebase is large and unfamiliar
Naming isn't inferrable from domain knowledge
Code isn't in a well-known framework or stdlib

Vectr adds little

Well-known codebase the model has seen
Obvious, conventional naming
Deep training coverage of the exact area

debug_descriptor_priority sits squarely in the "Vectr adds little" column. The 0% is expected and honest. A tool that improved every task regardless of these conditions would be a red flag, not a feature: it would mean the benchmark wasn't measuring anything real.

Vanilla's re-discovery count predicts where vectr helps

The cleanest predictor is the vanilla re-discovery count: how many reads did the baseline agent need before its first write? If vanilla is low (≤5), the codebase was already easy to navigate and the search tools add little. If vanilla is high (≥12), vectr's improvement tends to scale with how unfamiliar the code is. Here, debug_descriptor_priority had vanilla = 6, just above the low band, and showed no improvement; cross_session_set_cartesian had vanilla = 23 and dropped to 9. Look at the baseline before you predict the win.
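To make the predictor concrete, here is a small sketch that buckets each task by its vanilla count and prints the reduction next to it. The counts come from the per-task table and the thresholds from the paragraph above; the "middle" band and the function names are my own additions for illustration.

```python
# (vanilla, vectr) re-discovery counts per task, from the per-task table.
REDISCOVERY = {
    "debug_gc_finalizer": (16, 6),
    "feature_dict_pop_last": (13, 3),
    "cross_session_set_cartesian": (23, 9),
    "debug_descriptor_priority": (6, 6),
    "cross_session_bytes_find_all": (13, 2),
    "cross_session_list_rotate": (21, 16),
}

LOW_MAX = 5    # "low" band: vanilla <= 5
HIGH_MIN = 12  # "high" band: vanilla >= 12


def band(vanilla: int) -> str:
    """Bucket a task by its baseline re-discovery count."""
    if vanilla <= LOW_MAX:
        return "low"
    if vanilla >= HIGH_MIN:
        return "high"
    return "middle"   # hypothetical label for counts between the two thresholds


def reduction_pct(vanilla: int, vectr: int) -> float:
    """Percentage of baseline navigation calls that vectr removed."""
    return (vanilla - vectr) / vanilla * 100


for task, (vanilla, vectr) in REDISCOVERY.items():
    print(f"{task:30s} vanilla={vanilla:2d} band={band(vanilla):6s} "
          f"reduction={reduction_pct(vanilla, vectr):.0f}%")
```

On these six tasks the high band holds every large win, but it also holds cross_session_list_rotate, which improved far less than its neighbours. The baseline count tells you where a win is possible, not that one will arrive.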

Interactive Demo

When Does The Research Cost Pay Itself Back?

Find your own break-even point. Set how many implementation tasks you run against the same research notes, and adjust the two costs that drive the amortization: the one-time research overhead and the per-task implementation saving. The calculator shows total cost both ways and the task count where vectr pulls ahead.

[Interactive calculator. Inputs: implementation tasks (default 8), research overhead paid once (default $1.27), saving per task (default $0.088). Outputs: vanilla total (N tasks at baseline cost), vectr total (research once + N reduced tasks), and the break-even count of tasks reusing the notes.]

Baseline per-task cost is set to $0.417 (vanilla impl $2.50 ÷ 6 tasks). Defaults reproduce the CPython run: $1.27 research overhead, $0.088 saved per task (−$0.53 ÷ 6), which puts break-even just under 15 tasks. The model is linear — it assumes notes stay relevant across all N tasks, the optimistic end; stale notes push the real break-even later, not earlier.
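If the widget isn't available where you're reading this, the same linear model fits in a few lines. This is a sketch of the model as described in the note above, not the page's actual script; the class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Amortization:
    baseline_per_task: float = 0.417   # vanilla impl $2.50 / 6 tasks
    research_overhead: float = 1.27    # one-time extra research cost with vectr
    saving_per_task: float = 0.088     # recurring saving per implementation task

    def vanilla_total(self, n_tasks: int) -> float:
        """N tasks at baseline cost."""
        return n_tasks * self.baseline_per_task

    def vectr_total(self, n_tasks: int) -> float:
        """Research overhead once, plus N tasks at the reduced cost."""
        reduced = self.baseline_per_task - self.saving_per_task
        return self.research_overhead + n_tasks * reduced

    def break_even(self) -> float:
        """Task count at which both totals are equal (may be fractional)."""
        if self.saving_per_task <= 0:
            return math.inf            # no saving: the overhead is never recovered
        return self.research_overhead / self.saving_per_task


model = Amortization()
for n in (8, 14, 15, 30):
    print(n, round(model.vanilla_total(n), 2), round(model.vectr_total(n), 2))
print("break-even:", round(model.break_even(), 1))
```

Because the break-even is fractional, vectr is still slightly behind at 14 tasks and ahead from the 15th. To model notes that go stale, lower `saving_per_task`; the break-even moves later, never earlier.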

The B9 Bug and Why Earlier Runs Lied

The benchmarks didn't all start from a correct implementation, and that matters for reading any number from this series.

In runs B3 through B6, vectr_recall was broken: it used SQL LIKE substring matching instead of semantic search over the vector store. So even though the research phase had stored detailed notes about all six task areas, the implementation sessions' recall calls came back empty or near-empty. Those sessions paid the overhead of MCP setup and recall calls and got none of the benefit. Predictably, they showed vectr costing more than vanilla on implementation — which looked like a product failure but was a single broken function.
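The failure mode is easy to reproduce in miniature. The sketch below is not vectr's code: the `notes` table, the note text, and the three-dimensional vectors are all invented for illustration, and the vectors are hand-written stand-ins for what an embedding model would produce. It only shows why a paraphrased query gets nothing back from substring matching while a similarity score still ranks the note highly.

```python
import math
import sqlite3

# Hypothetical note; the wording is invented for illustration.
NOTE = "list_resize in listobject.c grows ob_item; rotation can memmove the pointer array"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES (?)", (NOTE,))


def recall_like(query: str) -> list[str]:
    """Substring recall: matches only if the whole query appears verbatim in a note."""
    rows = conn.execute(
        "SELECT body FROM notes WHERE body LIKE ?", (f"%{query}%",)
    ).fetchall()
    return [body for (body,) in rows]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hand-made vectors standing in for real embeddings (an assumption, not model output).
note_vec = [0.9, 0.4, 0.1]
query_vec = [0.8, 0.5, 0.2]

query = "how do I shift list elements in place?"
print(recall_like(query))                     # paraphrase shares no verbatim substring
print(round(cosine(note_vec, query_vec), 3))  # similarity is still high
```

An agent rarely phrases its recall query in the exact words the research session used, so a LIKE-based recall returns empty on most realistic queries even when the right note is sitting in the store.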

B7 was a full re-run after the B9 fix restored semantic recall. The difference is stark.

Metric	Pre-B9 (broken recall)	Post-B9 (B7)
Recall firing with results	0 / 6 impl tasks	4 / 6 impl tasks
Impl session cost vs vanilla	+8% to +25%	−21%
R+B calls vs vanilla	−5% to +15%	−39%

The two tasks where recall still didn't fire have different explanations, and neither is a bug:

· debug_descriptor_priority — vanilla re-discovery was only 6, so recall would have helped minimally even if it had fired. Nothing to recover.
· cross_session_list_rotate — the research session stored notes, but the impl session's recall query didn't match them well enough. The notes were broad (general list internals); the task was specific (list.rotate() needed listobject.c's memory layout in detail). That's a recall precision problem, not a bug — the notes existed but weren't sharp enough.

Which is its own honest finding: vectr helps most when the research notes closely match the implementation tasks. Research that's too broad — storing general codebase knowledge instead of task-specific findings — dilutes the recall benefit. The granularity of what you write down is itself a variable that decides whether the tool pays off.

A bug can masquerade as a product verdict

For four benchmark runs, the data said "vectr makes implementation more expensive." That conclusion was wrong, and it was caused by one function using substring matching where it needed vector similarity. The lesson for anyone benchmarking their own tool: a bad number is a hypothesis about your code as much as a verdict on your idea. Before you trust a disappointing result, confirm the path it depends on actually works end to end.

The Tool Usage Breakdown

Across the six implementation sessions, the vectr agent's tool usage tells you exactly how the savings happen:

Tool	Calls	Primary use
`vectr_recall`	4	Session start — retrieve research notes.
`vectr_status`	5	Session start — check if notes exist.
`vectr_search`	1	Mid-session — find a specific function not in notes.
standard Read	62	File reads after navigation.
standard Bash (grep)	0	Replaced entirely by `vectr_locate`.

The pattern is clean: vectr tools replace Bash (grep down to zero) and substantially cut Read. The dominant use is status + recall at session start, not mid-session search — which is exactly the intended design. Recall the research notes first, implement from them, only reach for vectr_search for the specifics the notes didn't cover.

The zero Bash calls deserve a closer look. Vanilla implementation sessions leaned on grep heavily for exact symbol finding — grepping function names, class definitions, imports. The vectr sessions used vectr_locate for the same job. The difference: vectr_locate returns the definition's location without reading the file, while grep is a Bash call that also drags back surrounding context you then pay to process. Same question, cheaper answer.

Part 3

The Limits Of What This Proves

What A Controlled Benchmark Can't Measure

Every controlled benchmark simplifies in ways real usage doesn't. Here's where I think this methodology is weakest — stated plainly, because a limitations section that buries the real gaps is just marketing with footnotes.

Fixed codebase version

Both agents work on the same frozen CPython. In real life the codebase moves under you, and stale notes turn into a liability — a note written against an older version may describe behavior that has since changed. The [STALE] marker helps more than I first assumed: check_staleness flags a note when a referenced file's mtime is newer than the note, when the file's content hash stops matching (a SHA-256 check on .py/.c/.h/.go/.rs), or when the note has been explicitly superseded — so content changes are caught. Its real blind spot is the opposite of what you'd guess: if a referenced file is renamed or deleted, the check hits a missing path, swallows the error, and silently skips it rather than flagging it. And the deeper limit still stands — a file can change in a way that moves its hash without invalidating your note, or change meaning without moving either signal. The right question for a stale note isn't "does the file still exist?" but "does the note still accurately describe the code?" — and that's much harder to detect automatically. In practice you have to build the habit of verifying a note against current code before trusting it for precise work.
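A rough sketch of the three signals described above, with one deliberate change: a missing file is reported rather than skipped. The `Note` structure, the function names, and the reason strings are hypothetical, and the sketch hashes every referenced file instead of limiting the check to particular extensions. It is not vectr's check_staleness.

```python
import hashlib
import os
from dataclasses import dataclass, field


@dataclass
class Note:
    created_at: float                                  # note timestamp (epoch seconds)
    file_hashes: dict = field(default_factory=dict)    # path -> SHA-256 at write time
    superseded: bool = False


def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file's current contents."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def staleness_reasons(note: Note) -> list[str]:
    """Return the reasons a note looks stale; an empty list means no signal fired."""
    reasons = []
    if note.superseded:
        reasons.append("superseded")
    for path, old_hash in note.file_hashes.items():
        if not os.path.exists(path):
            # Deliberate difference from the behaviour described above:
            # a renamed or deleted file is reported instead of silently skipped.
            reasons.append(f"missing: {path}")
            continue
        if os.path.getmtime(path) > note.created_at:
            reasons.append(f"mtime newer: {path}")
        if sha256_of(path) != old_hash:
            reasons.append(f"hash changed: {path}")
    return reasons
```

Even with the missing-file case covered, an empty result only means no mechanical signal fired. It says nothing about whether the note still describes the code accurately, which is the harder question.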

Two agents, no human in the loop

This compares agent against agent. It does not measure human productivity. An engineer who's lived in a codebase for six months already carries the equivalent of research notes in their head; vectr's most direct benefit is the human-plus-AI workflow where the AI does exploration that would otherwise burn many turns. "Vectr makes developers X% more productive" is a stronger claim than this benchmark can support, and I'm not going to make it.

Six tasks in one codebase area

All six tasks live in CPython's core implementation, which shares key files and modules. So the research notes from one task are partially reusable for others — not by design, but because the code regions overlap. That inflates the apparent benefit of note reuse. A more rigorous version would use tasks in completely non-overlapping regions, to measure recall precision independently of accidental cross-task reuse.

No quality measurement

We measured cost, time, and tool calls. We did not measure correctness in any deep sense. Did both agents produce correct implementations? Did the vectr agent ever ship a bug because it trusted a stale recall instead of reading current code? In our sessions both agents passed the relevant CPython tests — but "passed tests" is a coarse correctness measure for a bug-fix task. I'd want task-specific human review before claiming quality held.

Cheaper and faster but wrong is worse than slow but right

A faster, cheaper session that produces wrong output is strictly worse than a slow, expensive one that produces correct output. Never read a cost or time reduction as a quality improvement without separately checking correctness. These benchmarks checked test passage, not human review — that's a real methodological gap, and pretending otherwise would undercut everything else here.

Why The Real Benchmark Is Adoption

The most honest thing I can say about controlled benchmarks for a tool like this: they're useful for debugging and for showing the tool doesn't make things worse, but they're not how you find out whether it actually helps in production.

The CPython and Camel runs were designed by me, run by me, with agents I configured. The research phase was written by me — so the quality of those notes reflects my sense of what's worth writing down. An independent user, on a codebase they actually work in, writing notes about things they actually need, will produce different notes, different recall patterns, different tasks. Their numbers won't be mine.

The right benchmark for a developer tool is many users, many codebases, over time, with self-reported productivity. I can't manufacture that. What I can do is make vectr cheap enough to try that real users run it on their own code and find out for themselves. The published numbers aren't the product — they're evidence the approach isn't broken. Adoption is what produces the signal that matters.

Two things make that experiment near-free to run:

Local model — no API cost, no data leaving the machine

The embedding model runs locally. The barrier to "try it on your proprietary codebase" is essentially zero, because you're not sending your code anywhere. For most teams that's the difference between "maybe later" and "I'll run it tonight."

MCP protocol — no per-editor plugin

One vectr server works with Claude Code, Cursor, VS Code + Copilot, Windsurf, Continue, and Cline because they all speak MCP. Integration cost is two lines of JSON config, not a plugin per editor.

So the bet is asymmetric. If vectr fits your situation — large, unfamiliar, unconventionally named code — the savings are real. If it doesn't — small, well-known, deep training coverage — you've lost the five minutes it took to install. That's a reasonable experiment, and it's a far better source of truth than any number I can publish.

Reading The Benchmark Numbers Honestly

Let me be direct about what I think the numbers show and don't.

What they show

On 5 of 6 CPython tasks, re-discovery dropped meaningfully (−24% to −85%)
Across all 6 impl sessions: cost −21%, time −24%, R+B −39%
The B9 semantic-recall fix was critical; without it, vectr added cost
The research vs impl distinction matters; +19% total is the wrong headline

What they don't show

Whether this generalizes to your codebase and task type
Whether quality holds (test passage only, not human review)
Long-term stale-note dynamics over weeks of real work
Human productivity — we measured agents, not human+AI

And to make the boundary concrete rather than abstract:

Expect least help

Well-known frameworks with deep training data (stdlib, popular internals)
Small codebases (under ~500 files) where grep is fast
Short tasks (under ~30 tool calls) where setup overhead dominates

Expect most help

First week in an unfamiliar codebase with non-obvious naming
Multi-session work on a large proprietary system where /compact fires
Cross-cutting tasks spanning three subsystems before you touch one

What I'd Measure Next

If I were extending this, four additions would matter more than another CPython run.

Quality scoring by human review

Test passage is necessary, not sufficient. I'd want at least two engineers independently assessing each output for correctness, idiomatic quality, and edge-case handling — the things a test suite quietly lets through.

Stale-note tracking

After four weeks of normal development, how many notes have gone stale? Does the [STALE] marker catch the important ones, given it keys on mtime and content hashes rather than meaning? How much does relying on a stale note degrade output? This is the gap between the benchmark's frozen codebase and real life, and it's the one I'd close first.

Independent user runs

Give vectr to five engineers who've never used it, on codebases they pick, with tasks they set — and collect their before/after observations without telegraphing the result I expect. That's the closest thing to the real benchmark that I could actually run.

Recall at scale

Our research sessions stored 15–20 notes per task area. Real long-term use might accumulate 200–500 notes over months. Does recall precision hold at that scale? Does the tag-and-priority system do enough to keep recall relevant when the haystack is an order of magnitude larger? And cross_session_list_rotate showed the failure mode worth characterizing precisely: what note granularity is needed for what task type?

Conclusion — Three Posts, Three Layers

Three posts, three layers of the same system, and now they close.

Part 1 — the indexer. AST-aware chunking, code-specific local embeddings, hybrid search, a symbol graph with five fallback strategies. The goal: find any function or concept in a 10,000-file codebase in under 20ms, in a single call.

Part 2 — working memory. Notes that survive /compact and session boundaries. Semantic recall that retrieves by concept, not substring. The correct framing — recall-cost avoidance, not token release — and what it took to get there after building the wrong thing first.

Part 3 — measurement. The research vs implementation distinction that makes the +19% total sprint cost misleading. The five of six tasks where re-discovery fell sharply, the one where it didn't, and why that one is more informative than the five that worked. The limitations that decide whether any of this applies to you.

The honest version of the value proposition: on large, unfamiliar codebases, vectr cuts the re-discovery tax an AI code editor pays on every implementation session. The research investment compounds across tasks. The implementation savings are real enough that somewhere past the dozen-or-so-task mark — fourteen on the CPython numbers — the math turns positive and stays that way.

Whether that's your situation — large and unfamiliar enough, enough tasks in the same area for the investment to pay off, the five-minute install worth the experiment — is the one thing I genuinely can't tell you from here. The benchmark shows the tool isn't broken. Adoption shows whether it works for you. That's the only honest place to end a series about measuring your own work.

References & Further Reading

Benchmarking & Evaluation

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez et al. · ICLR 2024
The reference point for evaluating coding agents on real repositories. Useful contrast: SWE-bench measures task resolution, not navigation overhead, which is the metric this post argues is missing for memory tools.

Lost in the Middle: How Language Models Use Long Contexts
Liu et al. · TACL 2024
Why piling navigation context into the window hurts rather than helps: the positional degradation that makes reducing the re-discovery tax worth doing in the first place.

Protocol & Tooling

Model Context Protocol Specification
Anthropic · 2024
The open standard behind the "one server, every editor" claim. Why vectr ships as an MCP server rather than per-editor plugins.

MemGPT: Towards LLMs as Operating Systems
Packer et al. · arXiv:2310.08560
The virtual-context framing that informed vectr's working-memory design and the thresholds for prompting note-saving before context degrades.

Vectr

Vectr — Semantic Code Search + Working Memory for AI Editors
Tool page · swapnanilsaha.com
Setup instructions, documentation for all 13 MCP tools, and the CLAUDE.md template referenced throughout this series.

Building Vectr, Part 1: Why grep Fails When You Don't Know the Keywords
swapnanilsaha.com · 9 June 2026
The indexing layer the implementation savings rest on: AST chunking, hybrid BM25+vector search, symbol graphs.

Building Vectr, Part 2: What /compact Destroys and How to Survive It
swapnanilsaha.com · 11 June 2026
The working-memory layer the research phase depends on, including the B9 recall bug whose fix prompted the re-run behind this post's numbers.