Building Vectr — Part 2 of 3

What /compact destroys, why you can't fix it by telling the AI to forget things, and how working memory actually works — three bugs included.


Session three of a bug hunt in CPython's garbage collector. Two sessions in, I had what felt like a solid map: the exact call chain from PyObject_GC_Del through the generational collector, the non-obvious invariant around finalizer ordering, the three files where the relevant logic lived. Then /compact fired.

The summary said something like: "we were investigating CPython's garbage collector, specifically the interaction between finalizers and the generational GC." Accurate. Useless. The exact function signatures were gone. The specific line numbers were gone. The invariant that took two sessions to understand — compressed to one sentence that had lost all the nuance. The next 20 minutes: re-reading files to rebuild what I already knew.

This post is about what I learned from that, and from the working memory system I built to prevent it. Part 1 covered the indexing layer — how Vectr finds things in a codebase semantically. This part covers what happens after you find something: how to keep the knowledge alive across session boundaries, why my initial design was wrong in a fundamental way, and what actually works.

Part 1
The Problem With /compact
01

What /compact Actually Destroys

Most people treat /compact as "clear the context to keep going." That framing is roughly correct but understates the damage. The issue isn't just that context gets shorter — it's that the compression is lossy in exactly the cases where being wrong is most expensive.

/compact works by asking the AI to summarize the current conversation, then replacing the full history with that summary. Token count drops from (say) 180,000 to 12,000. You continue from the summary. Here's what the summary doesn't preserve:

Exact function signatures

A summary might say "the function takes a path and a flag." The conversation had def process_workspace_changes(path: Path, db: Database, *, force: bool = False) -> list[ChangeResult]. The difference between those two descriptions is the difference between a valid call site and a runtime error.

Specific line numbers

"The resolver module" and /src/workspace/resolver.rs:214 are not the same precision. You can reconstruct the file path, but it costs you a tool call — and often another read of the surrounding code to re-establish context.

Non-obvious behavioral invariants

If you spent three turns establishing that acquire_lock() must be called before touching workspace metadata because there's a race condition with the filesystem watcher, that three-turn understanding might survive as "be careful with locking." The exact invariant — the one that matters when you're writing the code — is gone.

The reasoning chain

Sometimes the value of an exploration session isn't the final answer but the chain of observations that produced it. "We looked at X, noticed Y, which suggested Z" — this chain is what you need if the conclusion turns out to be wrong later. Summaries discard chains. They keep endpoints.

The asymmetry of summary loss

Summaries are fine for preserving topics and general direction — you don't need to remember what happened in turn 17 to keep making progress in turn 80. The trouble is that summaries fail specifically at exact signatures, line numbers, and subtle behavioral invariants — which is also where being wrong is most expensive. A summary of "be careful with locking" covers the topic. It doesn't tell you which function must be called first, or why, or what breaks if you get it wrong.

In the CPython scenario, re-establishing the finalizer ordering invariant from scratch means re-reading several files and re-following a non-obvious call chain — roughly 15–20 minutes of work that was already done. A note stored at the end of session two takes about a minute to write and ten milliseconds to retrieve. Session three starts with vectr_recall("garbage collector finalizer ordering") and picks up from actual precision rather than reconstructing from a fragment.

02

Why You Can't Tell the AI to Just Forget Things

When I started building Vectr's memory layer, I had a clean model: the AI finds something useful, stores it with vectr_remember, then drops the file from its context window. The note is 50 tokens. The file was 800 tokens. Net gain: 750 tokens freed for new content. I called this "context offload."

I built it this way. I wrote documentation describing it this way. I designed vectr_evict_hint entirely around it — a tool that identifies which retrieved chunks the AI had already "processed" and could therefore "drop."

It doesn't work. And the reason explains something important about how transformer models handle context.

The KV cache is append-only

Think of the transformer's memory as a lookup table it builds as it reads each token. For each token it processes, it computes a key-value representation that gets stored at each attention layer. Every subsequent token the model generates attends back to every previous token through these cached representations — that's how earlier context influences later output.

Once a token's representation is computed and cached, it stays until the context is cleared. There is no mechanism to evict specific tokens by instruction. "You can drop chunk X from your context window" is itself processed as tokens — added to the cache, not used to remove other entries from it.

A subtlety worth naming: the KV cache as described here is the model's internal representation, maintained server-side by the inference provider. What you see as "context window usage" is a count of tokens in the current conversation, not a direct readout of GPU memory. The principle holds regardless: every token in the conversation occupies a slot in the cache, and you cannot remove individual tokens from a running session without ending or compressing the whole thing.

KV cache memory cost per sequence

For a transformer with L layers, n_heads attention heads, d_head dimension per head, and a sequence of T tokens, the KV cache size is:

KV cache size = 2 × L × n_heads × d_head × T × bytes_per_float

The 2× is because you store both keys and values separately. For a representative mid-size model: L=32 layers, n_heads=32, d_head=128, T=50,000 tokens at fp16 (2 bytes):

2 × 32 × 32 × 128 × 50,000 × 2 = 26,214,400,000 bytes ≈ 26.2 GB

The cache grows linearly with sequence length T. No selective removal. The operations that genuinely reduce context are: end the session (total loss), use /compact (precision loss), or rely on provider-side prefix caching — which stores stable prefix representations like system prompts to avoid recomputing them, but doesn't remove anything from your active context budget.
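To make the linear growth concrete, here is a small sketch that evaluates the formula for the same representative configuration at a few sequence lengths. The function name kv_cache_bytes is my own, and the model shape is the illustrative one from the text, not any specific production model.

```python
def kv_cache_bytes(layers: int, n_heads: int, d_head: int, tokens: int,
                   bytes_per_float: int = 2) -> int:
    """KV cache size for one sequence, following the formula in the text."""
    # Factor of 2: keys and values are stored separately at every layer.
    return 2 * layers * n_heads * d_head * tokens * bytes_per_float


# Representative mid-size shape from the text: L=32, n_heads=32, d_head=128, fp16.
per_token = kv_cache_bytes(layers=32, n_heads=32, d_head=128, tokens=1)
print(f"per token: {per_token:,} bytes")

for t in (10_000, 50_000, 100_000):
    gb = kv_cache_bytes(32, 32, 128, t) / 1e9  # decimal gigabytes
    print(f"T={t:>7,}: {gb:.1f} GB")
```

Under these assumptions each token costs 524,288 bytes (0.5 MiB), so doubling T doubles the cache. Nothing in the formula lets an instruction shrink T.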

I measured context window usage before and after sequences of vectr_remember + vectr_evict_hint calls: essentially unchanged. The hint was adding tokens to the cache while accomplishing nothing at the context management level. In some cases it made things marginally worse.

This misconception cost real development time. I built vectr_evict_hint to fire automatically when cumulative retrieved token count crossed a threshold, expecting the AI would see the hint, store notes, and drop files. What actually happened: the AI would see the hint, store notes (genuinely useful!), and then also keep the files in context (because nothing dropped them). Benchmarks with this framing baked in gave confusing numbers — sometimes Vectr cost more than the baseline, even on tasks where it should help. The accumulated search results staying in context without any actual eviction was part of that.

The context offload myth

Any tool or documentation claiming "store to external memory to free context budget" is describing something the system cannot deliver. Tokens in a live context window cannot be selectively evicted. Working memory tools are genuinely valuable — but not for freeing active context. Building around that claim confuses your benchmarks and misleads anyone using the tool.

Part 2
What Working Memory Actually Does
03

Three Tiers of Value, Ordered by What Matters

Once I dropped the context-offload framing, the actual value of vectr_remember became clear. It operates on three time horizons. The value increases as you move further in time — and the first tier is honestly the least important one.

Tier 1
In-session re-read avoidance
Within a single session, before any /compact: recalling a stored note costs ~50 tokens instead of re-reading the original file at ~600 tokens. Real savings, but the file is still sitting in your context window anyway. Reduces redundant reads; doesn't change the fundamental context budget situation.
Genuinely useful, but not the reason to build this.
Tier 2
/compact survival
When /compact compresses the conversation, notes stored on disk (SQLite + ChromaDB) are untouched. Exact signatures and behavioral invariants survive verbatim — not as a summary, not as a paraphrase. The session resumes from actual precision.
This is where the system earns its cost.
Tier 3
Cross-session persistence
Between separate sessions — the editor closed and reopened — the AI starts with nothing. Notes survive. A new session calling vectr_status() + vectr_recall() recovers findings from sessions ago without re-reading a single file.
Each session builds on the ones before it.

Tier 2 is the one that makes a concrete difference in practice. When /compact fires mid-investigation, there are two scenarios. Without notes: the session resumes from a lossy summary, and you spend time re-establishing the precise understanding you had before. With notes: vectr_status() shows notes exist, vectr_recall("finalizer ordering") retrieves the exact invariant you stored, and the session continues from where the understanding actually was — not where the summary says it was.
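The difference between the two scenarios can be sketched with nothing but the standard library. This is a minimal illustration, assuming a single SQLite table; the schema and the helper names open_store, remember, and all_notes are hypothetical and are not Vectr's actual storage code.

```python
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a tiny on-disk note store. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY, content TEXT NOT NULL, priority TEXT NOT NULL)"
    )
    return conn


def remember(conn: sqlite3.Connection, content: str, priority: str = "medium") -> int:
    cur = conn.execute(
        "INSERT INTO notes (content, priority) VALUES (?, ?)", (content, priority)
    )
    conn.commit()  # the note is on disk from this point on
    return cur.lastrowid


def all_notes(conn: sqlite3.Connection) -> list[str]:
    return [row[0] for row in conn.execute("SELECT content FROM notes ORDER BY id")]


db_path = os.path.join(tempfile.mkdtemp(), "notes.db")

# Before /compact: store the exact invariant, verbatim.
conn = open_store(db_path)
remember(conn, "acquire() must be called BEFORE touching workspace metadata", "high")
conn.close()

# /compact replaces the conversation with a lossy summary...
conversation = ["summary: be careful with locking"]

# ...but the stored note is untouched and comes back word for word.
conn = open_store(db_path)
print(all_notes(conn))
conn.close()
```

The same reopen step is what a brand-new session does, which is why Tier 2 and Tier 3 share one mechanism.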

Tier 3 compounds in a way that's easy to underestimate. The first session on a complex codebase pays the discovery cost. The second session pays it again if there are no notes, or skips it almost entirely if there are. By the tenth session, a well-maintained note store is a persistent model of the codebase that makes every session faster. The initial investment in careful note-taking earns compound returns.

Analogy — The surgeon's notes

A surgeon takes detailed notes before starting a complex procedure. Halfway through, an emergency calls them away for two hours. When they return, they have two options: (a) their notes are on the desk, with exact measurements, named vessels, and where they left off; or (b) a colleague wrote a summary: "patient is partially through a vascular procedure, some complications noted." Option (b) is dangerous. Option (a) lets them continue precisely. vectr_remember is option (a). /compact without notes is option (b).

Why Tier 3 compounds

The first session on a codebase pays the discovery cost once. The second benefits from the first session's notes. The tenth session benefits from nine. Every subsequent session reclaims work that was already done, without re-reading files. This is the math that makes the research-phase overhead worthwhile — the upfront cost amortizes across every subsequent session on the same codebase.

04

What to Store and How to Store It

The design of the notes themselves matters as much as the fact that they exist. Two principles drive everything else.

Don't store file pointers

"See resolver.rs:214 for the lock implementation" is a bad note. Why: it gives you a file path and line number. File paths change during refactoring. Line numbers change with every edit. A note from three days ago that says "line 214" will be wrong by the time you use it if anyone has touched the file.

More fundamentally: a pointer hasn't captured what you learned. It's a reference. When you recall it, you still have to read the file to extract the actual knowledge. You've stored the address, not the content at the address.

Store the finding itself

A good note, using the same codebase as an example:

WorkspaceLock: defined at resolver.rs:214 (as of 2026-06-08)
- acquire(): blocks if .vectr_lock exists; writes current PID + timestamp
- release(): validates PID match before deleting lock file
  (returns Err if mismatch — this is intentional, not a bug)
- CRITICAL: acquire() must be called BEFORE touching workspace
  metadata. The filesystem watcher reads metadata; touching it
  without holding the lock fires an invalid re-index.
  This caused the race condition in issue #1247.

Key callsites: workspace.rs:89 (init), daemon.rs:203 (shutdown)

This note is about 120 tokens. Reading the relevant files to reconstruct this knowledge would cost 600+ tokens plus two turns. The note captures the actual insight — the non-obvious invariant about lock order — not just a pointer to where to find it. When vectr_recall surfaces this in a later session, the AI continues from the right understanding immediately.

Priority and tags are not cosmetic

vectr_remember takes priority ("high", "medium", "low") and tags (a list of strings). Priority affects recall ordering: high-priority notes rank higher when multiple notes match a query with similar relevance scores. In a long investigation where many findings are equally relevant, high-priority notes surface first — the ones you explicitly flagged as critical.

Tags enable filtered recall. vectr_recall(query="locking", tags=["concurrency"]) returns only notes tagged with "concurrency" that semantically match the query. On a codebase with hundreds of stored notes accumulated over months of work, filtering by subsystem tag makes recall precise without returning results from unrelated parts of the codebase.

content: The actual finding (function behavior, invariants, non-obvious patterns). Not a file path.
tags: Subsystem labels, e.g. ["locking", "workspace", "concurrency"]. Used for filtered recall in large note stores.
priority: "high" for invariants that are easy to violate, "medium" for behavioral descriptions, "low" for supporting detail.
note_id: Unique identifier assigned at storage time. Returned in every recall response alongside the note content. Used with vectr_forget(note_id) to delete the note when it goes stale after a refactor.
Part 3
The Bugs That Shaped the Design
05

The B9 Bug: When Recall Doesn't Recall

For several early benchmark runs, vectr_recall was firing in implementation sessions but returning nothing useful. The benchmark showed 0 relevant recall results across 5 separate implementation sessions on CPython tasks — even though the research session had stored detailed notes about exactly the functions being modified.

The root cause took some logging to surface: recall was using SQL LIKE queries against a text field, not semantic search against the vector store.

Python — the broken recall implementation (pre-B9)
def recall(query: str) -> list[Note]:
    return db.execute(
        "SELECT * FROM notes WHERE content LIKE ? LIMIT 20",
        (f"%{query}%",)
    ).fetchall()

SQL LIKE is substring matching. vectr_recall("garbage collector finalizer ordering") would only return notes that contained the exact string "garbage collector finalizer ordering" somewhere in their text. A note about PyObject_GC_Del describing finalizer behavior — stored with different wording in a different session — wouldn't match unless it happened to contain those exact words in that exact order.
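The failure is easy to reproduce with an in-memory SQLite database. This is a standalone sketch: the table, the like_recall helper, and the placeholder note text are all invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (content TEXT)")
conn.execute(
    "INSERT INTO notes VALUES (?)",
    ("PyObject_GC_Del: details of finalizer behavior go here (placeholder note text)",),
)


def like_recall(query: str) -> list[str]:
    """Substring recall, the same shape as the pre-B9 query."""
    rows = conn.execute(
        "SELECT content FROM notes WHERE content LIKE ?", (f"%{query}%",)
    ).fetchall()
    return [r[0] for r in rows]


# A natural-language query that is clearly about this note, but not a substring of it.
print(like_recall("garbage collector finalizer ordering"))  # → []

# Only a literal substring of the stored text gets a hit.
print(len(like_recall("finalizer behavior")))  # → 1
```

The first query is what an AI would plausibly ask in a later session, and it returns nothing even though the relevant note is sitting in the table.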

The fix: use the ChromaDB vector store for recall, not the SQL table. Notes are embedded when stored and retrieved by semantic similarity when recalled.

Python — the correct recall implementation (post-B9)
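def recall(query: str, tags: list[str] | None = None) -> list[Note]:
    # Semantic search: Chroma embeds the query text and returns the nearest notes.
    results = chroma_collection.query(
        query_texts=[query],
        n_results=10,
        where={"tags": {"$in": tags}} if tags else None,
    )
    # query() returns a dict of parallel lists, one inner list per query text,
    # so unpack the hits for the single query instead of iterating the dict itself.
    hits = zip(results["ids"][0], results["documents"][0], results["metadatas"][0])
    # Assumes Note.from_chroma accepts (note_id, document, metadata).
    return [Note.from_chroma(*hit) for hit in hits]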

The impact was immediate and large. In the full CPython re-run after B9 was fixed, vectr_recall fired in 4 of 6 implementation sessions with relevant results, compared to 0 of 6 in earlier runs. Per-task re-discovery costs dropped correspondingly.

This bug sat undetected for several benchmark runs because the initial benchmark design didn't make empty recalls visible. It was only when I added per-tool logging — "vectr_recall called N times, N empty responses" — that the pattern became obvious. The right signal for a recall system is not just "how many times was it called" but "how many times did it return something useful."
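The logging that exposed the bug does not need to be elaborate. Here is a minimal sketch of a wrapper that counts calls and empty responses; the RecallStats class is hypothetical, and the stand-in recall function simply returns nothing, the way the pre-B9 implementation effectively did.

```python
from collections import Counter
from typing import Callable


class RecallStats:
    """Wraps a recall function and counts calls and empty responses."""

    def __init__(self, recall_fn: Callable[[str], list]):
        self._recall_fn = recall_fn
        self.counts = Counter()

    def __call__(self, query: str) -> list:
        results = self._recall_fn(query)
        self.counts["calls"] += 1
        if not results:
            self.counts["empty"] += 1  # the signal the original benchmark hid
        return results

    def summary(self) -> str:
        return (f"vectr_recall called {self.counts['calls']} times, "
                f"{self.counts['empty']} empty responses")


def broken_recall(query: str) -> list:
    return []  # stand-in for a recall path that never matches


recall = RecallStats(broken_recall)
recall("garbage collector finalizer ordering")
recall("PyObject_GC_Del call chain")
print(recall.summary())
```

When the two counters are equal, as they are here, the tool is being called but contributing nothing.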

SQL LIKE is not a recall mechanism

SQL LIKE requires the query string to be a literal substring of the stored content. Semantic recall needs to handle paraphrase, synonym, and concept-level matching — none of which substring search handles. If you're building any retrieval system where queries and stored content are written by different people (or by the same person at different times), SQL LIKE is not just suboptimal, it's functionally broken for most queries.

06

vectr_evict_hint: What It Actually Does After the Reframe

After fixing the context-offload misconception, I kept vectr_evict_hint in the tool set but reframed it completely. It now does something genuinely useful — just not what I originally thought.

What it actually does: it tracks two counters for the current session, the cumulative token cost of all code chunks Vectr has retrieved and the number of tool calls. When either crosses its threshold (defaults: 40,000 retrieved tokens, or more than 20 tool calls), it appends a hint to the next search response:

[vectr_evict_hint] You've retrieved ~42,000 tokens of indexed chunks
this session. The following chunks are fully indexed and re-retrievable
in <50ms — no need to re-read these files later:

  - resolver.rs:214  WorkspaceLock::acquire  (retrieved 8 turns ago)
  - resolver.rs:267  WorkspaceLock::release  (retrieved 8 turns ago)
  - workspace.rs:89  init call site          (retrieved 5 turns ago)

Consider calling vectr_remember now if you have key findings you
haven't stored yet.

The important word is "re-retrievable," not "droppable." The hint doesn't claim to free tokens. It tells the AI: these files are in the index, you can get them back in under 50ms if you need them, so don't re-read them out of caution when you already have what you need in a note or could re-search instantly.

This is a behavioral nudge, not a memory management operation. The value is discouraging the pattern where an AI re-reads files "just to make sure" — burning tokens on a re-read that would have yielded the same information as a fast re-retrieval.

The threshold values (40K retrieved tokens or 20 tool calls, whichever comes first) are calibrated against MemGPT (arXiv:2310.08560), which issues a memory-pressure warning at roughly 70% context fill. The "lost in the middle" degradation itself is documented separately by Liu et al. (arXiv:2307.03172). Using a disjunction (the first threshold reached triggers the hint) keeps the hint from firing too late on sessions that accumulate few large files but many small searches. The 40K token count is approximate: Vectr tracks retrieved chunk tokens but not system prompt or conversation history tokens, so 40K retrieved puts real fill somewhere between 60–80%, which is where you want the prompt to land.
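The either-threshold trigger can be sketched in a few lines. This is an illustration under assumptions: the EvictHintTracker class and its field names are hypothetical, and I have assumed the token threshold fires at 40,000 or above while the call threshold fires strictly above 20.

```python
from dataclasses import dataclass


@dataclass
class EvictHintTracker:
    token_threshold: int = 40_000   # cumulative retrieved chunk tokens
    call_threshold: int = 20        # tool calls in this session
    retrieved_tokens: int = 0
    tool_calls: int = 0

    def record(self, chunk_tokens: int) -> bool:
        """Record one retrieval; return True when the hint should be appended."""
        self.retrieved_tokens += chunk_tokens
        self.tool_calls += 1
        # Disjunction: whichever threshold is reached first triggers the hint.
        return (self.retrieved_tokens >= self.token_threshold
                or self.tool_calls > self.call_threshold)


# A few large retrievals trip the token threshold.
big = EvictHintTracker()
print([big.record(25_000) for _ in range(2)])

# Many small searches trip the call threshold instead.
small = EvictHintTracker()
fired = [small.record(100) for _ in range(21)]
print(fired.index(True) + 1)  # call number on which the hint first fires
```

With a token-only rule, the second session would reach 400K retrieved tokens' worth of searching long after the hint was useful; the call count catches it on the 21st call.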

Lost in the middle

Multiple studies have shown that LLM performance on retrieval tasks follows a U-shaped curve over context position: accuracy is highest for content at the beginning or end of the context window, and degrades significantly for content in the middle — sometimes by 30% or more. This isn't about context length per se; it's about positional attention patterns. The evict_hint threshold is set conservatively to prompt note-saving before the relevant information drifts into the degraded middle zone.

Part 4
The Mechanics of Actually Using It
07

The Save-Moment Problem

Knowing that notes are valuable doesn't make the AI store them. In early sessions, vectr_remember call rates were low — not because the AI couldn't see the tool, but because there was no clear trigger for "now is the moment to save this."

Saving notes is a habit humans develop from experiencing loss. An AI editor in session 1 of a new codebase has never lost anything to /compact here — it's optimizing for the task in front of it, not for a compression event that might happen three hours from now.

The pair pattern

The solution was making the save-moment explicit and concrete in the CLAUDE.md template that Vectr writes into a workspace. The original documentation said things like "save findings when they seem important." The revised version:

**The moment you find a key definition, pattern, or non-obvious detail:**
call vectr_remember(content, tags=[...], priority="high"|"medium"|"low")
— store the actual code block or finding, not a file pointer.

Treat every vectr_search or vectr_locate call as a **pair**: search,
then immediately save the key finding before your next retrieval.

If /compact runs later, the conversation summary loses exact signatures
and line numbers — your note does not.

"Treat every search call as a pair: search, then save." This turned out to be the most effective framing. Not "save when it feels important" (too vague, relies on judgment in the moment) but "pair every retrieval with a note" (concrete, actionable, a habit that fires on a specific trigger).

The benchmark sessions that stored the most notes also had the lowest re-discovery costs in subsequent tasks. The correlation is strong enough that note count at end of research phase is a reasonable proxy for how useful the subsequent implementation sessions will be.

When not to search: the SR-RAG finding

The pair pattern addresses "when to save." There's a complementary question that ended up in the same CLAUDE.md template: when to search at all. Before calling vectr_search on a well-known API or framework, the AI should first write out what it already knows — function signatures, key types, parameter names — and only search if genuine gaps remain after that.

This comes from a paper on SR-RAG (arXiv:2504.01018). The finding: models often retrieve information they already hold from training, adding token cost without improving answer quality. Writing out what you already know before searching reduces unnecessary calls by 26–40% on familiar codebases. On an unfamiliar codebase, the AI's training knowledge rarely applies, so every search turns up something new. On well-known frameworks like SQLAlchemy or React, training knowledge is often more accurate than indexed documentation. The verbalization step surfaces which situation you're actually in.

The CLAUDE.md as behavioral infrastructure

The CLAUDE.md file that Vectr writes into a workspace is as important as the tools themselves. A tool that exists but has no clear trigger for use is nearly invisible. The save-moment instruction — "pair every search with a save" — is the difference between a note-taking system that accumulates useful knowledge and one that sits mostly unused. Getting this instruction right required several iterations against benchmark data to find wording that produced consistent behavior.

08

Snapshots: Checkpointing an Investigation

Beyond individual notes, there's a use case for checkpointing entire session states. You've spent three sessions mapping a complex subsystem, accumulated 15 detailed notes. You want a named marker that represents "the state of my understanding of the locking subsystem as of 2026-06-08" — distinct from the ongoing notes you'll add in future sessions.

vectr_snapshot("lock-subsystem-mapped") seals the current note set under a named label with a timestamp. It doesn't delete notes — it timestamps and labels the current state. vectr_snapshot_list() at session start shows all checkpoints.

The typical workflow for a complex multi-session investigation:

  • 01
    Exploration sessions: Sessions 1–N. Explore, call vectr_remember on each key finding. Pair every search with a save.
  • 02
    Exploration complete: At the natural transition, call vectr_snapshot("exploration-complete"). This seals the note state for this phase under a named label.
  • 03
    Implementation sessions: vectr_status() → vectr_recall(query) → build on the snapshot. New findings get added as new notes; they don't overwrite the snapshot.
  • 04
    Implementation done: vectr_snapshot("implementation-done"). Now you have two named checkpoints marking the investigation arc.
  • 05
    Revisiting months later: vectr_snapshot_list() shows the investigation history with timestamps. vectr_recall retrieves from all snapshots. The snapshot timestamp tells you which notes were established before a given change.

The snapshot name becomes the identifier for "what did I know when I made this decision." If a refactor breaks something two months later and you need to reconstruct the reasoning, the snapshot timestamps tell you which notes predated the change and which were made after it. This is the kind of temporal context that summaries and logs don't give you.
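A rough model of that temporal lookup, assuming a snapshot is nothing more than a name, a timestamp, and the note IDs that existed at that moment. The Snapshot and SnapshotLog classes, the note ID ghi789, and the dates are hypothetical; Vectr's real storage may look quite different.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    name: str
    taken_at: datetime
    note_ids: tuple


class SnapshotLog:
    """Named, timestamped markers over a growing note set (hypothetical model)."""

    def __init__(self):
        self._snapshots = []

    def snapshot(self, name, note_ids, now=None):
        # Label the current note set; nothing is deleted or overwritten.
        snap = Snapshot(name, now or datetime.now(timezone.utc), tuple(note_ids))
        self._snapshots.append(snap)
        return snap

    def history(self):
        return sorted(self._snapshots, key=lambda s: s.taken_at)

    def notes_before(self, moment):
        """Note IDs in the latest snapshot taken at or before `moment`."""
        earlier = [s for s in self.history() if s.taken_at <= moment]
        return set(earlier[-1].note_ids) if earlier else set()


log = SnapshotLog()
log.snapshot("exploration-complete", ["abc123", "def456"],
             now=datetime(2026, 6, 8, tzinfo=timezone.utc))
log.snapshot("implementation-done", ["abc123", "def456", "ghi789"],
             now=datetime(2026, 6, 20, tzinfo=timezone.utc))

refactor = datetime(2026, 6, 15, tzinfo=timezone.utc)
print(sorted(log.notes_before(refactor)))  # → ['abc123', 'def456']
```

Anything not in that set was learned after the refactor, which is the distinction you need when reconstructing why a decision was made.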

09

When Notes Are Wrong: vectr_forget

Notes can be wrong. A note about a function's behavior written before a refactor may describe the old behavior. A note about an API endpoint may describe the pre-migration version. Stale notes are worse than no notes — they give the AI false confidence in outdated information.

vectr_forget takes a note ID and deletes it from the store. Every vectr_recall response includes note IDs so you can act on them inline:

[Note abc123, priority: high, tags: [locking, resolver]]
WorkspaceLock: acquire() blocks if .vectr_lock exists...
(saved 2026-05-14)

[Note def456, priority: medium, tags: [workspace, metadata]]
Workspace metadata is stored in ~/.cache/vectr/{hash}/meta.json...
(saved 2026-05-08)

If the locking implementation changed in a refactor since May 14: vectr_forget("abc123"), then store a fresh note. The workflow is: recall → verify against current code → forget the stale note → store the updated one.

Vectr also appends a [STALE] marker automatically when a file path extracted from a note's content no longer exists in the workspace. The extraction is a simple regex scan for paths that look like foo/bar.py or src/resolver.rs — not deep parsing. When those paths disappear from the file tree, the note gets flagged. This fires on file deletion and rename, both common refactoring outcomes. It's a prompt to verify, not an automatic deletion, and it only catches path-level staleness — not behavioral changes in files that kept their names.
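A path-level staleness check of this kind might look like the sketch below. The regex, the missing_paths and render helpers, and the temporary workspace are my own assumptions for illustration; the text only says that Vectr uses a simple regex scan, not which pattern.

```python
import re
import tempfile
from pathlib import Path

# Matches things shaped like foo/bar.py or src/resolver.rs: at least one
# directory segment, then a file name with an extension. Deliberately shallow.
PATH_RE = re.compile(r"(?:[\w.-]+/)+[\w.-]+\.\w+")


def missing_paths(note: str, root: Path) -> list[str]:
    """Paths mentioned in the note that no longer exist under the workspace root."""
    return [p for p in PATH_RE.findall(note) if not (root / p).exists()]


def render(note: str, root: Path) -> str:
    # A prompt to verify, not a deletion: the note text is kept intact.
    return f"{note} [STALE]" if missing_paths(note, root) else note


# Throwaway workspace containing src/resolver.rs but not foo/bar.py.
root = Path(tempfile.mkdtemp())
(root / "src").mkdir()
(root / "src" / "resolver.rs").write_text("")

print(render("Lock logic lives in src/resolver.rs:214", root))
print(render("Helper moved from foo/bar.py last week", root))
```

The first note is returned unchanged even if the lock logic at that line was completely rewritten, which is exactly the blind spot: the check sees file names, not behavior.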

What [STALE] does and doesn't catch

The [STALE] marker fires when a file path referenced in a note no longer exists (deleted or renamed). It does not fire when file content changes. A note about function behavior after a refactor that renamed the file gets flagged; a note about function behavior after a refactor that changed the logic without renaming gets no warning. Always verify behavioral notes against current code before acting on them for implementation work, regardless of whether [STALE] is present.

10

The Design Principle I'd Rephrase

Looking back at the original Vectr documentation for working memory, almost every sentence led with the wrong framing. "Store to vectr, then drop from context." "Offload findings to free context budget." "Context offload layer." Every one of these is technically false, and I shipped all of them.

The correct version is shorter: store findings now so you can recall them precisely later. Through /compact. Through a new session. Through however many turns separate the discovery from the moment you need to use it. The value is in the later. The storing is cheap — a minute of attention and a tool call. The recalling is where you get the hours back.

Everything else in the system — priority, tags, snapshots, the evict_hint threshold — is infrastructure for making that recall precise. None of it matters if the fundamental framing is wrong. And mine was wrong for longer than I'd like to admit.

If I were writing the documentation from scratch I'd lead with the /compact scenario. Not with "context offload" or "memory management" — with the specific moment when a detailed understanding of a complex system compresses into a three-sentence summary that can't be acted on. That's the moment where a stored note is worth exactly what it cost to write it.

Key decision
Semantic recall, not SQL LIKE
Notes are embedded when stored, retrieved by vector similarity when recalled. SQL substring matching breaks across paraphrase and synonym — which is exactly what cross-session recall needs to handle.
Key decision
Store the finding, not the pointer
File paths drift. Line numbers drift. The actual behavioral description — the invariant, the signature, the gotcha — stays valid across refactors. Notes should capture knowledge, not references to knowledge.
Key decision
Pair every search with a save
The CLAUDE.md prompt pattern that makes AI editors actually use working memory consistently. Concrete trigger, immediate action. Vague instructions to "save when important" don't produce consistent behavior.
Key decision
evict_hint as behavioral nudge
Not a memory management operation — a reminder that indexed chunks are re-retrievable instantly. Discourages unnecessary re-reads. Fires before the "lost in the middle" degradation zone based on MemGPT thresholds.

What's Next

The part I haven't answered yet: does any of this actually save time? Not in the abstract — in real benchmarks, on real codebases, compared against an AI editor with no indexing and no memory. The number I care about is not total session cost (which includes upfront research overhead that inflates the naive comparison) but re-discovery cost per task across repeated sessions on the same codebase.

Part 3 covers that measurement — including why the total sprint cost comparison is almost exactly the wrong metric to report, and what the data from CPython, Django, and Apache Camel actually showed once I separated research overhead from implementation savings.

If you want to try Vectr now, the tool page has setup instructions. The full working memory layer — vectr_remember, vectr_recall, vectr_snapshot, vectr_forget — is in the current release alongside the semantic search tools from Part 1.


References

Working Memory and Context Management
  • MemGPT: Towards LLMs as Operating Systems
    Packer et al. · arXiv:2310.08560 · 2023
    Source of the 70% context fill threshold for memory pressure warnings. Virtual context management framing. The paper that named the pattern Vectr's evict_hint threshold is calibrated against.
  • Lost in the Middle: How Language Models Use Long Contexts
    Liu et al. · arXiv:2307.03172 · 2023
    U-shaped performance curve over context position — accuracy highest at start and end, degrading for middle content. Empirical basis for why notes stored externally outperform summaries for precision-sensitive recall.
  • Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
    arXiv:2504.01018 · 2025
    SR-RAG finding: verbalizing parametric knowledge before retrieval reduces unnecessary search calls by 26–40% on familiar domains. Basis for the "verbalize before searching" instruction in the CLAUDE.md template.
Transformer Architecture
  • Attention Is All You Need
    Vaswani et al. · NeurIPS 2017
    Original transformer paper. The attention mechanism and KV cache structure described here are the core of every current LLM.
  • A Survey on LLM Acceleration Based on KV Cache Management
    arXiv:2412.19442 · 2024
    Comprehensive review of KV cache management techniques. Confirms the append-only nature of live context KV caches and covers provider-side eviction strategies that operate outside agent control.
Vectr