The Complete Guide to Text Embeddings, Vector Databases & LLMs

From a raw character stream to billion-parameter models — how machines understand language, search meaning, and generate text.

There's a question buried inside every AI product you've ever used: how does a machine understand what you mean? Not the characters on screen — the meaning. The difference between "I made a deposit at the bank" and "I sat by the river bank" is obvious to any human and was genuinely unsolved computer science for most of the 20th century.

This guide is about how we solved it. Starting from the ground up: why raw text is computationally intractable, how we convert it to numbers, how attention mechanisms give those numbers context, how embedding models learn the geometry of meaning, how vector databases search that geometry at billion-document scale, and how it all assembles into the RAG pipelines running inside most AI products today.

By the end you will understand — not just know, but understand — what happens mathematically and computationally every time an AI system searches, retrieves, or generates text. Seven interactive demos are embedded throughout so you can feel the concepts, not just read about them.

Part 01
The Foundation
01

The Problem with Text

If you want a computer to process an image, the path is clear: each pixel is a number, and you pass that grid of numbers through a neural network. The representation is compact and unambiguous. Text gives you none of that.

Consider: "I love the bank by the river" and "I deposited money at the bank." One word — bank — means completely different things. A child understands this without thinking. For a computer operating on discrete symbols, it's a representation nightmare.

The earliest computational approaches to language were brutally mechanical. Bag-of-words treated a document as a word-frequency histogram: "the cat sat on the mat" became {cat:1, mat:1, on:1, sat:1, the:2}. Word order destroyed. Context gone. Meaning invisible. Search for "joyful" and you'd miss every document that said "happy." TF-IDF improved recall by weighting rare words more heavily, but the fundamental problem remained — no concept of semantic similarity, only character matching.

N-grams helped a little. Tracking pairs and triples of consecutive words preserved some local structure. "New York" as a bigram at least stays together. But N-grams scale catastrophically: a vocabulary of 50,000 words produces 2.5 billion possible bigrams. Almost none appear in any given document. The representation becomes impossibly sparse.

The insight that changed everything: meaning is geometric. Words don't just have definitions — they have relationships to each other. Dog and puppy are close. Dog and cat are close but less so. Dog and democracy are far apart. If we could build a coordinate system where these distances were geometrically real — where similar concepts occupied nearby regions of some high-dimensional space — then we could do arithmetic on meaning. We could search for "happy" and retrieve "joyful." We could take the vector for "king," subtract "man," add "woman," and arrive at "queen."

That coordinate system is called an embedding. Everything in this guide flows from that one idea.

02

Tokenization

Before we can embed anything, we need to convert text into a form a model can ingest. That form is not characters, and it is not words. It's something in between: tokens.

Using individual characters creates sequences that are impossibly long. "Hello world" becomes ['H','e','l','l','o',' ','w','o','r','l','d'] — eleven elements to convey two words. Models struggle to learn word-level patterns from character streams. But using whole words has the opposite problem: English has hundreds of thousands of words, more when you include technical terms, proper nouns, inflections ("running", "ran", "runs"), and words from other languages. A vocabulary covering all of them would be unmanageably large, and any novel word would be entirely unknown to the model.

The solution is Byte Pair Encoding (BPE), originally a data compression algorithm adapted for language modeling. The algorithm is disarmingly simple:

  1. Start with individual characters as your vocabulary.
  2. Find the most frequently occurring consecutive pair of tokens in the training corpus.
  3. Merge that pair into a single new token. Add it to the vocabulary.
  4. Repeat 50,000 times.

After 50,000 merges, you have a vocabulary of roughly 50,000 token types. Common words like "the," "is," "running" become single tokens. Less common words split into recognizable subword pieces:

"tokenization"  →  ["token", "ization"]
"unbelievable"  →  ["un", "believ", "able"]
"ChatGPT"       →  ["Chat", "G", "PT"]
"2026"          →  ["2026"]        ← numbers often stay whole
"swapnanilsaha" →  ["sw", "ap", "nan", "il", "saha"]

Each token is assigned an integer ID — its index in the vocabulary. This is what the model actually receives: a sequence of integers. "The quick brown fox" might become [464, 2068, 7586, 21831]. The text has been translated into a language computers can work with natively.

Rule of thumb

In English text, 1 token ≈ 4 characters, or about ¾ of a word. "The quick brown fox" = 4 tokens. A typical paragraph of 100 words ≈ 133 tokens. Most LLMs are priced per token — knowing this helps you estimate costs.

Special tokens serve structural roles. [CLS] marks the start of a sequence for classification tasks. [SEP] separates two sequences. <|endoftext|> signals the end of a document in GPT-style models. The model learns what these mean during training.

The total number of tokens a model can process at once is its context window. GPT-4o has a 128,000-token context window. Claude 3 has 200,000. These numbers matter enormously for RAG — they determine how much retrieved text you can hand to the model at inference time.

Interactive · Demo 01

Token Explorer

Type any text and see it split into approximate tokens. Each colored pill is one token. Watch how common words stay whole while long or rare words get split into subword pieces.

⚠ This is an approximation — actual tokenization uses tiktoken (GPT) or SentencePiece (LLaMA). The split boundaries are illustrative, not exact.
03

From Tokens to Vectors

Integers work for indexing but they're useless for representing meaning. The integer 547 has no meaningful relationship to 546 or 548. We need representations that are continuous — where distance between representations reflects actual semantic distance.

The first step is the embedding matrix: a learnable table of shape [vocab_size × hidden_dim] — say 50,000 × 768. Each row is a vector of 768 floating-point numbers representing one token type. When token ID 547 enters the model, we simply look up row 547 and retrieve its 768-dimensional vector.

Initially, these rows are random. There's no reason the vector for "dog" (ID 2847) should be near the vector for "puppy" (ID 31092). That changes during training — the model adjusts every row of the embedding matrix in response to its errors, and over millions of training steps, tokens that appear in similar contexts get pushed toward each other in the embedding space.

Analogy

Think of the embedding matrix as a city with 50,000 addresses. Initially, addresses are assigned randomly. After training, buildings that serve similar purposes end up in the same neighborhood — not because someone planned it, but because the traffic patterns (language statistics) shaped where things settled.

Word2Vec and the King–Queen Analogy

In 2013, the Word2Vec paper demonstrated something remarkable: if you train a neural network to predict neighboring words from a target word (or vice versa), the learned vectors develop meaningful geometric structure without any explicit supervision.

The famous example: take the vector for "king," subtract the vector for "man," and add the vector for "woman." The result is closest, in the embedding space, to "queen." No one programmed gender into the model. The geometry fell out of the statistics of language — because "king" and "queen" appear in similar contexts, as do "man" and "woman," and the model learned to encode that parallel relationship as a consistent direction in vector space.

king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
running − run + walk ≈ walking

This showed that embedding spaces capture not just similarity but structure — relationships between concepts, not just proximity.

The Context Blindness Problem

Word2Vec gives each word exactly one vector, always. "I made a deposit at the bank" and "I fished at the river bank" — "bank" gets the identical vector in both sentences. Context is completely ignored. The model has no mechanism to distinguish which sense is meant.

This is not a minor limitation. In practice, most words have multiple senses. "Light," "right," "fair," "set" — these are famously ambiguous. A single fixed vector per word-type is a fundamental ceiling on representational quality. Overcoming it requires a model that can look at surrounding context before deciding what a word's representation should be. That mechanism is attention.

04

Context Changes Everything — Attention

The transformer's core innovation is the attention mechanism — a way for every token in a sequence to look at every other token and update its own representation based on what it sees. It's what enables "bank" in a river sentence to mean something completely different from "bank" in a financial sentence.

The Library Catalog Analogy

Imagine you're in a library looking for books about cooking. You have a query: "cooking techniques." Each book has a catalog entry with a key: its title, subject, and summary. You scan all the keys, score them by how well they match your query, and retrieve the values — the actual books — proportional to those scores. High-scoring books contribute more to what you take away; low-scoring books contribute almost nothing.

Self-attention works exactly like this, but every token plays all three roles simultaneously — it has a query, a key, and a value — and it attends to every other token in the sequence.

The Mathematics

For each token, three vectors are computed from its embedding by multiplying with three learned weight matrices: Q (query), K (key), V (value). Each has dimension d_k (often 64 per head). Then:

Self-Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

For each token i: compute dot product of Q_i with every K_j. Scale by √d_k. Apply softmax → weights that sum to 1. Take weighted sum of all V_j. That's the new representation for token i.

The scaling by √d_k is not arbitrary. In high dimensions, dot products grow large in magnitude. Large dot products fed into softmax produce nearly one-hot outputs — almost all weight on a single token. That kills the gradient signal during training. Dividing by √d_k keeps the inputs to softmax in a range where the function produces useful, distributed probability weights.

What Happens in Practice

In "I sat by the river bank," when the model processes the token "bank," its query vector asks "what context am I in?" The key vector for "river" matches strongly — so "river" gets high attention weight. The key vectors for "sat" and "by" match moderately. The final representation for "bank" is a weighted blend of all the value vectors, dominated by the values from "river" and the nearby context. The resulting vector reflects the geographical sense, not the financial one.

Feed the same model "I deposited money at the bank" and "bank" now attends strongly to "deposited" and "money." The identical input token "bank" produces a completely different output vector because the surrounding context is different. Context blindness: solved.

Multi-Head Attention

Running one attention operation captures one type of relationship. Transformers run H independent attention heads in parallel, each with different Q, K, V weight matrices, then concatenate the results. Each head can learn to focus on a different aspect — one head might track syntactic dependencies (subject-verb agreement), another semantic similarity, another coreference. The combined result is richer than any single head could produce.

Positional Encoding

There's a subtle problem: pure attention is position-agnostic. "The dog bit the man" and "The man bit the dog" have the same words and would produce the same attention scores without some way to encode position. Positional encodings — either fixed sinusoidal functions or learned vectors — are added to each token's embedding before the first attention layer, injecting information about where each token sits in the sequence.

Most modern models (LLaMA, Mistral) use RoPE (Rotary Position Embedding) — a more elegant approach that encodes relative position directly into the attention computation rather than adding absolute position vectors to embeddings.

Why Transformers Scale on GPUs

Attention computes all pairwise token interactions simultaneously via matrix multiplication. With N tokens, you compute one N×N matrix of attention scores in a single batched operation. This parallelism is exactly what GPU hardware is designed for — thousands of multiply-accumulate units doing identical work in parallel. It's why transformers replaced sequential architectures like LSTMs at scale.

Part 02
The Architectures
05

BERT vs GPT — Two Philosophies

Both BERT and GPT are transformers built on the attention mechanism. Both learn from vast corpora of text. The difference is architectural and purposeful — they were designed to solve different problems, and that design choice reverberates through everything they're good at.

Their shared ancestor is the 2017 paper "Attention Is All You Need," which introduced the transformer. BERT (2018) came after GPT-1 (2018), not before — a common misconception. Both are applications of the same foundational architecture in different configurations.

PropertyBERT (Encoder-only)GPT / LLaMA (Decoder-only)
Attention directionBidirectional — sees all tokensCausal — sees only past tokens
Training objectivePredict masked tokens (MLM)Predict next token
OutputContextual embedding per tokenNext token probability distribution
Can generate text?NoYes
Best forClassification, NER, embeddingsGeneration, reasoning, chat
ExamplesBERT, RoBERTa, DeBERTaGPT-4, LLaMA 3, Mistral, Gemma

Why Bidirectionality Matters for Embeddings

When you embed the sentence "The cat sat on the mat" and process the token "cat," a bidirectional model sees "The" before it and "sat on the mat" after it. The full context shapes the representation. A causal model like GPT, processing "cat," can only see "The" — the rest of the sequence hasn't arrived yet. The embedding is impoverished by design.

This is why embedding models — which need to produce a single vector capturing the full meaning of a text — are generally encoder-style, not decoder-style. They're built to understand, not to generate.

Analogy

Writing a novel: you can reference earlier chapters but can't peek at future ones — that's causal attention, by design for generation. Reading for comprehension: you can re-read the whole passage before answering a question — that's bidirectional, ideal for understanding. Same words, different process, different capability.

06

Embedding Models — Trained to Measure Meaning

A regular language model, even a BERT-style encoder, doesn't naturally produce good sentence-level embeddings. Its token-level representations are rich, but pooling them (averaging all token vectors) produces mediocre results for similarity search. The reason: the model was never explicitly trained to make similar sentences produce similar pooled vectors.

Dedicated embedding models fix this with a different training objective: contrastive learning. The setup is elegant. During training, the model sees pairs of texts:

  • Positive pairs: a question and its correct answer, a document and its summary, two paraphrases of the same sentence.
  • Negative pairs: random unrelated texts sampled from the corpus.

For each pair, embed both texts, compute their cosine similarity. The loss function pushes positive pairs toward similarity 1 and negative pairs toward 0. Over millions of training examples, the model learns to build an embedding space where semantic proximity = geometric proximity.

What Emerges

The resulting space has remarkable geometric structure. Synonyms cluster together. Antonyms sit far apart (though in a specific direction — "hot" and "cold" are more related than "hot" and "democracy"). Analogies form parallel lines. Semantic categories form convex regions. The King–Woman+Man=Queen arithmetic works because the model's contrastive training has aligned the "gender" direction consistently across the royalty cluster and the person cluster.

Dimensionality & Memory

text-embedding-3-small produces 1,536-dimensional vectors. At float32 (4 bytes each): 6KB per embedding. One million embeddings = 6GB. At float16 (2 bytes): 3GB per million. Most vector databases store in float32 or binary-quantized variants. Dimensionality is chosen to balance expressiveness against storage cost — 768 is minimum viable for most tasks, 3,072 is the current ceiling for top models.

Below is an interactive projection of ~50 words into 2D space. The actual embedding space has 1,536 dimensions — this is a conceptual visualization showing how semantic clusters would appear after dimensionality reduction. Click two words to see their similarity score.

Interactive · Demo 02

Word Embedding Space

A 2D projection of ~50 words showing semantic clustering. Words close together have similar meaning. Click any two words to see their similarity. Toggle the analogy arrows to see how parallel vectors encode relationships.

Click any word to select it, then click another to compute similarity.
Positions are illustrative 2D projections — real embeddings are 1,536-dimensional. Similarity scores here are based on 2D Euclidean distance in this visualization.
Part 03
Search at Scale
07

Chunking Strategies

Embedding models have a context window — a maximum number of tokens they can process in one pass. text-embedding-3-small handles up to 8,192 tokens. That sounds generous until you try to embed a 200-page legal contract or a 50,000-word technical manual. Both exceed it by orders of magnitude.

Even if the limit didn't exist, you wouldn't want to embed an entire document as one vector. A single vector representing 50,000 words has to simultaneously encode every topic in that document. When you search for "refund policy," your query vector must compete against the diluted signal of every other topic — shipping, returns, warranty, privacy, and everything else. The vector is too blurry to be useful.

Chunking solves both problems by splitting documents into smaller, focused pieces before embedding. The art is in choosing how.

Fixed-Size Chunking

The naive approach: split every 512 tokens, full stop. Simple, fast, predictable. Also potentially terrible — a chunk might contain the final clause of one sentence and the opening of the next, capturing neither cleanly. It's the equivalent of cutting pages out of a book at fixed line counts, indifferent to paragraph or chapter boundaries.

Sliding Window with Overlap

The workhorse strategy in production. Choose a chunk size (say 400 tokens) and a stride shorter than the chunk (say 200 tokens). Chunk 1 covers tokens 0–399. Chunk 2 covers tokens 200–599. Chunk 3 covers 400–799. Each chunk shares 200 tokens with its neighbors, so any sentence that falls near a boundary appears in full in at least one chunk. The overlap trades storage (more chunks) for retrieval robustness.

Recursive Character Splitting

Respects document structure without requiring semantic understanding. Attempt to split on \n\n (paragraph breaks) first. If the resulting pieces are still too large, try \n. Then . . Then . Stop at the smallest unit that fits within your size limit. Used by LangChain's RecursiveCharacterTextSplitter and most production chunking libraries.

Embedding-Based Semantic Chunking

The most expensive but most accurate approach. Embed every sentence individually. Compute cosine similarity between consecutive sentences. Where similarity drops sharply — the sentence pairs that are least related — start a new chunk. You need an embedding model running before you build the index, just for preprocessing. In practice, this doubles the time and cost of indexing but produces substantially better chunk boundaries for heterogeneous documents.

Common Mistake

Chunks that are too large dilute the embedding signal. Chunks that are too small lose context (a sentence fragment is harder to embed meaningfully than a full paragraph). A useful starting point: 200–400 tokens per chunk with 10–20% overlap. Evaluate retrieval quality on your actual data rather than using defaults blindly.

Metadata Prepending

A simple technique with outsized impact: prepend the document title, section header, or other metadata to each chunk before embedding. Instead of embedding "It must be submitted within 30 days," embed "Refund Policy — Processing Timeline: It must be submitted within 30 days." The chunk now contains its own context, so retrieval is more accurate even when the user's query doesn't contain the exact section name.

Interactive · Demo 03

Chunk Explorer

Paste any text and adjust chunk size and overlap to see how different strategies split it. Watch how overlap prevents context loss at boundaries.

08

The Mathematics of Similarity

Once you have embeddings, you need a function that measures how similar two vectors are. The choice of similarity metric has real consequences for retrieval quality, and different metrics make sense for different situations.

Manhattan Distance (L1)

Sum the absolute differences along each dimension: ||a − b||₁ = Σ|aᵢ − bᵢ|. Geometrically, this is the distance you'd travel on a city grid — you can only move along axes, never diagonally. It's robust to outliers (a single large-difference dimension doesn't dominate the way it does in L2) but doesn't capture the directionality of meaning.

Euclidean Distance (L2)

The straight-line distance: ||a − b||₂ = √(Σ(aᵢ − bᵢ)²). Intuitively natural — it's what "distance" meant in your high school geometry class. Works well in low dimensions but suffers from a fundamental problem at scale.

In very high-dimensional spaces (like 1,536 dimensions), a phenomenon called the curse of dimensionality kicks in: the ratio of the farthest point to the nearest point approaches 1 as dimensions grow. Every point becomes roughly equidistant from every other. L2 distance loses its discriminative power — the differences in distance between "similar" and "dissimilar" vectors shrink to noise.

Cosine Similarity

Cosine Similarity Formula

cos(θ) = (a · b) / (||a|| · ||b||)

Measures the angle between two vectors, completely ignoring their magnitudes. Range: −1 (opposite directions) to 0 (orthogonal) to 1 (same direction).

Cosine similarity is the dominant metric in NLP for a key reason: it's scale-invariant. A short book and a long encyclopedia on identical topics will have embedding vectors pointing in the same direction — similar topic, different magnitude. Cosine similarity sees them as identical. L2 would see them as far apart because the longer text produces a larger-magnitude vector. For semantic search, you care about direction, not magnitude.

Dot Product

a · b = Σaᵢbᵢ. When vectors are unit-normalized to length 1 (||a|| = ||b|| = 1), the dot product equals cosine similarity. Most vector databases normalize embeddings before storage for exactly this reason — then maximum inner product search (MIPS) is mathematically equivalent to cosine search, and inner products are extremely fast to compute on GPU hardware.

Which to use?

Cosine similarity for semantic text search (the standard). L2 when magnitude matters (clustering points that represent counts or quantities). Dot product when your embeddings are pre-normalized (faster than explicitly dividing by norms). L1 rarely in NLP, but sometimes in anomaly detection where outlier sensitivity is undesirable.

Interactive · Demo 04

Distance Metrics Playground

Enter two 3-dimensional vectors and see all four distance metrics computed in real time. The visualization shows the first two dimensions.

L1 Manhattan
Σ|aᵢ − bᵢ|
L2 Euclidean
√Σ(aᵢ − bᵢ)²
Cosine Similarity
(a · b) / (||a|| · ||b||)
Dot Product
Σ aᵢ · bᵢ
2D VIEW (dims 0 & 1)
09

Approximate Nearest Neighbor — Search at Billion Scale

You have a million document embeddings stored. A user submits a query. You need the 5 most similar embeddings. The brute-force approach: compute cosine similarity between the query and every stored vector, sort, return top 5.

Time complexity: O(n × d) — n vectors, each with d dimensions. For n=1,000,000 and d=1,536: 1.5 billion multiply-accumulate operations per query. On a modern server, perhaps 50–100ms. With 100 concurrent users, already straining. At n=1 billion vectors: 1,500 billion operations. Multiple minutes per query. Completely unusable.

Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss for enormous speedups. The key insight: you almost never need the exact nearest neighbor. For search, retrieval, and recommendation, 95% recall at 50× speed is the overwhelmingly correct trade. Users don't notice the 5% of cases where the 6th-best result was returned instead of the 5th-best.

HNSW — The Current Standard

HNSW (Hierarchical Navigable Small World) is currently the state-of-the-art algorithm for dense vector search. To understand it, you need a detour into network theory.

In 1967, social psychologist Stanley Milgram demonstrated that any two people in the United States could be connected through about 6 acquaintances — the "six degrees of separation" finding. Networks with this property have a specific structure: they combine local clustering (your friends know each other) with a few long-range hub connections (people who bridge distant social groups). These long-range connections are what make the network navigable — they let you jump across large distances quickly.

HNSW builds a multi-layer graph that deliberately recreates this structure:

  • Layer 0 (base): every node exists, connected to nearby neighbors within a short radius. Dense and detailed.
  • Layer 1: a random subset of nodes (roughly 1/M of layer 0), connected over longer distances. The "city map."
  • Layer 2 (and above): an even smaller subset, connected over very long distances. The "state map."

To find the nearest neighbor of a query vector:

  1. Start at the top layer, at a fixed entry node.
  2. Greedily navigate: check all neighbors of the current node, move to whichever is closest to the query.
  3. When no neighbor is closer than the current node, drop to the layer below. Start from the same node.
  4. Repeat until you reach layer 0. At layer 0, expand the search with a beam of ef candidates to avoid greedy local minima.
  5. Return the best result found.

The result: instead of examining all 1 billion vectors, you typically examine a few hundred — following the path from the coarse top layer down to the precise bottom layer. Like finding a house: first look at a state map to find the right city, then a city map to find the right neighborhood, then walk the streets to find the exact address.

HNSW Parameters

M — edges per node (default 16). Higher M → better recall, more memory. Each node stores M neighbors per layer. Memory: O(M × n × d_index).
ef_construction — beam width during index build (default 200). Higher → better index quality, slower build.
ef_search — beam width during query (default 50). Higher → better recall, slower query. This is the primary recall vs. speed lever.

Interactive · Demo 05

HNSW Graph Navigator

Watch HNSW find the nearest neighbor to the red star (query point). The algorithm starts at the coarse top layer, navigates greedily, then descends layer by layer to the precise result. Notice how few nodes it visits compared to brute force.

Click "Run Demo" to watch the traversal. Use the dropdown to explore the layer structure first.
35 nodes shown. Query point = red star. Brute force would check all 35. HNSW checks ~10–14 (28–40%).

IVF — Inverted File Index

A different approach: cluster all vectors into k groups using k-means. Store each vector in its nearest cluster's "bucket." At query time, embed the query, find the nearest p cluster centroids (the nprobe parameter), and search only those clusters. Trade-off: you must choose k wisely and accept that queries near cluster boundaries might miss their nearest neighbor if it falls in an unsearched cluster. IVF trades some recall for significantly lower memory usage than HNSW — it doesn't need to store explicit edge lists.

ANNOY — Approximate Nearest Neighbors Oh Yeah

Spotify's algorithm builds a forest of binary trees. Each tree is constructed by picking a random hyperplane that splits the data in two, then recursively splitting each half. To answer a query, traverse all trees simultaneously, take the union of candidates, compute exact distances, return top k. Simple, memory-efficient (just the tree structure), easy to add new vectors without rebuilding. Good for moderate scales and cases where you need to add vectors frequently without a full index rebuild.

AlgorithmBest forRecallMemoryBuild speed
HNSWHigh recall, production searchExcellentHighSlow
IVFMemory-constrained, large scaleGoodLowMedium
ANNOYRead-heavy, simple deploymentGoodMediumFast
10

Vector Databases

A vector database is an ANN index plus the production infrastructure that makes it deployable: persistence, metadata storage, filtering, horizontal scaling, and often hybrid search. The index alone answers "which vectors are nearest?" — the database answers "which documents about topic X, written after January 2025, matching user ID 99, are most semantically similar to this query?"

What Databases Add

Metadata filtering is the critical addition. Real queries are rarely pure semantic search. "Find me product docs about GPU memory, but only from the v4.2 release" requires combining vector similarity with a structured filter. A pure ANN index can't do this. Vector databases precompute metadata indexes alongside the vector index, enabling efficient hybrid queries.

Upserts and deletes are harder than they sound. HNSW is notoriously difficult to update — the graph structure is tightly coupled to the data. Most implementations use soft deletes (mark deleted, filter at query time) with periodic index rebuilds, or segment-based approaches where recent updates go into a flat index that's merged into HNSW periodically.

Hybrid search combines dense vector search with BM25 sparse keyword search. Some databases implement this natively (Weaviate, Qdrant). The relevance score becomes a weighted sum: score = α × dense_score + (1−α) × sparse_score. Hybrid search handles exact matches (product codes, names, technical terms) that dense search can miss when the query and document use slightly different phrasing.

DatabaseBest forDeploymentHybrid search
pgvector<10M vectors, existing PostgresSelf-hosted or managedVia Postgres full-text
QdrantHigh-perf production, great filteringSelf-hosted or cloudBuilt-in, excellent
WeaviateHybrid search, open sourceSelf-hosted or cloudBuilt-in BM25+vector
PineconeEasiest onboarding, fully managedServerless SaaSYes
ChromaPrototyping, local devEmbedded / localBasic

For most teams with fewer than 10 million vectors and an existing Postgres stack, pgvector is the underrated choice — zero new infrastructure, HNSW index support, and full SQL expressiveness for metadata filtering. For serious production scale or sophisticated hybrid search requirements, Qdrant's Rust-based core tends to win on performance benchmarks.

Part 04
The Complete System
11

The RAG Pipeline

Large language models know a great deal, but they don't know your data. They don't know your internal documentation, your customer history, your proprietary knowledge base, or anything that changed after their training cutoff. And retraining or fine-tuning a model on new knowledge is expensive — measured in tens of thousands of dollars and weeks of computation at scale.

Retrieval-Augmented Generation (RAG) solves this by connecting the LLM to an external knowledge source at inference time, without touching the model's weights. The model stays frozen; the knowledge is retrieved and injected per query.

The Four Phases

1. Index. Take your documents. Chunk them (Section 07). Embed each chunk using an embedding model (Section 06). Store the (vector, chunk_text, metadata) tuple in a vector database (Section 10). This is a one-time offline process, updated when your documents change.

2. Retrieve. When a user submits a query, embed it with the same embedding model used during indexing. Run ANN search against the vector database (Section 09). Return the top-k chunks whose vectors are closest to the query vector.

3. Augment. Take the retrieved chunk texts and inject them into the LLM's system prompt or context. The prompt now looks something like: "Answer the following question using only the context provided. Context: [retrieved chunk 1] [retrieved chunk 2] [retrieved chunk 3]. Question: [user query]."

4. Generate. The LLM reads the question and the retrieved context, and generates an answer grounded in the provided documents. It's not inventing — it's reading and synthesizing.

Retrieval Quality Is the Bottleneck

If you retrieve irrelevant chunks, the LLM will either hallucinate (inventing an answer not in the context) or correctly say "I don't have enough information." The LLM's reasoning ability is not the bottleneck in most RAG systems — retrieval precision is. Invest heavily in chunking strategy and retrieval evaluation before optimizing the LLM.

Hybrid Search

Dense vector search excels at semantic similarity — finding the document that means the same thing even if it uses different words. But it can miss exact matches. A user searching for "GDPR Article 17" or "SKU-4892-XL" is using precise identifiers where keyword matching is actually better. BM25 (the standard sparse retrieval algorithm used in most search engines) handles exact matches perfectly but can't find semantically similar documents using different vocabulary.

Production RAG systems almost universally use hybrid search: run both dense and sparse retrieval in parallel, then combine scores: final_score = α × cosine_score + (1−α) × bm25_score. Typical α = 0.7 (weighting dense more). Some systems learn α from relevance feedback.

Re-ranking

Dense retrieval uses bi-encoders — embed the query once, embed each document once, compare independently. This is fast (O(1) per document at query time, since document embeddings are precomputed) but less precise. A cross-encoder takes the query and a candidate document together as a single input and scores their relevance jointly. This is far more accurate — the model can compare them directly — but can't be precomputed and is expensive to run per document.

The solution: use bi-encoder dense retrieval to get the top 20 candidates quickly. Then run a cross-encoder re-ranker on those 20 to re-sort them accurately. Return the top 5 re-ranked results to the LLM. Two passes, each playing to its strength.

Where Things Go Wrong

Bad chunking produces chunks that don't contain the answer to the query even when the document does. The wrong embedding model (one not trained on your domain vocabulary) produces vectors where semantically similar texts in your domain aren't geometrically close. A k that's too small misses relevant chunks; too large floods the LLM's context with noise. No re-ranking lets lower-quality chunks displace better ones. Context window exhaustion (too many retrieved chunks, query, and instructions exceeding the LLM's limit) silently truncates context. Each failure mode requires different diagnosis and different fixes.

Part 05
LLM Internals
12

How LLMs Predict the Next Token

We've talked about what LLMs do at a high level. Now let's trace exactly what happens computationally, step by step, from the moment you hit enter to the moment a token appears.

The Forward Pass

Step 1: Tokenize. Your input text is split into tokens and mapped to integer IDs. "Hello, how are you?" might become [15496, 11, 703, 389, 345, 30].

Step 2: Embed and position. Each integer ID is looked up in the embedding matrix → a vector of size d_model (e.g., 4,096 for LLaMA 3 8B). Positional encodings are added (or applied via RoPE within the attention computation).

Step 3: Pass through N decoder layers. Each layer applies two sub-operations:

  • Causal self-attention: same mechanism as before, but with one critical difference — token i can only attend to tokens 0 through i. Future tokens are masked to −∞ before softmax, giving them zero weight. This prevents the model from "cheating" by looking at the answer it's supposed to predict.
  • Feed-forward network (FFN): two or three linear layers with a nonlinear activation (GELU or SwiGLU). Applied independently at each token position. This is where most of the "factual knowledge" is stored — the FFN has more parameters than the attention block in most architectures.

RMSNorm is applied before each sub-block (pre-norm) rather than after. This stabilizes training at large scale.

Step 4: Project to vocabulary. Take the final hidden state at the last token position — a vector of size d_model. Multiply it by the language model head matrix (shape [d_model × vocab_size]) to get a vector of vocab_size raw scores called logits.

Step 5: Softmax → sample. Apply softmax to the logits to get a probability distribution over all ~50,000 tokens. Sample one token from this distribution. Append it to the sequence and repeat.

Controlling Generation: Temperature and Sampling

Temperature is the dial between boring and chaotic. Before softmax, divide all logits by temperature T. T < 1 makes the distribution peakier — the most probable token gets even more weight. T = 0 is greedy decoding (always pick the single highest-probability token). T > 1 flattens the distribution, making surprising tokens more likely. T = 1 leaves logits unchanged (standard sampling).

Top-p (nucleus) sampling is more principled than temperature alone. Sort tokens by probability. Take the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). Sample from only those tokens. This cuts off the long tail of extremely unlikely tokens that temperature doesn't eliminate, preventing nonsensical outputs while preserving natural diversity.

The KV Cache

During generation, you generate one token at a time. Without optimization, generating token 1000 would require running a full attention forward pass over all 1000 previous tokens — O(n²) computation per token, and O(n) passes. Untenable for long sequences.

The KV cache stores the Key and Value matrices for all previous tokens after they've been computed. When generating the next token, you only need to compute Q, K, V for the new token, then attend to the cached K and V of all previous tokens. The cached tokens don't need to be recomputed. This reduces per-token generation cost from O(n²) to O(n) — a massive speedup for long contexts. It's why long-context inference is still expensive (the KV cache itself grows linearly and can exceed GPU memory) but not catastrophically so.

Why Generation Is Sequential (and Slow)

Each token requires waiting for the previous token to be generated — you can't generate token 5 without knowing tokens 1–4, because the attention is causal. This is fundamentally sequential, unlike training where all tokens are processed in parallel. The KV cache helps but can't fix the sequential bottleneck. Speculative decoding (using a smaller draft model to propose multiple tokens, verified in one pass by the main model) is the current best approach for speeding up generation.

Interactive · Demo 06

LLM Cost & Scale Calculator

Choose a model, set input and output token counts, and see real-time cost, estimated inference FLOPs, context window, and memory requirement.

API Cost
Context Window
max tokens per call
Inference FLOPs
≈ 2 × params × output tokens
Model Memory (bf16)
weights only, excl. KV cache
API pricing from public rate cards (May 2026). Self-hosted models: $0 API cost but you pay for GPU time. FLOPs estimate uses 2×N×tokens approximation for forward pass only.
13

Parameters & Memory

"LLaMA 3 8B has 8 billion parameters." You've seen this number everywhere. What does it concretely mean, and how does it translate to the hardware you need to run the model?

What a Parameter Is

A parameter is one floating-point number in one weight matrix. Every matrix multiplication in the model involves two matrices: the input activations and the weight matrix. The weights are the learned parameters — they're what changes during training and what gets fixed at inference time.

Where Parameters Live in a Transformer Layer

Each transformer decoder layer (LLaMA-style) contains these weight matrices:

Q projection:    [d_model × d_model]  =  4096 × 4096  =  16.8M params
K projection:    [d_model × d_model]  =  4096 × 4096  =  16.8M params
V projection:    [d_model × d_model]  =  4096 × 4096  =  16.8M params
O projection:    [d_model × d_model]  =  4096 × 4096  =  16.8M params
FFN gate:        [d_model × 4·d_model] = 4096 × 16384 =  67.1M params
FFN up:          [d_model × 4·d_model] = 4096 × 16384 =  67.1M params
FFN down:        [4·d_model × d_model] = 16384 × 4096 =  67.1M params
RMSNorm (×2):    [d_model] × 2        =           8K params
─────────────────────────────────────────────────────────────
One layer total:                               ≈ 268M params

LLaMA 3 8B has 32 layers: 32 × 268M ≈ 8.5B. Plus the token embedding matrix (128,000 vocab × 4096 = 524M) and the output head (shared with embeddings) → total ≈ 8B. The math checks out.

Memory: Parameters × Bytes

Memory requirement for the model weights alone:

PrecisionBytes / param8B model70B model
float32432 GB280 GB
bfloat16 (standard)216 GB140 GB
int8 quantized18 GB70 GB
4-bit (GGUF/GPTQ)0.54 GB35 GB

4-bit quantization of LLaMA 3 8B fits comfortably on a MacBook Pro M3 (18GB unified memory). This is why local LLM inference has become viable for individuals — quantization trades a small accuracy reduction for 4–8× memory savings. For most non-critical applications, the quality loss is imperceptible.

Note: this is weights only. Add KV cache memory (proportional to batch_size × sequence_length × d_model × num_layers × 2, for K and V), and activation memory during inference. The KV cache for a 100k-token context window on a 70B model in bfloat16 can exceed 140GB on its own — why long-context inference remains expensive even with quantized weights.

Interactive · Demo 07

Model Scale & Training Calculator

Use the 6ND formula to compute training FLOPs, time, and cost. Drag the sliders and see why large model training costs tens of millions of dollars.

Total FLOPs (6ND)
6 × params × tokens
Training Time
Estimated Cost
GPU hours × hourly rate
vs GPT-3
GPT-3 ≈ 3.14 × 10²³ FLOPs
MFU (Model FLOP Utilization): fraction of theoretical peak FLOPS achieved in practice. Typical values: 35–50% for large clusters due to communication overhead, memory constraints, and load imbalance.
14

Training Economics

The calculator above makes it feel like a math problem. The reality is considerably messier. Let's look at what actually makes large model training so staggeringly expensive.

The 6ND Formula in Depth

Training FLOPs ≈ 6 × N × D. The "6" decomposes as follows: the forward pass through the model requires approximately 2ND floating-point operations (each parameter participates in roughly two multiply-adds per token). The backward pass (computing gradients) costs approximately 2× the forward pass — 4ND total. The remaining 2ND-equivalent comes from the optimizer step (Adam stores two additional optimizer states per parameter — first and second moment — and the update operation adds cost). Total: 6ND.

Real numbers:

LLaMA 3 8B   (N=8B,   D=15T):  6 × 8×10⁹ × 15×10¹² = 7.2 × 10²³ FLOPs
LLaMA 3 70B  (N=70B,  D=15T):  6 × 7×10¹⁰ × 15×10¹² = 6.3 × 10²⁴ FLOPs
LLaMA 3 405B (N=405B, D=15T):  6 × 4×10¹¹ × 15×10¹² = 3.6 × 10²⁵ FLOPs
GPT-3        (N=175B, D=300B): 6 × 1.75×10¹¹ × 3×10¹¹ = 3.1 × 10²³ FLOPs

Memory Bandwidth: The Real Bottleneck

Most discussions of training cost focus on FLOPS. The subtler and often more binding constraint is memory bandwidth — how fast the GPU can move data between the high-bandwidth memory (HBM) and the compute units (CUDA cores / tensor cores).

During training, for each parameter, you read the weight, compute the gradient, and write the updated weight. At 6 bytes per parameter (bfloat16 weight + gradient), 8 bytes per optimizer state (Adam: 4 bytes each for first and second moment in fp32): 14 bytes per parameter moved per step. An H100 has 3.35 TB/s of HBM bandwidth. An 8B model has 8B × 14 = 112GB of data to move per step. At 3.35 TB/s: ~33ms per step — and that's just for the memory movement, before any compute.

In practice, this is why MFU (Model FLOP Utilization) — the fraction of theoretical FLOP capacity actually used — sits at 35–50% for large clusters, not 100%. The model spends a significant fraction of time waiting for data to arrive from memory.

The Real Cost Drivers

Raw GPU time is only part of the expense. A 30-day training run at $2.50/H100-hr across 6,000 GPUs costs: 6,000 × 720 hours × $2.50 = $10.8M. But that's the easy-to-quantify part. Add:

  • Data preparation: curating, cleaning, deduplicating, and filtering 15 trillion tokens of training data is months of engineering work.
  • Failed runs: hardware failures, hyperparameter mistakes, loss spikes — each incident can waste days of compute and millions of dollars.
  • Infrastructure: fast interconnects (InfiniBand at 400 Gb/s between nodes), storage systems fast enough to stream training data without bottlenecking, power and cooling.
  • Post-training: instruction tuning (RLHF, DPO) and safety training are separate, expensive steps after the base model is trained.

GPT-4, which presumably involved substantially more compute than LLaMA 3, was estimated at $100M+ in total training cost. These numbers explain why frontier model development is confined to a handful of organizations globally.

15

The Hardware Reality

Why do GPUs dominate AI compute so thoroughly? The answer goes back to the fundamental structure of transformer operations.

GPUs and SIMD Parallelism

A GPU is built around thousands of small, identical compute units executing the same instruction simultaneously — SIMD (Single Instruction, Multiple Data) parallelism. An H100 has 18,432 CUDA cores and 528 Tensor Cores (specialized units for matrix multiplication in mixed precision). Matrix multiplication is perfectly SIMD-parallel: every element of the output matrix is the dot product of one row of the left matrix and one column of the right matrix, and all of those dot products are independent and can be computed simultaneously.

Transformers are, at their core, a sequence of matrix multiplications. They were designed — partly intentionally — to map perfectly onto GPU hardware. This is not a coincidence; the hardware and the algorithms co-evolved.

The Memory Bandwidth Bottleneck (Again)

For inference with small batch sizes (like serving a single user query), you're not compute-bound — you're memory bandwidth-bound. The GPU's tensor cores sit idle while they wait for weight matrices to arrive from HBM. An H100 performs 3.35 TB/s of memory bandwidth, but the 70B model requires moving 140GB of weights per forward pass. Even at maximum bandwidth, that takes 42ms just for data movement — and no useful computation happens during that time.

This is described by the roofline model: a graph of operations per byte of memory access against operational intensity (FLOPs per byte). If your model sits below the memory bandwidth "roof," you're bandwidth-bound, and throwing more compute at it doesn't help. LLM inference at small batch sizes almost always sits below the roof.

Alternative Hardware

Google TPUs are custom ASICs designed from the ground up for matrix multiplication. They use systolic arrays — a 2D grid of multiply-accumulate units where data waves through in a structured pattern, achieving near-100% utilization on large matrix operations. Google uses TPUs for all of its own model training (PaLM, Gemini) and serving. The MFU achievable on TPUs tends to be higher than on GPUs for identical workloads because the architecture is more specialized.

Groq's LPU (Language Processing Unit) attacks the bandwidth bottleneck directly. Instead of HBM with its 3.35 TB/s ceiling, Groq uses a large on-chip SRAM bank (~230MB, compared to L1/L2 caches of a few MB on a GPU) with ~1 PB/s effective bandwidth. This eliminates the HBM bottleneck entirely for models that fit on-chip. The trade-off is capacity: you can only fit smaller models. But for those models, Groq delivers extraordinarily fast and deterministic inference latency — predictable down to the millisecond, unlike GPU inference which varies with memory contention.

FPGAs (Field-Programmable Gate Arrays) can be configured to implement specific computations in custom hardware logic. Microsoft's Project Catapult deploys FPGAs in Azure to accelerate Bing's search pipeline, including vector similarity operations. FPGAs are slower to program than GPUs (FPGA "programming" means designing a circuit) but dramatically more energy-efficient for fixed, specific workloads. A custom FPGA circuit for cosine similarity consumes far less power than a general-purpose GPU running the same operation.

Mythic AI takes a different approach to the bandwidth problem: it performs matrix multiplication inside flash memory cells using analog current summation. Weights are stored in analog form in the memory array, and the computation happens in-place — no data movement between memory and compute at all. This is a fundamentally different physical approach, not just a different chip architecture.

Where This Is Heading

For most teams, GPU + optimized software (vLLM for inference, FAISS for vector search) remains the dominant stack. Dedicated ASICs for similarity search are technically viable but not yet mainstream for general use. The next 3–5 years will likely see memory-in-compute architectures (Mythic-style) and disaggregated memory systems (GPUs sharing a memory pool) become more practical, further decoupling model size from per-GPU memory requirements.

The Full Picture

We started with a simple observation: computers need numbers, and text is not numbers. The journey from that problem to a production AI system involves more layers than most practitioners realize.

BPE tokenization converts text to integer IDs, making it machine-readable. The embedding matrix converts integers to dense float vectors, giving them geometric identity. Transformer attention layers contextualize those vectors — letting every token influence every other, so "bank" means something different depending on whether it's next to "river" or "deposit." Contrastive training builds embedding models that make those vectors metrically meaningful — semantically similar texts end up geometrically close.

Chunking divides large documents into embeddable units. Cosine similarity provides a scale-invariant distance metric over those vectors. HNSW makes billion-vector search tractable through hierarchical graph navigation. Vector databases add the infrastructure layer — persistence, filtering, hybrid search — that makes the index production-deployable. RAG wires the retrieval system to an LLM, giving the model access to knowledge beyond its training cutoff without retraining its weights.

And underneath all of it: billions of learned parameters, spread across attention matrices and feed-forward networks, each updated millions of times during training on exascale compute clusters, consuming millions of dollars of electricity, to learn a high-dimensional geometry of meaning that lets a machine predict — with remarkable accuracy — what word comes next.

Every time you use a chatbot, a search engine, or an AI writing assistant, this entire stack runs. Now you know exactly what it's doing.

↑ Back to top