M3E3: Flash Attention and KV Cache: The Engineering That Makes Inference Possible


The Physics Problem Underneath Every Context Limit

There is a number that governs nearly every practical constraint in large language model deployment, and it is not the parameter count, the training FLOP budget, or the benchmark score. It is memory bandwidth — specifically, the speed at which a GPU can move data between its main memory and its compute units. Everything else follows from that.

To understand why, walk through what happens on the silicon when a model generates a single token. The GPU's job is matrix multiplication: projecting query, key, and value vectors, computing attention scores, weighting value contributions. These operations are arithmetically intensive, and modern GPUs are extraordinarily fast at them. The NVIDIA A100 SXM delivers 312 TFLOPS (trillion floating-point operations per second) at FP16 (16-bit floating-point precision) — a staggering throughput. But raw FLOP capacity is almost irrelevant to inference performance, because the GPU cannot perform those operations on data it hasn't loaded yet, and loading data from its main memory is orders of magnitude slower than performing arithmetic on it.

The memory system of a data-center GPU has two tiers that matter here. The large tier — HBM, or High Bandwidth Memory — holds the model weights, the KV cache (a stored record of previously computed attention data, explained in detail below), and the activations. The A100 provides 80 GB of HBM2e at 2.0 TB/s; the H100 SXM5 pushes this to over 3 TB/s with HBM3, roughly a 1.7× bandwidth increase over the A100. The small, fast tier — SRAM (Static Random-Access Memory), or shared memory within each Streaming Multiprocessor — operates at bandwidths roughly an order of magnitude higher, but is measured in kilobytes per SM rather than gigabytes: on the A100, approximately 19 TB/s of SRAM bandwidth against only 164 KB per SM, versus HBM's 2.0 TB/s and 80 GB total.

That ratio — roughly ten times slower, thousands of times larger — is the hardware reality that every inference optimization must negotiate. The gap between arithmetic capability and memory bandwidth has a name in computer architecture: the memory wall. For LLM inference, the decode phase (the step-by-step generation of each output token) is textbook memory-bandwidth bound: every generated token requires reading the entire set of model weights, plus the KV cache, from HBM, so the GPU spends its time waiting on VRAM reads rather than doing arithmetic. This is why the bandwidth gap between the A100 and H100, not the FLOP gap, is the primary driver of the H100's inference speedup on these memory-bound operations.
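
To see why bandwidth rather than FLOPs sets the speed limit, a back-of-envelope calculation suffices. The sketch below is a rough illustration, assuming a hypothetical 70B-parameter model held in FP16 and a modest 5 GB KV cache (the function name and numbers are mine, not vendor figures beyond the bandwidths quoted above); it divides the bytes that must stream from HBM per generated token by each GPU's published bandwidth:

    # Back-of-envelope decode floor: every generated token must stream the model
    # weights (plus the KV cache) from HBM at least once. Numbers are illustrative.

    def decode_latency_floor_ms(param_count, bytes_per_param, kv_cache_gb, hbm_tbps):
        """Lower bound on per-token decode latency set purely by HBM bandwidth."""
        bytes_to_read = param_count * bytes_per_param + kv_cache_gb * 1e9
        return bytes_to_read / (hbm_tbps * 1e12) * 1e3  # seconds -> milliseconds

    # Hypothetical 70B model in FP16 with a 5 GB KV cache, A100 vs. H100 bandwidth
    for name, bw_tbps in [("A100", 2.0), ("H100", 3.35)]:
        ms = decode_latency_floor_ms(70e9, 2, kv_cache_gb=5, hbm_tbps=bw_tbps)
        print(f"{name}: >= {ms:.0f} ms per token, <= {1000 / ms:.0f} tokens/s at batch 1")

Even with zero compute time, the A100 cannot exceed roughly 14 tokens per second for a single 70B-parameter request under these assumptions; the H100's advantage comes almost entirely from the bandwidth column.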

Now consider what standard attention does with this hardware. Given a sequence of N tokens, the attention mechanism must compute the N×N matrix of scores — each query attending to every key. For a sequence of 4,096 tokens with 32 attention heads, this produces attention matrices that must be written to HBM, read back for the softmax normalization step, written again, read back for the weighted sum over values, and written again for the output. Both the time and memory complexity of self-attention are quadratic in sequence length — and the data movement, the reads and writes to slow HBM, scales just as badly and compounds the damage.

Standard attention requires Θ(Nd + N²) HBM accesses, where N is the sequence length and d is the head dimension. At N=4096, that is roughly 17 million HBM accesses for the N² term alone, at each attention layer, for each forward pass. Multiply by 32 layers, and the HBM traffic dominates everything. The GPU's thousands of CUDA cores sit largely idle, starved for data, while the memory bus churns. Model FLOPs Utilization (MFU, the fraction of theoretical peak compute actually used) of 30–50% is typical — GPUs stall on memory access, not compute.
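
To put rough numbers on that traffic, the following sketch — an illustrative accounting of only the materialized score and probability matrices, assuming five HBM round trips per layer rather than the paper's exact IO model — shows how quickly the movement adds up at a 4,096-token context:

    # Illustrative accounting of the HBM traffic created by materializing the
    # N x N score/probability matrices (FP16), assuming five round trips per layer.

    def naive_attention_hbm_bytes(n_tokens, n_heads, round_trips=5, bytes_per_elem=2):
        score_matrix = n_tokens * n_tokens * bytes_per_elem   # one head's N x N block
        return score_matrix * n_heads * round_trips

    n, heads, layers = 4096, 32, 32
    per_layer = naive_attention_hbm_bytes(n, heads)
    print(f"score matrix, one head:  {n * n * 2 / 1e6:.0f} MB")
    print(f"HBM traffic, one layer:  {per_layer / 1e9:.1f} GB")
    print(f"per forward pass:        {per_layer * layers / 1e9:.0f} GB")
    print(f"time at 2 TB/s:          {per_layer * layers / 2e12 * 1e3:.0f} ms")

Under these assumptions, nearly ninety milliseconds per forward pass goes to pure data movement before a single useful FLOP is counted against the budget.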

This is the specific bottleneck that Flash Attention was designed to attack. Not the compute. The data movement.


Flash Attention: Tiling the Computation to Stay in SRAM

The crucial insight behind Flash Attention, published by Tri Dao, Daniel Fu, Stefano Ermon, Atri Rudra, and Christopher Ré from Stanford in 2022, was that the memory bottleneck in standard attention was not inevitable — it was a consequence of how the computation was organized, not a fundamental requirement of the mathematics. The missing principle was making attention algorithms IO-aware: accounting for reads and writes between levels of GPU memory. Flash Attention uses tiling to reduce the number of memory reads and writes between GPU HBM and GPU on-chip SRAM.

The standard implementation of attention materializes the full N×N score matrix. You compute QKᵀ (the matrix of query–key dot products), write it to HBM, read it back to apply softmax, write the result back to HBM, read it again to multiply by V. This is the programmer's natural approach: each operation is a clean kernel call, and each intermediate result lives in the only memory large enough to hold it, which is HBM. It is the obvious implementation — and it is catastrophically slow, because it turns what could be a single pass through the data into five separate round trips to slow memory.

Flash Attention prevents materialization of the large N×N attention matrix on slow GPU HBM. In the outer loop, it steps through blocks of the K and V matrices, loading them into fast on-chip SRAM; within each outer step, it loops over blocks of the Q matrix, loading them to SRAM, and writes the output of the attention computation back to HBM only at the end. The key technical challenge this creates is the softmax: to normalize attention scores correctly, you need the sum of exponentials across the entire row, but in a tiled computation you have only seen a fraction of the row at any given time. The solution, which already existed in the numerical analysis literature, is the online softmax — you maintain a running maximum and running sum-of-exponentials, adjusting previous partial results as new blocks arrive. By scaling the output of each block by the right normalization factor before adding them up, the algorithm arrives at the exact result at the end.
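
The online softmax is easier to trust once it is written out. The sketch below is a minimal single-row NumPy illustration of the running-max and running-sum bookkeeping — not the fused CUDA kernel, and the function name is illustrative — processing a score row block by block and reproducing the exact softmax-weighted sum:

    import numpy as np

    def online_softmax_weighted_sum(scores, values, block=128):
        """Numerically stable softmax(scores) @ values, one block of keys at a time.

        The single-row core of the Flash Attention trick: never hold the full
        score row, only a running max (m), a running sum of exponentials (l),
        and a running, rescaled partial output (acc).
        """
        m = -np.inf                      # running max of scores seen so far
        l = 0.0                          # running sum of exp(score - m)
        acc = np.zeros(values.shape[1])  # running weighted sum of value rows

        for start in range(0, len(scores), block):
            s = scores[start:start + block]
            v = values[start:start + block]
            m_new = max(m, s.max())
            correction = np.exp(m - m_new)   # rescale old partials to the new max
            p = np.exp(s - m_new)
            l = l * correction + p.sum()
            acc = acc * correction + p @ v
            m = m_new
        return acc / l

    # sanity check against the naive all-at-once computation
    rng = np.random.default_rng(0)
    scores = rng.normal(size=1024)
    values = rng.normal(size=(1024, 64))
    weights = np.exp(scores - scores.max())
    naive = (weights / weights.sum()) @ values
    assert np.allclose(online_softmax_weighted_sum(scores, values), naive)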

The result is mathematically exact — not an approximation. This is the non-obvious part. Before Flash Attention, the research community had expended considerable effort on approximate attention methods — sparse attention, linear attention, locality-sensitive hashing variants — precisely because reducing the N² cost seemed to require approximation. Approximate attention methods had attempted to address the quadratic problem by trading off model quality to reduce compute complexity, but often did not achieve wall-clock speedup. Flash Attention demonstrated that the exact computation could be done with a fundamentally different memory access pattern that made the wall-clock time plummet even though the FLOP count stayed the same or increased slightly (due to recomputation in the backward pass). The speedup came not from doing less work but from doing the same work with far fewer HBM accesses.

For typical values of the head dimension d and SRAM size M, Flash Attention requires up to 9× fewer HBM accesses than standard attention. This yields a 2–4× wall-clock speedup over optimized baselines and up to 10–20× memory savings, with no approximation. Flash Attention-2, published by Dao alone at ICLR 2024 (the International Conference on Learning Representations), refined the parallelism and work-partitioning strategies to squeeze additional efficiency from the same tiling idea. Flash Attention-3 was optimized specifically for Hopper GPUs like the H100, exploiting the H100's asynchronous execution capabilities to overlap memory movement and computation.

The benchmark numbers tell the story concretely. Flash Attention trains transformers faster than existing baselines: a 15% end-to-end wall-clock speedup training BERT-large (a standard benchmark model) at sequence length 512, 3× speedup on GPT-2 at sequence length 1K, and 2.4× speedup on long-range arena benchmarks at sequence lengths 1K–4K. Those speedups are larger at longer sequences — precisely the regime that matters most for frontier deployment, where context lengths measured in thousands of tokens have given way to context lengths measured in hundreds of thousands.


What Flash Attention Made Possible: The RAG vs. Long-Context Decision

The practical significance of Flash Attention is not primarily about training speed. It is about what it made feasible at inference time, at scale, in production: long context windows.

Before Flash Attention, extending context beyond a few thousand tokens was effectively prohibited by memory. At N=32K tokens, a naively materialized N×N attention matrix in FP16 requires roughly 2 GB per attention head — tens of gigabytes per 32-head layer — per request, and that memory had to be written and re-read from HBM multiple times per forward pass. At 32 layers, a single request with a 32K context would consume memory that could otherwise serve dozens of shorter requests, while simultaneously hammering the memory bus with data movement that crowded out everything else. The economics were untenable.

Flash Attention and block-sparse Flash Attention enabled longer context in transformers, yielding higher quality models and entirely new capabilities, including the first transformers to achieve better-than-chance performance on the Path-X challenge at sequence length 16K and Path-256 at sequence length 64K. Those were training experiments, but the same memory efficiency carries into inference. Once the O(N²) HBM materialization was replaced with O(N) SRAM residency — meaning memory requirements grew linearly rather than quadratically with sequence length — the question of context length became primarily one of KV cache memory rather than attention computation memory.

The result is the context window landscape that CAIOs (Chief AI Officers) now navigate. Gemini 2.5's 1M token context window, Claude 4's 200K+ context, the large long-context variants of GPT-5 — these are architecturally enabled by Flash Attention and its successors. The computation would have been impractical at these lengths without it.

This capability shift created a strategic decision that belongs squarely in the CAIO's domain: for document-heavy applications — legal review, financial analysis, multi-document research synthesis, large codebase understanding — should the architecture route through a retrieval system using RAG (Retrieval-Augmented Generation, a technique that selects relevant document chunks before inference rather than loading the full document) or should it load full documents into a long context window and let the model attend across everything?

The tradeoffs are real and not symmetric. Long-context inference is computationally expensive even with Flash Attention — memory requirements grow roughly linearly with context length, with meaningful latency implications that we will quantify shortly. Retrieval systems are architecturally complex, introduce retrieval errors, and struggle with questions that require synthesizing information distributed across a document. A law firm parsing a 500-page merger agreement needs different context strategies than a customer service system retrieving from a product knowledge base.

The practical answer in 2026 is usually neither pure RAG nor pure long-context, but a tiered architecture: sparse retrieval to identify candidates, followed by full attention over retrieved sections. The decision calculus changed fundamentally once long-context inference became feasible. A few years ago the choice was forced: RAG, because there was no alternative. Now it is an architecture decision with cost, latency, accuracy, and operational complexity all in play.

Flash Attention solved the attention computation side of the problem. But it did not solve the other half of the memory challenge that grows with context length — the KV cache.


The KV Cache: Autoregressive Generation's Memory Tax

Autoregressive text generation is sequential by construction: to generate token N+1, the model must attend to all previous tokens 1 through N. Naively, every generation step would re-run the model over the entire prefix to reproduce the keys and values it needs, and that recomputation grows with every token emitted. The KV cache avoids it by storing the key and value projections computed for each token as generation proceeds, so each new step only needs to compute the query, key, and value for the new token, append the new K and V to the cache, and attend over the cached keys and values of all previous tokens.

This is the right trade — you pay in memory to avoid paying in compute. But the memory cost accumulates remorselessly with sequence length, batch size, and model scale.

The size of a KV cache entry per token is determined by: number of layers × number of KV heads × head dimension × 2 (one for K, one for V) × bytes per element. For a 70B-class model using full MHA (multi-head attention, the standard approach where each attention head maintains its own key and value projections) — 80 layers, 64 heads, head dimension 128, FP16 — that works out to roughly 2.5 MB per token, or about 10 GB for a 4,096-token context, and every single token generation step requires reading all of it from GPU HBM. Push the context into the tens of thousands of tokens and a single request's cache approaches 80 GB — the entirety of a single A100 or H100's VRAM, leaving no room for the model weights themselves.

At 128K context, a 70B model with full multi-head attention needs roughly 312 GB of KV cache versus approximately 39 GB with GQA-8 (Grouped-Query Attention with 8 key-value head groups, explained in the next section) — the difference between "fits on 8 GPUs" and "impossible."
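
The sizing formula is simple enough to keep as a small function. The calculator below is an illustrative sketch assuming a 70B-class configuration (80 layers, head dimension 128, FP16); it reports decimal gigabytes, so the absolute values land somewhat above the figures quoted above, but the 8× ratio between full MHA and GQA-8 is the point:

    # KV cache sizing straight from the formula above. The architecture numbers
    # are a 70B-class configuration used purely for illustration.

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
        """Total KV cache: layers x kv_heads x head_dim x 2 (K and V) x bytes, per token."""
        per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem
        return per_token * seq_len * batch / 1e9

    for ctx in (4_096, 32_768, 131_072):
        mha = kv_cache_gb(80, 64, 128, ctx)   # full multi-head attention: 64 KV heads
        gqa = kv_cache_gb(80, 8, 128, ctx)    # GQA-8: 8 shared KV heads
        print(f"{ctx:>7} tokens   MHA: {mha:6.1f} GB   GQA-8: {gqa:5.1f} GB")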

The batch dimension compounds the problem in a way that catches organizations off guard. A single user running a 128K context inference on a 70B model is one thing. A production system handling 100 concurrent users, each with different context lengths, is entirely different. The KV cache for each request must be maintained independently in GPU memory throughout the generation process — you cannot share it across requests unless they share a common prefix. At production batch sizes, KV cache memory routinely exceeds model weight memory. A 70B parameter model in FP16 is roughly 140 GB, and at meaningful context lengths with meaningful batch sizes, the KV cache overtakes even that.

The static allocation model that naive implementations use makes this worse by a substantial margin. When a request arrives, the system does not know how long the response will be. The natural response is to reserve memory for the maximum possible output length upfront. Managed this way, a large share of the reserved memory is wasted through fragmentation and redundant duplication, which caps the achievable batch size.

Existing systems waste 60–80% of memory due to fragmentation and over-reservation. A system with 80 GB available for KV cache that wastes 70% of it through fragmentation effectively has 24 GB of usable KV cache. This caps batch size — and batch size is the primary lever for throughput in inference serving. Fewer concurrent requests per GPU means higher cost per token and lower hardware utilization.


GQA and PagedAttention: The Engineering Response to KV Cache Scaling

Two distinct techniques address the KV cache problem at different levels: GQA (Grouped-Query Attention) attacks the cache size at the architecture level, while PagedAttention attacks the memory management problem at the serving system level. They are complementary, and together they define the current standard for production LLM serving.

GQA, introduced by Ainslie and colleagues at Google in May 2023 and published at EMNLP (the Conference on Empirical Methods in Natural Language Processing), proposes an intermediate number of key-value heads — more than one but fewer than the number of query heads. The mechanism is straightforward: instead of each query head having its own K and V projections (multi-head attention) or all query heads sharing a single K/V pair (multi-query attention, or MQA), GQA groups query heads so that each group shares one K/V head. With 32 query heads and 8 KV heads, every 4 query heads share one KV head. This drops the KV cache size proportionally — by a factor of h/g, where h is the number of query heads and g is the number of groups — while maintaining much of the representational power of full multi-head attention. Uptrained GQA achieves quality close to full multi-head attention with speed comparable to multi-query attention.
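
A minimal sketch makes the grouping concrete. The NumPy toy below — unbatched, without causal masking or the output projection, with illustrative names and shapes — shows 32 query heads sharing 8 K/V heads, so only the small K and V tensors would ever need to be cached:

    import numpy as np

    def grouped_query_attention(q, k, v):
        """Attention where groups of query heads share one K/V head (minimal sketch).

        q: (n_q_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim).
        Only the small K/V tensors are cached; queries are never cached.
        """
        n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
        group_size = n_q_heads // n_kv_heads          # 32 query heads / 8 KV heads = 4
        outputs = []
        for h in range(n_q_heads):
            kv = h // group_size                      # which shared KV head this query head uses
            scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
            probs = np.exp(scores - scores.max(-1, keepdims=True))
            probs /= probs.sum(-1, keepdims=True)
            outputs.append(probs @ v[kv])
        return np.stack(outputs)                      # (n_q_heads, seq, head_dim)

    q = np.random.randn(32, 16, 128)   # 32 query heads
    k = np.random.randn(8, 16, 128)    # only 8 KV heads are projected -- and cached
    v = np.random.randn(8, 16, 128)
    print(grouped_query_attention(q, k, v).shape)     # (32, 16, 128)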

The adoption trajectory was rapid. Following publication of the GQA paper in May 2023, many LLMs quickly adopted it. Meta first adopted GQA for its Llama 2 models in July 2023 and retained it in Llama 3. Mistral AI used GQA in the Mistral 7B model released in September 2023. Llama 3 uses 8 KV heads against 32 query heads — a 4× reduction in KV cache size relative to full MHA. Gemma models, Qwen 3, and essentially all current frontier open-weight models use GQA as the default. The DeepSeek series went further, introducing MLA (Multi-head Latent Attention) in DeepSeek V2, which compresses K/V tensors into a low-rank latent representation before caching, achieving even more aggressive KV cache reduction while preserving model quality.

A smaller KV cache means more concurrent requests. With GQA-8 instead of MHA-32, the KV cache is 4× smaller per request. On a GPU with fixed memory, you can serve 4× more concurrent users or handle 4× longer contexts before running out of memory. Because autoregressive decode is memory-bandwidth bound — reading the KV cache from HBM is the dominant operation per token — each decode step reads less data from memory. For memory-bandwidth-bound workloads, which is essentially all autoregressive decoding, less data to read means faster generation. The speedup is roughly proportional to the reduction in KV heads.

PagedAttention, introduced by Woosuk Kwon and colleagues from UC Berkeley in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention," published at SOSP 2023 (the ACM Symposium on Operating Systems Principles, one of the premier venues in systems research) where it received the Best Paper Award, addresses the memory management problem rather than the model architecture. It is an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. The vLLM (virtual Large Language Model) serving system built on top achieves near-zero waste in KV cache memory and flexible sharing of KV cache within and across requests.

The core mechanism directly mirrors OS virtual memory. Unlike traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. It partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. Just as a process addresses memory through a logical page table that maps to non-contiguous physical frames, each active sequence in vLLM addresses its KV cache through a logical block table that maps to non-contiguous physical blocks in GPU memory.
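
The block-table idea fits in a few dozen lines. The toy allocator below is a conceptual sketch, not vLLM's actual data structures (class and method names are invented for illustration): a sequence's logical cache grows one fixed-size block at a time, the physical blocks backing it need not be contiguous, and memory is claimed only as tokens are actually generated:

    # Toy block-table allocator in the spirit of PagedAttention. Names and
    # structure are illustrative, not vLLM's API.

    BLOCK_SIZE = 16  # tokens per block

    class PagedKVCache:
        def __init__(self, num_physical_blocks):
            self.free = list(range(num_physical_blocks))  # free physical block ids
            self.block_tables = {}                        # seq_id -> [physical block ids]
            self.lengths = {}                             # seq_id -> tokens written

        def append_token(self, seq_id):
            """Reserve room for one more token; grab a new block only when needed."""
            table = self.block_tables.setdefault(seq_id, [])
            length = self.lengths.get(seq_id, 0)
            if length % BLOCK_SIZE == 0:                  # current block full, or first token
                if not self.free:
                    raise MemoryError("KV cache exhausted -- preempt or queue the request")
                table.append(self.free.pop())             # any free block; contiguity never required
            self.lengths[seq_id] = length + 1

        def release(self, seq_id):
            """Request finished: physical blocks go straight back to the free pool."""
            self.free.extend(self.block_tables.pop(seq_id, []))
            self.lengths.pop(seq_id, None)

    cache = PagedKVCache(num_physical_blocks=1024)
    for _ in range(40):                                   # a 40-token sequence
        cache.append_token("req-1")
    print(cache.block_tables["req-1"])                    # three block ids, wherever they landed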

The practical impact is large. Where existing systems waste 60–80% of KV cache memory, vLLM achieves near-optimal memory usage with waste under 4%. The throughput improvements are not theoretical: vLLM improves the throughput of popular LLMs by 2–4× with the same level of latency compared to systems such as FasterTransformer (NVIDIA's optimized transformer inference library) and Orca (a prior academic serving system). LMSYS, the organization behind Chatbot Arena and Vicuna, cut the number of GPUs used to serve their growing traffic of approximately 45K daily requests by 50% while serving 2–3× more requests per second.

PagedAttention also enables prefix caching — multiple requests that share a common system prompt or document prefix can share the physical KV blocks for those prefix tokens, since they contain identical keys and values. PagedAttention's memory sharing reduces the memory overhead of complex sampling algorithms like parallel sampling and beam search by up to 55%, which can translate into up to a 2.2× improvement in throughput. In production deployments where many users interact with the same system prompt or document context, this is not a marginal optimization — the shared prefix can account for the majority of a request's KV cache footprint.

Together, GQA and PagedAttention define the floor of what a serious inference deployment looks like in 2025–2026. Any serving infrastructure that omits both is leaving 2–4× throughput on the table by failing to manage KV cache efficiently, and potentially another 4× by using full MHA where GQA is available. vLLM, TensorRT-LLM (NVIDIA's optimized inference runtime), and SGLang (a structured generation inference framework from the Berkeley Sky Computing Lab) all implement both as defaults. The choice of inference framework is now, in substantial part, a choice about how well the KV cache is managed.


Context Window as a Cost Variable: The Decision Arithmetic

Context window length is a cost-per-token parameter with direct implications for latency, throughput, and infrastructure spend. Organizations that treat "larger context window = better" are making an accounting error.

The arithmetic works like this. For a given model and hardware configuration, the KV cache size grows linearly with context length. Every token in the context window adds a fixed memory footprint that must be loaded from HBM on each generation step. Doubling context length roughly doubles the KV cache bytes read per generated token, which at decode time — already memory-bandwidth bound — pushes per-token latency up nearly proportionally once cache reads rival the weight reads. Simultaneously, doubling context length halves the number of concurrent requests that fit in memory, which halves maximum throughput for a given GPU allocation.

Consider a concrete scenario. A legal-tech company is serving document review requests on an 8×H100 node with 80 GB per GPU, running a 70B model. Model weights in FP16 occupy approximately 140 GB — two GPUs worth of HBM for weights alone, leaving 480 GB for KV cache and activations. With GQA-8 on a 70B architecture, the KV cache per token per layer is substantially reduced. At a 32K token context — sufficient for many but not all documents — and a batch size of 16, the KV cache occupancy is manageable and throughput is reasonable. At 128K tokens, the KV cache per request expands fourfold, batch size must shrink proportionally, and latency-per-token increases. At 500K tokens — where some frontier models technically operate — you are running one or two requests at a time on that entire 8-GPU node, and each token requires reading hundreds of gigabytes of KV cache from HBM.
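
A rough model of that scenario makes the scaling visible. The sketch below uses illustrative assumptions throughout — the per-token cache size, the memory set aside for weights, and the use of aggregate node bandwidth are simplifications, not measurements — to show how the maximum batch and the bandwidth-bound decode floor move as context grows:

    # Rough serving arithmetic for the scenario above. All constants are
    # illustrative assumptions, not measurements.

    HBM_FOR_KV_GB = 480          # 8 x 80 GB minus ~2 GPUs' worth reserved for weights
    KV_GB_PER_TOKEN = 0.00033    # ~0.33 MB/token: 70B-class model with GQA-8, FP16
    WEIGHT_READ_GB = 140         # FP16 weights streamed once per generated token
    NODE_BANDWIDTH_TBPS = 8 * 3.35

    for ctx in (32_768, 131_072, 500_000):
        kv_per_request = KV_GB_PER_TOKEN * ctx
        max_batch = int(HBM_FOR_KV_GB // kv_per_request)
        # per-token decode floor at batch 1: weights once, plus this request's cache
        ms_per_token = (WEIGHT_READ_GB + kv_per_request) / (NODE_BANDWIDTH_TBPS * 1e3) * 1e3
        print(f"ctx {ctx:>7}: {kv_per_request:6.1f} GB/request, "
              f"max batch ~{max_batch:>2}, decode floor ~{ms_per_token:.1f} ms/token")

Under these assumptions the node holds dozens of concurrent 32K requests, roughly a dozen at 128K, and only a couple at 500K — the same cliff described above, derived from nothing but the sizing formula and the bandwidth numbers.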

The inference pricing structures at major providers encode this tradeoff explicitly, even if they don't always label it transparently. Token costs at Anthropic, OpenAI, and Google scale with context for exactly this reason: the infrastructure cost per token is not constant across context lengths. A CAIO who runs cost projections for a document-heavy application using only the per-token list price is systematically underestimating cost for long-context workloads, because the effective throughput — and therefore compute cost per token — degrades as context grows.

The latency dimension is often more operationally significant than cost. For interactive applications — a lawyer asking follow-up questions while reviewing a contract — TTFT (time-to-first-token, the delay between submitting a prompt and receiving the first word of a response) scales with the prefill computation over the full context, which itself scales roughly linearly with context length. A 500K token context has roughly 16× the prefill latency of a 32K context, all else equal. User experience deteriorates well before memory runs out.

The jump to 141 GB of HBM3e and up to 4.8 TB/s on the H200 was specifically aimed at reducing off-chip traffic and stabilizing long context windows — which signals exactly how serious the memory bandwidth pressure becomes at production context lengths. The H200 carries the same compute die as the H100 but roughly 43% more HBM bandwidth (4.8 vs. 3.35 TB/s) and 76% more memory (141 vs. 80 GB), specifically targeting the memory-bound LLM inference workload. The chip architecture roadmap is itself a reflection of the KV cache problem.

For a CAIO's architecture decisions, several practical conclusions follow from this arithmetic.

Context windows should be sized to workload, not to maximum capability. For most retrieval-augmented workflows, 8K–32K context covers the relevant extracted chunks with significant headroom, and operates in the efficient regime where batch sizes remain large and throughput is predictable. For workflows that genuinely require holistic document understanding — contract due diligence, code review across large repositories, multi-document synthesis where the relationships between passages matter — longer contexts are justified, but the cost should be modeled explicitly, not assumed to be proportional to a standard short-context workload.

Prefix caching should be a baseline requirement in any production deployment, not an optional optimization. When a 50-page legal agreement is prepended to every query in a document review session, every token in that agreement costs KV cache memory that can be amortized across all queries if the serving system implements prefix sharing. Without it, you pay the memory and bandwidth cost of that document on every single request. Most modern serving stacks support this; many deployment configurations leave it disabled.

The GQA configuration of the underlying model matters for cost projection, and it varies substantially across models. A model with 32 KV heads instead of 8 carries 4× the KV cache footprint at equal context length, which translates directly to 4× fewer concurrent requests in memory and correspondingly higher cost per request served. When selecting between frontier models for a high-throughput deployment, KV head count is an infrastructure cost variable comparable to parameter count.

The choice between cloud inference APIs and self-hosted inference is also mediated by these dynamics. Cloud providers have already made the Flash Attention and PagedAttention investments, and their pricing incorporates the efficiency gains. Self-hosted deployments can use those same gains through vLLM or TensorRT-LLM, but require operational investment in serving infrastructure, GPU management, and careful configuration of memory utilization parameters. The right settings depend heavily on your request distribution, target latency, and available hardware. A system tuned for short-context, high-concurrency request patterns will be configured very differently from one handling infrequent long-context requests — and misconfiguration in either direction materially degrades the economics.


The Signal in the Hardware Roadmap

There is a straight line from the physics described in this episode to the direction of the GPU market. Infrastructure teams use roofline analysis (a visual performance model that identifies whether a workload is limited by compute or by memory bandwidth) to make purchasing decisions: if your workload is memory-bound, as LLM decode is, buy GPUs with more bandwidth. The H200 was not designed to be faster than the H100 at training — it uses the same compute die. The H200 upgrades the memory subsystem: 141 GB HBM3e versus 80 GB HBM3, with 4.8 TB/s versus 3.35 TB/s bandwidth, while compute is identical. NVIDIA announced the B200 with HBM3e at 8 TB/s — the bandwidth increases continue to outpace the compute increases, because the memory bandwidth wall is the binding constraint and everyone knows it.
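
The roofline comparison is one line of arithmetic per chip. The sketch below is illustrative: it approximates per-token decode as roughly 2 FLOPs per parameter against one 2-byte read per parameter, and compares that arithmetic intensity to each chip's balance point using published peak FP16 tensor-core throughput and bandwidth:

    # Roofline-style check: is a kernel compute-bound or bandwidth-bound?
    # Compare its arithmetic intensity (FLOPs per byte moved) to the machine balance.

    def machine_balance(peak_tflops, hbm_tbps):
        """FLOPs the chip can do per byte it can fetch; below this, memory is the limit."""
        return peak_tflops * 1e12 / (hbm_tbps * 1e12)

    # Illustrative per-token decode for a 70B FP16 model: ~2 FLOPs per parameter,
    # and every parameter (2 bytes) must be streamed from HBM.
    decode_intensity = (2 * 70e9) / (2 * 70e9)          # ~1 FLOP per byte

    for name, tflops, bw in [("A100", 312, 2.0), ("H100", 989, 3.35), ("H200", 989, 4.8)]:
        balance = machine_balance(tflops, bw)
        bound = "memory-bound" if decode_intensity < balance else "compute-bound"
        print(f"{name}: balance ~{balance:.0f} FLOPs/byte vs decode ~{decode_intensity:.0f} -> {bound}")

Decode sits roughly two orders of magnitude below every balance point in this approximation, which is why only the bandwidth column of a spec sheet predicts its speed.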

Flash Attention and the KV cache innovations are responses to a physics problem that the hardware vendors are attacking from the other side. The memory hierarchy gap — the vast disparity between what GPUs can compute and how fast they can be fed data — has driven years of systems research, architectural changes, and hardware investment simultaneously. That convergence is not coincidental. It reflects a real constraint that is, in 2026, still not fully resolved.

The CAIO who understands this constraint can reason about it from first principles rather than benchmarks alone. When a vendor claims a new chip delivers 3× better inference performance, the right question is whether that improvement comes from compute throughput or memory bandwidth — because for most production LLM workloads, only one of those numbers moves the needle. When an engineering team proposes moving to a larger context window to avoid RAG complexity, the right question is what happens to per-token cost and TTFT at the tail of the request distribution, not what the average case looks like. When a model with strong benchmark numbers uses full multi-head attention rather than GQA, the right question is what that means for the KV cache footprint at the context lengths your application uses.

The bottleneck is not compute. It never was. The bottleneck is the speed of the road between where the data sits and where the arithmetic happens — and every architecture decision you make for a production AI system will eventually resolve to some implication of that fact.