Mamba and State Space Models: The Transformer's Most Serious Challenger
The Quadratic Tax
Every transformer you have ever run is paying a computational tax that grows with the square of its context length. The cost is structural. When a transformer processes a sequence of n tokens, the attention mechanism computes pairwise similarity between every token and every other token. The resulting attention matrix has n² entries. Double the context length, quadruple the compute. Extend to a hundred thousand tokens, and the attention matrix alone holds ten billion pairwise scores for a single layer. The transformer's power is inseparable from this cost: attention works precisely because it can route information between any two positions in the sequence without degradation. But that universality carries a price that becomes prohibitive at the sequence lengths real enterprise workloads increasingly demand — multi-hour meeting transcripts, entire legal dossiers, genomic sequences, long-horizon code repositories.
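To make the arithmetic concrete, here is a minimal sketch of the two scaling regimes. The counts are entries touched per layer, not a performance model of any particular system:

```python
def attention_pairs(n_tokens: int) -> int:
    """Entries in one layer's attention matrix: quadratic in context."""
    return n_tokens ** 2

def recurrent_updates(n_tokens: int) -> int:
    """State updates for a linear-time sequence layer: one per token."""
    return n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attention_pairs(n):>18,} pairs "
          f"vs {recurrent_updates(n):>7,} updates")
# 100,000 tokens -> 10,000,000,000 pairs vs 100,000 updates
```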
Many subquadratic-time architectures — linear attention, gated convolution and recurrent models, and structured state space models (SSMs) — have been developed to address this computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. That last clause is the crux of the problem. Efficiency is easy to achieve; what has proven elusive is efficiency without quality degradation. For years, every challenger that managed to lower the computational complexity also lowered the performance ceiling.
The situation has changed. State space models are not simply a more efficient approximation of transformers. They rest on a different mathematical foundation, they excel at structurally different tasks, and the best evidence from 2024 and 2025 suggests the right answer is not to choose between transformers and SSMs but to understand when and how to combine them. The 2025 hybrid architectures are not a compromise born of uncertainty. They are a principled settlement, arrived at empirically, that resolves a genuine theoretical tension. Understanding why that tension exists, and why it resolves the way it does, is what this chapter is for.
The State Space Intuition: From Control Theory to Sequence Modeling
The state space model is not an invention of the deep learning era. It comes from control theory, where engineers have long described dynamical systems through a compact formalism: a hidden state that evolves over time in response to inputs, and an output that is read off from that state. A promising approach proposed modeling sequences by simulating the fundamental state space model — formally, x′(t) = A·x(t) + B·u(t) and y(t) = C·x(t) + D·u(t) — and showed that for appropriate choices of the state matrix A, this system could handle long-range dependencies mathematically and empirically. The elegance here is that the entire history of the sequence is compressed into the hidden state vector — not stored explicitly, but projected into a fixed-dimensional representation that evolves deterministically. This is linear time, O(n), because processing each new input requires only updating the hidden state and reading the output; there is no growing matrix of pairwise comparisons.
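In code, the linear-time character is easy to see. The sketch below runs the recurrence under a simple Euler discretization; this is an illustration of the formalism, not S4's actual discretization or parameterization:

```python
import numpy as np

def ssm_recurrence(A, B, C, D, u, dt=0.1):
    """Simulate x'(t) = A x(t) + B u(t), y(t) = C x(t) + D u(t)
    one token at a time. Memory is the fixed-size state x; total
    work is O(n) in the sequence length n."""
    n_state = A.shape[0]
    A_bar = np.eye(n_state) + dt * A   # Euler step: x_{k+1} = (I + dt*A) x_k + dt*B u_k
    B_bar = dt * B
    x = np.zeros(n_state)
    ys = []
    for u_k in u:                      # no pairwise comparisons, just a state update
        x = A_bar @ x + B_bar * u_k
        ys.append(float(C @ x + D * u_k))
    return np.array(ys)
```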
The problem, historically, was making this compression work for the rich, discrete, information-dense sequences that language comprises. The state matrix A is the heart of the system. Choose it poorly, and distant information decays exponentially in the hidden state; the model suffers the same vanishing-gradient pathology that plagued early recurrent neural networks. The key insight that unlocked SSMs for serious sequence modeling was the theory of HiPPO — High-order Polynomial Projection Operators — which provides a mathematically principled way to initialize A such that the hidden state optimally tracks a compressed history of the input. Concretely, HiPPO constructs A so that the hidden state at each timestep maintains the coefficients of a polynomial approximation to the entire input history, weighted by a measure that emphasizes recent inputs while preserving older ones at a rate proportional to their distance. This is not a heuristic initialization; it is a closed-form solution to the problem of what the state matrix should be if the goal is optimal history compression. From HiPPO came the line of research that produced S4.
Gu, Goel, and Ré's S4 architecture, published in 2021, was the first SSM architecture to demonstrate that this class of models could match or exceed transformers on genuinely difficult long-sequence benchmarks. S4 achieves strong empirical results across a diverse range of established benchmarks, including 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, substantially closing the gap to transformers on image and language modeling tasks while performing generation 60 times faster, and state-of-the-art results on every task from the Long Range Arena benchmark — including solving the Path-X task at sequence length 16,384 that all prior work had failed entirely. That Path-X result was not incremental progress. It was a qualitative demonstration that SSMs could handle the kind of very long-range dependency that transformers genuinely struggle with at scale.
The mechanism behind this capacity is structural. S4 uses a new parameterization of the state matrix A — decomposing it into a normal matrix plus a low-rank correction, which allows A to be diagonalized stably and reduces the core computation to a well-understood mathematical form called a Cauchy kernel. The efficiency gain during inference comes from an option unavailable to transformers: S4 can be unrolled as a recurrence, processing tokens one at a time with constant memory, or parallelized as a global convolution during training, processing the entire sequence simultaneously. The model switches modes depending on whether it is training or generating.
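The dual-mode trick can be sketched directly. Because the discretized Ā and B̄ are fixed, the model's response is a convolution with a kernel K = (CB̄, CĀB̄, CĀ²B̄, …) that can be materialized once per training step. Below is a naive version; S4 computes K far more efficiently via its Cauchy-kernel machinery:

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, length):
    """Materialize the convolution kernel K_j = C @ A_bar^j @ B_bar.
    Only possible because A_bar and B_bar do not depend on the input."""
    K, v = [], B_bar.copy()
    for _ in range(length):
        K.append(float(C @ v))
        v = A_bar @ v
    return np.array(K)

def ssm_as_convolution(K, u):
    """Training mode: compute every output position in parallel as the
    causal convolution y_k = sum_j K_j * u_{k-j}."""
    return np.convolve(u, K)[: len(u)]
```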
But S4 had a fundamental limitation that explains why it did not simply replace transformers on language tasks. Its state transition matrices — the parameters governing how the hidden state evolves — are fixed across the entire sequence. They do not change based on what the input contains. This property is called linear time-invariance, and it means the model applies the same transformation rules to every token, regardless of whether that token is semantically crucial or irrelevant filler. For continuous signals like audio, where the underlying physics really is approximately time-invariant, this works well. For language, where a pronoun's importance depends entirely on what noun it refers back to three hundred tokens ago, fixed transition rules are a crippling constraint.
Ordinary SSMs map input to output using the entire input history, compressed uniformly and without regard to content. This is acceptable or even desirable for some sequence modeling tasks, but a significant handicap for most advanced language modeling tasks. What S4 demonstrated in 2021, then, was a proof of concept: SSMs could handle long-range dependencies with striking efficiency. What they could not yet do was reason about those dependencies in a content-aware way — could not selectively decide, based on what they were reading, what to remember and what to forget.
Mamba and the Selective Scan: Content-Awareness Without Attention
The breakthrough in Gu and Dao's Mamba architecture, published in December 2023, was conceptually clean and practically decisive. A key weakness of prior subquadratic models is their inability to perform content-based reasoning. Letting the SSM parameters be functions of the input addresses this weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension. That deceptively simple observation — make the transition parameters input-dependent — transforms the S4 architecture into something qualitatively different.
In Mamba, the matrices B and C — which govern how input is written into and read out of the hidden state — along with the step size Δt (which governs the discretization of the continuous-time system), are all functions of the current input token. They are not fixed; they are computed freshly for every position by passing the input through small linear projection layers. The result is what the paper calls a Selective State Space Model, or S6 — the same structured state space machinery as S4, but now equipped with input-conditioned parameters that allow the model to modulate, moment to moment, how much of the current input gets written into the hidden state and how much of the existing state gets retained. A large Δt causes the model to forget the past quickly and weight the current token heavily. A small Δt preserves the existing state with minimal perturbation from the current input. The model learns, from data, which regime is appropriate for which contexts — essentially learning a policy for what is worth remembering.
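A minimal sketch may make the selectivity concrete. Everything below is illustrative: the projection matrices W_dt, W_B, W_C and the shapes are assumptions made for the sketch, and the real architecture adds gating, a local convolution, and a fused kernel:

```python
import numpy as np

def selective_ssm(u, A, W_dt, W_B, W_C):
    """Selective-scan sketch: u is (L, d); A is (d, n) with negative
    entries; W_dt (d, d), W_B (d, n), W_C (d, n) are learned projections
    (hypothetical shapes). Unlike S4, the transition applied at each
    step depends on the token being read."""
    L, d = u.shape
    n = A.shape[1]
    h = np.zeros((d, n))                       # fixed-size hidden state
    ys = []
    for t in range(L):
        dt = np.logaddexp(0.0, u[t] @ W_dt)    # softplus keeps dt positive, (d,)
        B_t = u[t] @ W_B                       # input-dependent write direction, (n,)
        C_t = u[t] @ W_C                       # input-dependent readout, (n,)
        A_bar = np.exp(dt[:, None] * A)        # discretized decay, (d, n)
        # Large dt -> A_bar near 0: forget the past, write the present.
        # Small dt -> A_bar near 1: preserve the state, barely write.
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * u[t][:, None]
        ys.append(h @ C_t)                     # readout, (d,)
    return np.array(ys)
```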
The selective state space model gives Mamba a capability previously possessed only by transformer models: the ability to focus on or ignore specific parts of past input history based on their present relevance. The comparison to transformer attention is instructive but imprecise. Attention is explicitly global: at every position, it computes a weighted combination over all prior positions simultaneously, with weights determined by the query-key dot product. Mamba's selectivity is sequential — it gates the flow of information through a hidden state that accumulates over time. The model cannot, in a single operation, retrieve a specific token from three hundred positions ago the way attention can. It can only influence what was retained in the hidden state as the sequence unfolded. This distinction matters enormously for understanding Mamba's strengths and weaknesses.
The other half of the Mamba innovation is hardware-aware design. Making the SSM parameters input-dependent destroys the ability to implement the model as a pure global convolution — you cannot precompute the kernel when the kernel changes for every input. This rules out the efficient convolutional mode, so Mamba instead introduces a hardware-aware parallel algorithm that runs in recurrent mode. The result is a parallel scan that processes the sequence recurrently but exploits GPU parallelism by keeping the expanded state in fast on-chip SRAM rather than in the slower, higher-capacity high-bandwidth memory (HBM) where most large tensors live. This is the same philosophy as FlashAttention — the insight that for modern hardware, memory bandwidth is often the binding constraint, not raw arithmetic throughput — and it is what makes Mamba's theoretical linear complexity translate into real empirical speed gains. The parallel scan fuses the recurrent update into a single GPU kernel, avoiding the repeated round-trips to HBM that would otherwise negate the arithmetic savings of linear-time recurrence.
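The mathematical fact that makes this possible is that the gated recurrence h_t = a_t·h_{t-1} + b_t composes associatively, so it can be evaluated as a scan in O(log L) parallel steps rather than L sequential ones. Below is a scalar NumPy sketch of the pattern; the production kernel fuses this with the memory-locality strategy described above:

```python
import numpy as np

def parallel_linear_scan(a, b):
    """Evaluate h_t = a_t * h_{t-1} + b_t (with h_0 = 0) for all t using
    a Hillis-Steele scan over the associative operator
        (a1, b1) o (a2, b2) = (a1 * a2, a2 * b1 + b2).
    log2(L) passes, each fully parallel across positions."""
    a, b = np.asarray(a, float).copy(), np.asarray(b, float).copy()
    shift = 1
    while shift < len(a):
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a_prev * a, a * b_prev + b   # RHS uses the pre-update a
        shift *= 2
    return b                                # b_t now equals h_t

# Sanity check against the sequential recurrence:
a, b = np.random.rand(8), np.random.rand(8)
h, hs = 0.0, []
for i in range(8):
    h = a[i] * h + b[i]
    hs.append(h)
assert np.allclose(parallel_linear_scan(a, b), hs)
```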
Mamba delivers five times higher throughput than transformers and scales linearly in sequence length, with performance improving on real data up to million-length sequences. At inference time the advantage is especially stark. A transformer serving a 100,000-token context must maintain a key-value cache proportional to that context length for every request in every layer. Mamba serves the same context with a fixed-size hidden state — the same memory footprint whether the input is 1,000 tokens or 1,000,000.
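A back-of-the-envelope calculation shows what that means for serving. The configuration below is hypothetical — a generic grouped-query transformer in 16-bit precision — chosen only to illustrate the scaling:

```python
def kv_cache_gb(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                bytes_per_value=2):
    """KV-cache size for one request: keys and values for every token,
    head, and layer. Grows linearly with context length."""
    n_bytes = n_tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
    return n_bytes / 1e9

print(kv_cache_gb(1_000))      # ~0.13 GB
print(kv_cache_gb(100_000))    # ~13 GB, per concurrent request
# A Mamba layer's recurrent state is the same size at any context length.
```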
The scaling results from the original paper validated the architecture at language model scale. The Mamba-3B model outperforms transformers of the same size and matches transformers twice its size, both in pretraining and downstream evaluation. Equivalent quality at half the parameters, with linear rather than quadratic scaling. At that point in late 2023, the question was whether this would hold at larger scales and harder tasks — whether Mamba was a genuinely general architecture or an efficient specialist suited to particular regimes.
Where Mamba Struggles: The Retrieval and Reasoning Gap
The strongest objection to treating Mamba as a transformer replacement comes from a clear empirical pattern: pure SSM architectures underperform transformers on tasks that require precise, targeted retrieval from long contexts. When you need to find a specific fact in a long document — exact copy, needle-in-a-haystack retrieval, multi-hop reasoning chains that depend on tracing a specific reference — the fixed-size compressed hidden state is a lossy representation. The transformer's attention mechanism, by contrast, attends directly to any prior token without information loss.
Recent state space models like Mamba are more efficient to train than recurrent neural networks and more capable at handling long-distance relationships, but they still lag behind comparably sized transformer language models. More specifically, pure Mamba struggles to develop in-context learning capabilities — the ability to update behavior based on examples provided in the prompt. In-context learning requires a mechanism for directly attending to those examples, which a compressive recurrent state cannot provide reliably.
There is a theoretical substrate to this limitation. SSMs have been shown to lack certain capabilities, including poor state-tracking abilities — for example, simply determining the parity of bit sequences. This is a startling result: there exist simple computational tasks, problems a first-year computer science student can solve, where state space models systematically fail even when given sufficient capacity. State tracking requires the model to remember a specific piece of information precisely, without the compression that makes SSMs efficient. The parity example is illuminating precisely because of its simplicity: to determine whether a sequence of bits contains an odd or even number of ones, a model must maintain a single running bit of information — current parity — through every position in the sequence without error. An SSM's hidden state, optimized for smooth polynomial compression of history, cannot reliably preserve that single sharp bit across long sequences, because the compression mechanism that makes it efficient at distributing contextual information over time is structurally antagonistic to lossless exact-recall operations. The same efficiency that makes Mamba fast at processing long audio sequences makes it lossy when the downstream task demands exact recall.
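The task is trivial to state in code, which is exactly what makes the failure striking:

```python
def parity(bits):
    """Exact state tracking with one bit of state: flip on every 1.
    Trivial as a program, yet known to be beyond SSMs whose state
    transitions are restricted (e.g., diagonal with non-negative
    entries)."""
    state = 0
    for b in bits:
        state ^= b          # the whole history compresses to one exact bit
    return state

assert parity([1, 0, 1, 1]) == 1   # three ones: odd parity
```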
None of this makes Mamba unimportant. It makes it architecturally specialized in a way that the initial excitement sometimes obscured. Mamba is a superb architecture for tasks where the relevant signal is distributed and contextual — understanding the overall tone and structure of a long document, processing continuous biological sequences, generating audio where temporal coherence matters over many timesteps. But for the tasks that define frontier language model quality benchmarks — multi-step reasoning, precise factual retrieval, complex instruction following — the transformer's exact-attention mechanism carries a capability advantage that Mamba's selectivity does not fully replicate.
That gap is a signal about where the architectures need each other.
The Hybrid Settlement: Jamba and the Architecture of Division of Labor
If transformers are expensive but precise and SSMs are cheap but lossy, the obvious question is whether you can build a system that uses each where it excels. Jamba — developed by AI21 Labs and released in March 2024 — is a base large language model built on a hybrid Transformer-Mamba mixture-of-experts architecture. Specifically, Jamba interleaves blocks of transformer and Mamba layers, drawing on the strengths of both model families. It was the first production-grade demonstration that this interleaving could be made to work at scale, and the architectural decisions AI21 Labs published reveal something deeper than engineering compromise.
Each Jamba block contains either an attention or a Mamba layer, followed by a multi-layer perceptron, producing an overall ratio of one transformer layer out of every eight total layers. That 1:7 ratio was discovered empirically through ablation studies, and those ablations revealed a structural insight. The most striking design principle to emerge was about layer placement: the studies showed a clear rule — never place transformer blocks at the front of the model. The initial layers, all Mamba layers in Jamba's design, act as efficient processors and feature extractors, scanning the entire long input sequence linearly and compressing relevant information into their hidden states. The sparsely interspersed transformer layers, placed later in the network, then perform their more computationally expensive global attention operations on these already-processed, more condensed representations.
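As a sketch, the interleaving pattern can be written down in a few lines. The block size matches the published ratio of one attention layer per eight; the exact position of the attention layer within each block is illustrative, not Jamba's actual index:

```python
def hybrid_schedule(n_blocks=4, layers_per_block=8, attn_index=4):
    """Build a Jamba-style layer schedule: one attention layer per
    block of eight, never placed at the front of the network."""
    layers = []
    for _ in range(n_blocks):
        for i in range(layers_per_block):
            layers.append("attention" if i == attn_index else "mamba")
    return layers

print(hybrid_schedule()[:10])
# ['mamba', 'mamba', 'mamba', 'mamba', 'attention', 'mamba', ...]
```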
This is a story about functional specialization across depth, not one architecture tolerating the other. The early Mamba layers do what SSMs do well: scan enormous inputs, extract distributed statistical structure, build a compressed but useful representation of the sequence as a whole. The later attention layers do what transformers do well: perform precise, content-targeted reasoning over the already-refined representation. The architecture has discovered a pipeline. Mamba is the pre-processor; attention is the reasoner.
AI21 Labs scaled its hybrid Mamba–Transformer architecture to 398 billion total parameters with 94 billion active in Jamba 1.5 Large — marking the first large-scale deployment of such a model. Jamba 1.5 Large interleaves Mamba and attention layers across 72 layers, using grouped-query attention and 16 Mixture-of-Experts (MoE) experts for efficient routing. The MoE component is not incidental. In the original 52-billion-parameter Jamba, the MoE layers allow the model to draw on just 12 billion parameters at inference, and its hybrid structure renders those 12 billion active parameters more efficient than a transformer-only model of equivalent size. The practical result: Jamba 1.5 supports a 256,000-token context and achieves top scores on long-context benchmarks, including state-of-the-art performance on NVIDIA's RULER benchmark — one of the more demanding long-context evaluations available, testing structured retrieval and multi-hop reasoning over very long inputs. That is exactly the regime where pure SSMs have historically struggled, and a hybrid architecture achieving state-of-the-art results there is the clearest empirical evidence that the hybrid settlement is a synthesis, not a concession.
The throughput numbers make the business case concrete. Jamba delivers three times the throughput on long contexts compared to Mixtral 8x7B, a leading open-weights model from Mistral AI. For an enterprise running document analysis, contract review, or retrieval-augmented generation pipelines over large corpora, that multiplier translates directly into infrastructure cost. In its particular configuration, Jamba fits in a single 80-gigabyte GPU — a deployment footprint that pure transformer architectures at comparable quality typically cannot match at these context lengths.
The Mamba-3 paper, published at ICLR 2026, extends this line of work further, making the state-transition matrix fully data-dependent alongside the existing input-dependent parameters, addressing known weaknesses in state-tracking expressivity. Mamba-2 and Gated DeltaNet layers have been incorporated into large-scale hybrid models that match the performance of pure transformer alternatives with substantially higher efficiency. The direction of travel across the field is consistent: no one building frontier-scale long-context models is ignoring SSMs.
Diffusion Models: A Structurally Different Kind of Challenger
Before arriving at the strategic implications of the hybrid consensus, it is worth acknowledging that the alternative-architecture landscape includes a third family competing on entirely different terms: diffusion models. Understanding diffusion is necessary because it represents the dominant production architecture for image and video generation — workloads where neither transformers nor SSMs have fully displaced it.
The key paper is Ho, Jain, and Abbeel's DDPM (Denoising Diffusion Probabilistic Models), published in 2020. The approach defines a forward process that gradually corrupts a training image by adding Gaussian noise over many timesteps until the image is indistinguishable from pure noise. A neural network is then trained to reverse this process — to denoise a slightly corrupted image into a slightly less corrupted one. At generation time, you start from pure noise and iteratively apply the denoising network, stepping from high-noise to low-noise until you arrive at a coherent image.
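Here is a sketch of the training-time computation under the standard DDPM parameterization. Notation follows the paper; the helper function itself is illustrative, not the authors' code:

```python
import numpy as np

def ddpm_training_pair(x0, t, alphas_cumprod, rng):
    """Produce one (noisy input, target) pair:
        x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I).
    The denoising network is trained to predict eps from (x_t, t):
        loss = || eps - eps_theta(x_t, t) ||^2."""
    eps = rng.standard_normal(x0.shape)
    abar_t = alphas_cumprod[t]
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Example noise schedule: linear betas, abar_t = prod(1 - beta).
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
x_t, eps = ddpm_training_pair(np.zeros((3, 32, 32)), t=500,
                              alphas_cumprod=alphas_cumprod,
                              rng=np.random.default_rng(0))
```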
The mathematical machinery that makes this work is score matching. Score matching is a training objective where the model learns the gradient of the log-density — called the score function — of the data distribution. Denoising score matching, a variant of this method, estimates the score function for noisy versions of the data distribution. In DDPMs, the reverse process is parameterized in a way that makes the training objective equivalent to denoising score matching over various noise levels. Rather than learning to produce samples directly, the model learns to estimate, at any noise level, which direction in pixel space points toward higher probability under the data distribution. This gradient estimate — the score — guides the denoising trajectory from noise back to data.
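To make the connection explicit in symbols (standard DDPM notation, added here for concreteness): for the Gaussian forward process, the score of the noised distribution is an affine function of the very noise the network is trained to predict.

```latex
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t}
  = -\frac{\varepsilon}{\sqrt{1 - \bar\alpha_t}}
\qquad\Longrightarrow\qquad
s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}
```

So a network that predicts the noise is, up to a known scaling, estimating the score at every noise level.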
Ho et al. showed an equivalence between denoising diffusion probabilistic models and score-based generative models, which independently learn a gradient of the log-density of the data distribution using denoising score matching. This equivalence, also articulated in Song and Ermon's line of work on score-based models, unified two previously separate research traditions and set the stage for the rapid improvement in image quality that followed.
Stable Diffusion operationalized DDPM by working in the compressed latent space of a pretrained variational autoencoder rather than in pixel space directly, dramatically reducing the dimensionality of the denoising problem and making high-resolution generation practical. DALL-E 3 improved text-image alignment largely by training on richer, regenerated image captions. Sora applied diffusion principles to video by treating videos as patches in space-time — a radically higher-dimensional problem that required rethinking both the denoising network architecture and the conditioning mechanism. The underlying principle across all these systems is the same: learn to reverse corruption, guided by score matching.
Diffusion models, in their standard formulation, are not sequence models. They generate all elements of the output simultaneously across many refinement steps, rather than autoregressively. This makes them well-suited for images and video, where spatial coherence is the primary constraint, and less naturally suited to language, where the output is inherently sequential and the quality metric is logical consistency rather than perceptual fidelity. Recent work on diffusion language models — applying the denoising framework to discrete token sequences — is active and technically interesting, but as of 2026 has not produced a language model competitive with the transformer-SSM hybrid frontier on standard reasoning benchmarks.
For the Chief AI Officer thinking about architectural strategy, diffusion models occupy a distinct category: they are the established production architecture for image and video workloads, they are not being displaced by transformers or SSMs in those domains, and the interesting current research concerns scaling them — larger diffusion transformer backbones, better conditioning, faster sampling via flow matching — not replacing them.
The Strategic Meaning of Architectural Pluralism
The 2025 consensus on hybrid architectures carries a strategic implication that extends well beyond the choice of which model to deploy today. For roughly five years, the transformer was not merely the dominant architecture — it was the only architecture worth taking seriously for frontier language tasks. This created a convenient but misleading simplification: understanding AI capability meant understanding transformers. The lesson of Mamba, S4, Jamba, and Mamba-3 is that this monoculture was empirically contingent, not theoretically necessary.
Large-scale hybrid models match the performance of pure transformer alternatives with substantially higher efficiency. Room for progress remains, particularly in advancing the Pareto frontier between model quality and inference efficiency. That Pareto frontier framing is exactly right. The question is not "which architecture wins" but "what combination of components achieves the best quality-efficiency tradeoff for a given workload." At a fixed quality level, hybrids are cheaper to run at long context. At a fixed compute budget, hybrids can process longer context than pure transformers. Neither claim requires the SSM to be universally superior; both require only that the hybrid exploit structural complementarity.
Transformers have remained the dominant form of large language model in the years since the original Mamba paper, but the Mamba architecture has been incorporated into a growing number of open-source models. Some, such as Mistral AI's Codestral Mamba, are pure Mamba models. Many more — including AI21's Jamba series and IBM's Granite 4.0 — are hybrid models incorporating both attention and SSM layers. The fact that IBM is building hybrid SSM-attention layers into its enterprise Granite series is a signal worth registering. This is not academic experimentation. It is a production decision made by an organization with significant infrastructure stakes in correctness. IBM's adoption is particularly telling because Granite targets regulated enterprise verticals — financial services, healthcare, legal — where inference cost at long context is a first-order operational constraint, not a secondary optimization.
The deeper point is about what architectural diversity means for AI governance, procurement, and capability evaluation. An organization evaluating a long-context model for document intelligence workflows should be asking not just "what is the quality on MMLU (Massive Multitask Language Understanding, a standard AI benchmark) or GPQA (Graduate-Level Google-Proof Q&A, a benchmark of expert-level reasoning questions)" but "what is the architectural composition, and does it match the retrieval-versus-synthesis balance of our actual workloads?" A hybrid model with a 7:1 Mamba-to-attention ratio and 256,000-token context will serve differently from a pure transformer with flash attention and a sliding window — not better or worse in the abstract, but differently in ways that matter for specific tasks. The evaluator who understands that distinction makes a procurement decision; the one who does not makes a benchmark-chasing decision.
There is also an inference cost implication that is strategic rather than merely operational. The key-value cache scaling problem for transformers is not solved by FlashAttention or grouped-query attention — those techniques reduce constants, not the asymptotic scaling. As context windows extend into the hundreds of thousands of tokens that retrieval-augmented pipelines increasingly demand, the operational advantage of hybrid architectures compounds. The enterprise running a retrieval-augmented generation pipeline over a 200,000-token document corpus is not just choosing a model quality level; it is choosing an infrastructure cost structure for the lifetime of that deployment.
The transformer will not disappear. Attention, for tasks where it matters, remains the most powerful mechanism available for content-targeted reasoning. But the period of architectural monoculture — where every serious frontier model was a pure transformer, differing only in scale and training recipe — has ended. The Mamba lineage proved that a principled alternative was possible. Jamba proved that hybrids work at production scale. Mamba-3 proved the research is accelerating rather than converging.
Carry this out of the chapter as a revised prior about what "architecture" means when evaluating AI systems for deployment. Knowing that a model is transformer-based is no longer sufficient. The question is which layers are doing which kind of computation, what the ratio of exact attention to compressive recurrence is, and whether that ratio was tuned for the workload you have. The field has moved past the era of one-architecture answers, and the organizations that understand why hold a real analytical advantage over those still treating transformer depth and parameter count as the only variables that matter.