M8E1: Post-Hoc XAI to Mechanistic Interpretability: What We Actually Know



Approximating a Model Versus Understanding One

There is a moment in most organizations' AI governance journeys when someone discovers SHAP values and feels, briefly, that the interpretability problem is solved. The charts are clean. The feature importances are ranked. A bar labeled "credit utilization" or "token frequency in prior turns" extends confidently to the right, and the decision feels explained. This is the moment to be most suspicious.

LIME and SHAP are genuine contributions to machine learning practice. The distinction that matters is not their technical quality — it is what they are computing. LIME (Local Interpretable Model-agnostic Explanations) works by perturbing the input around a specific instance, observing how the model's output changes, and fitting a simpler interpretable model — a linear regression, a decision tree — to those perturbations. That surrogate model is then presented as the explanation. Notice what just happened: you trained a second, simpler model to approximate the behavior of the first model locally, and then you read off that surrogate's feature coefficients. You learned something about the surrogate. Whether the surrogate faithfully represents what the original model computed is a separate question — one that LIME cannot answer from the inside.
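A minimal sketch of that local-surrogate procedure makes the point concrete. This is not the reference LIME implementation: the black-box function, the binary presence/absence perturbation scheme, and the kernel width below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box_predict(x):
    # Stand-in for the model being explained (hypothetical): a nonlinear
    # function of three tabular features.
    return np.tanh(2.0 * x[..., 0] - x[..., 1] * x[..., 2])

def lime_style_explanation(x0, n_samples=500, kernel_width=0.75):
    """Fit a weighted linear surrogate around one instance x0."""
    d = x0.shape[0]
    # Perturb: randomly switch features "off" (here: back to zero) around x0.
    masks = rng.integers(0, 2, size=(n_samples, d))
    perturbed = masks * x0
    preds = black_box_predict(perturbed)
    # Proximity kernel: perturbations that keep more of x0 get more weight.
    dists = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)
    # Weighted least squares fit of a linear surrogate on the binary masks.
    X = np.hstack([masks, np.ones((n_samples, 1))])      # add an intercept
    W = np.diag(weights)
    coef, *_ = np.linalg.lstsq(W @ X, W @ preds, rcond=None)
    return coef[:d]   # per-feature coefficients of the *surrogate*

x0 = np.array([0.8, 0.3, -1.2])
print(lime_style_explanation(x0))   # describes the surrogate, not the model
```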

SHAP takes a more principled approach. Drawing on Shapley values from cooperative game theory, it asks: across all possible orderings in which features could be introduced into a prediction, what is the average marginal contribution of each feature to the final output? The mathematics is sound. Shapley values have uniqueness and efficiency properties that make them the only attribution method satisfying a specific set of fairness axioms. For tabular data and classical ML models like gradient boosted trees, SHAP can provide genuine insight, because those models operate on relatively low-dimensional, semantically meaningful features. When the "features" are what the model uses — when "loan-to-income ratio" is a column the tree splits on — Shapley attribution is meaningful.
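The definition is easy to state in code for a toy model, though real SHAP implementations (Kernel SHAP, TreeSHAP) approximate it rather than enumerating every ordering. In the sketch below, the three-feature model, the single zero baseline standing in for "absent" features, and the input values are all illustrative assumptions.

```python
import numpy as np
from itertools import permutations

def model(x):
    # Toy model with an interaction term (hypothetical stand-in).
    return 3.0 * x[0] + 2.0 * x[1] * x[2]

def shapley_values(x, baseline):
    """Exact Shapley values: average marginal contribution over all feature
    orderings, with 'absent' features held at a baseline value. (Real SHAP
    implementations approximate this and average over a background
    distribution rather than a single baseline.)"""
    d = len(x)
    phi = np.zeros(d)
    orderings = list(permutations(range(d)))
    for order in orderings:
        present = baseline.copy()
        prev = model(present)
        for i in order:
            present[i] = x[i]          # introduce feature i
            cur = model(present)
            phi[i] += cur - prev       # its marginal contribution here
            prev = cur
    return phi / len(orderings)

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
phi = shapley_values(x, baseline)
# Efficiency axiom: the attributions sum to f(x) - f(baseline).
print(phi, phi.sum(), model(x) - model(baseline))
```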

Language models do not have features in this sense. A transformer does not process "topic: finance" or "sentiment: negative." It processes token embeddings flowing through residual streams, attended over by hundreds of heads implementing computations that may have nothing to do with any category a human would naturally impose. When you run SHAP on a language model, you typically perturb tokens — masking them, replacing them — and measure output changes. The result tells you which tokens, when removed, most change the output distribution. This is a behavioral measurement about perturbation sensitivity. It is not a measurement of what the model computed. A token can receive high SHAP attribution because it is contextually pivotal while doing nothing algorithmically interesting inside the model, or receive near-zero attribution while being essential to an intermediate computation that cancels out in output space.
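To make the behavioral nature of that measurement concrete, here is a minimal occlusion-style sketch (not the SHAP library itself), assuming GPT-2 loaded through the Hugging Face transformers API. It drops whole words rather than tokens for readability and scores how far the next-token distribution moves when each one is removed.

```python
# Assumes: pip install torch transformers (GPT-2 weights downloaded on first run)
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_dist(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return F.softmax(logits, dim=-1)

text = "The Golden Gate Bridge is located in the city of"
words = text.split()
base = next_token_dist(text)

# Occlusion: drop one word at a time and measure how far the next-token
# distribution moves. This scores perturbation sensitivity of the output;
# it says nothing about the internal computation that produced it.
for i, w in enumerate(words):
    occluded = " ".join(words[:i] + words[i + 1:])
    kl = F.kl_div(next_token_dist(occluded).log(), base, reduction="sum")
    print(f"{w:>10s}  delta-KL = {kl.item():.3f}")
```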

The practitioner implication is direct. If your compliance framework, audit process, or regulatory filing relies on SHAP attributions as explanations of model reasoning, you have documented the behavior of a surrogate approximation. You have not documented what the model did. This gap matters less for auditing outputs — SHAP remains useful for catching systematic input-output correlations — and it matters enormously for any claim about causality, fairness, or mechanism. "This feature correlates with output changes" and "this is how the model produces outputs" are not the same statement. One is a behavioral correlation. The other is a causal account.


The Attention Visualization Trap

Before mechanistic interpretability became a research field with that name, the dominant approach to looking inside transformers was attention visualization. The logic seemed impeccable: transformers compute attention weights explicitly, those weights are interpretable as soft choices about which tokens to attend to, so visualizing the attention pattern reveals where the model is "looking." Papers shipped beautifully colored heat maps. Attention from the output token back to relevant input tokens lit up in ways that felt intuitively correct. The field briefly convinced itself it had cracked the problem.

It had not. Attention weights determine how much each value vector contributes to the current representation, but they say nothing about whether that contribution is meaningfully large relative to other terms, nothing about how the value vector itself is structured, and nothing about how the attended information combines with the residual stream already present at that position. A head can attend strongly to a token and write a near-zero update. A head can attend diffusely across many tokens and produce an enormous, decisive update. High attention to a token does not imply that token caused the output.

The deeper problem is that attention weights are not the right object of analysis. Transformers operate on residual streams: each token position maintains a vector that accumulates contributions from every layer, with attention heads and MLP blocks writing to and reading from that shared stream. What matters is the magnitude and direction of what each component writes into the residual stream, not which positions it reads from. Two attention heads with identical weight patterns but different value matrices will produce completely different effects. The visualization that has dominated XAI tooling for transformers measures the read pattern while leaving the write effect opaque. A concrete illustration: Jain and Wallace (2019) showed empirically that attention weights are only weakly correlated with gradient-based feature importance measures on the same inputs, and that adversarially constructed attention distributions — patterns that look completely different — can produce essentially identical model predictions. The heat map and the computation can be entirely decoupled.
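The decoupling is easy to reproduce in a toy setting. In the sketch below, with random matrices throughout and dimensions chosen arbitrarily, two heads read from the residual stream with the identical attention pattern but write updates whose magnitudes differ by three orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 32, 8

resid = rng.normal(size=(seq_len, d_model))            # toy residual stream
attn = rng.dirichlet(np.ones(seq_len), size=seq_len)   # one shared attention pattern

def head_write(attn, resid, scale):
    """What a head adds to the residual stream: pattern @ (resid @ W_V) @ W_O."""
    W_V = rng.normal(size=(d_model, d_head)) * scale
    W_O = rng.normal(size=(d_head, d_model)) * scale
    return attn @ (resid @ W_V) @ W_O

write_a = head_write(attn, resid, scale=1.0)
write_b = head_write(attn, resid, scale=1e-3)   # same reads, near-zero write

print(np.linalg.norm(write_a), np.linalg.norm(write_b))
# Identical heat map, completely different contribution to downstream computation.
```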

Organizations have made deployment decisions based on attention visualizations — decisions about which inputs drive model behavior, which topics a model is focused on, and whether the model is attending to appropriate versus inappropriate evidence. Those decisions rested on a misunderstanding of what was being measured. The information that would license those conclusions — the causal structure of computation within the network — was not being accessed.


Why LLMs Are Fundamentally Harder to Interpret Than Classical Models

The failure of attention visualization points at something more general. Neural language models of the current generation are not just large versions of interpretable models. They are architecturally organized in ways that violate the assumptions on which most XAI intuitions rest.

Classical ML models — logistic regressions, gradient boosted trees, even shallow neural networks — operate in a regime where the assumption of approximate feature independence is defensible. Features are defined by humans, carry semantic content, and the model's predictions can often be sensibly decomposed along those features. Interpretability tools designed for this regime import assumptions about what a "feature" is, what "contribution" means, and how causes compose. Apply those assumptions to a transformer, and they break almost immediately.

The first failure mode is polysemanticity. In a language model's hidden layers, individual neurons — individual dimensions of activation vectors — do not cleanly correspond to single human-interpretable concepts. A single neuron might activate on academic citations, specific programming syntax, and references to a particular historical era — patterns that share no obvious common meaning but have learned to share a representational slot. When you ask SHAP to attribute importance to input features and that attribution flows through polysemantic intermediate representations, the causal chain it traces is not the model's reasoning. It is an artifact of the projection from the model's distributed representations onto human-interpretable input tokens.

The second failure mode is superposition, which is the mechanism behind polysemanticity and considerably stranger than polysemanticity alone. Neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. A vector space of dimension d can hold at most d orthogonal directions — but it can hold exponentially many nearly-orthogonal directions, each of which interferes only slightly with the others. If activation patterns are sparse — if, at any given moment, only a small fraction of the features the model has learned are relevant — then these nearly-orthogonal directions can coexist without catastrophic interference. The model gets more representational capacity than its nominal dimensionality would suggest, at the cost of mixing features together in ways that are invisible to per-neuron analysis.
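The geometric fact doing the work here is easy to check numerically. The sketch below, with dimensions chosen arbitrarily, packs sixteen times more random unit directions than dimensions into a space and measures their pairwise interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 4096          # 16x more directions than dimensions

# Random unit vectors standing in for learned feature directions.
dirs = rng.normal(size=(n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Pairwise interference = |cosine similarity| between distinct directions.
cos = dirs @ dirs.T
np.fill_diagonal(cos, 0.0)
print(f"max |cos| = {np.abs(cos).max():.3f}, mean |cos| = {np.abs(cos).mean():.3f}")
# With d=256 and 4096 directions, the mean |cos| comes out around 0.05 and the
# max around 0.35: thousands of directions coexist in 256 dimensions, each
# interfering only slightly with the others, provided activations are sparse.
```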

The third failure mode is emergence. Behaviors that exist in large models do not exist as degraded versions of behaviors in small models — they can appear discontinuously, enabled by combinations of components that individually appear to do nothing of note. The capabilities of GPT-5 and Claude 4 that make them useful for complex reasoning tasks are not present in miniature form in a 7-billion-parameter model and merely scaled up. They may depend on circuits that simply do not exist below a certain scale threshold. If features and circuits are universal, insights from studies on small models can transfer to larger ones. If they are not, a significant amount of independent effort will be required to interpret each unexamined model on each task.

These three failure modes together explain why the XAI toolkit that serves adequately for credit scoring models or fraud detection classifiers produces something much closer to an illusion when applied to frontier language models. The tools are asking the wrong questions, on the wrong level of analysis, about architectures the tools were not designed to handle.


Mechanistic Interpretability: What Reverse Engineering Actually Means

The field that now calls itself mechanistic interpretability began from a different starting premise. Rather than asking "what input features influence outputs," it asks: what computation does this network implement? The goal is a precise, mechanistic account of what the weights do — not a convenient approximation of the model's input-output function.

Mechanistic interpretability is distinguished by a specific focus on systematically characterizing the internal circuitry of a neural net. The analogy is to how a compiler engineer might reverse-engineer an executable: not by observing how programs behave, but by working from the binary backwards to the underlying operations. The same input-output correlation can be produced by wildly different underlying computations, and if you care about safety, reliability, or the conditions under which behavior will generalize, the underlying computation is what you need.

The foundational transformer circuits work by Elhage, Nanda, Olsson, and collaborators at Anthropic — published in A Mathematical Framework for Transformer Circuits in 2021 — developed exactly this kind of decomposition. The key technical move was to analyze transformer computation not in terms of layers but in terms of the paths through which information flows. In a transformer, each layer's output adds to the residual stream rather than replacing it, which means you can express the final logits as a sum of contributions from every possible path through the network: the direct path from the embedding, paths through individual attention heads, paths through MLP layers, paths that compose multiple components. This residual stream framework makes it possible to identify which components are responsible for which aspects of the output — not by perturbation but by algebraic decomposition.
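A toy version of that accounting, with random vectors standing in for real component outputs and the final layer norm omitted for simplicity, shows the mechanics: because the residual stream is a sum of writes, the logits decompose into a per-component sum once multiplied by the unembedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 100

# Toy component writes to the residual stream at the final position:
# the embedding's direct path, two attention heads, and one MLP block.
writes = {
    "embed":  rng.normal(size=d_model),
    "head_0": rng.normal(size=d_model),
    "head_1": rng.normal(size=d_model),
    "mlp_0":  rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))     # unembedding matrix

final_resid = sum(writes.values())          # the residual stream is a sum
logits = final_resid @ W_U

# Because everything here is linear (real analyses must also handle the final
# layer norm), each component's contribution to any logit can be read off directly.
target = 42                                 # some token of interest
contribs = {name: (w @ W_U)[target] for name, w in writes.items()}
print(contribs, sum(contribs.values()), logits[target])   # the parts sum to the whole
```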

The most concrete early result was the identification of induction heads. The induction head is a circuit whose function is to look back over the sequence for previous instances of the current token — call it A — find the token that came after the previous instance of A, and predict that same token will follow again. The circuit consists of two attention heads: a previous-token head that encodes the information that one token follows another in the residual stream, and an induction head that reads that encoding to promote the expected continuation as the next token prediction.

This is a genuinely mechanistic finding. Not a description of behavior, but an account of which components produce that behavior, how they compose, and what computation they implement. The induction head is not a metaphor for pattern completion. It is a specific algorithm: find-previous-occurrence, read-what-followed, promote-that-token. Transformer language models undergo a phase change early in training during which induction heads form and in-context learning improves dramatically. When the transformer architecture is changed in ways that shift whether induction heads can form, the improvement in in-context learning shifts in a precisely matching way. When induction heads are directly knocked out at test time in small models, in-context learning decreases sharply. Critically, this causal verification — not just correlation between head formation and capability, but direct ablation producing predictable capability loss — is what separates the induction head result from a behavioral observation dressed in mechanistic language.
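In practice, candidate induction heads are located by scoring attention patterns on a random token sequence repeated twice: an induction head places its attention mass on the token immediately after the previous occurrence of the current token. The sketch below shows that scoring computation; the attention tensor is a random placeholder here, whereas in a real analysis it would be cached from a hooked forward pass (for example with a library such as TransformerLens).

```python
import numpy as np

rng = np.random.default_rng(0)

# A random sequence repeated twice: positions T..2T-1 repeat positions 0..T-1.
T = 16
seq_len = 2 * T
n_heads = 12

# Placeholder for per-head attention patterns, shape [n_heads, query, key].
# In practice these come from the model, not from random numbers, and would
# be causally masked.
attn = rng.dirichlet(np.ones(seq_len), size=(n_heads, seq_len))

def induction_score(attn):
    """Average attention from each second-half query position t to its
    induction target: the position right after the previous occurrence of
    the same token, i.e. key position t - T + 1."""
    queries = np.arange(T, seq_len)
    targets = queries - T + 1
    return attn[:, queries, targets].mean(axis=1)   # one score per head

print(induction_score(attn))   # genuine induction heads score close to 1.0
```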

The indirect object identification (IOI) result from Wang et al., 2022 pushed this approach into more naturalistic language tasks. Given sentences of the form "John gave Mary the book; then John told ____," a language model must identify that the blank should be filled with Mary, not John — the indirect object, not the subject. Wang et al. traced the full explanation for how GPT-2 Small performs this task, encompassing 26 attention heads grouped into seven main classes, discovered using a combination of interpretability approaches relying on causal interventions. The circuit includes name-mover heads that copy the indirect object name to the output position, duplicate token heads that detect that the subject name appears twice, and S-inhibition heads that use that signal to suppress attention to the subject name. This is a specific set of components with specific roles, connected by specific information flows, whose collective computation produces the correct output.
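The workhorse causal intervention behind results like this is activation patching: run the model on a clean prompt and a corrupted prompt, copy one component's clean activation into the corrupted run, and measure how much of the correct behavior returns. The sketch below reproduces that logic on a deliberately tiny synthetic "model" with two parallel components writing to a shared output path; it illustrates the method, not the IOI experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_a, W_b, W_out = (rng.normal(size=(d, d)) for _ in range(3))

def run(x, cache=None, patch=None):
    """A toy 'model' with two parallel components. `cache` records their
    activations; `patch` = (name, value) overwrites one component's
    activation before the output is computed (activation patching)."""
    acts = {"comp_a": np.tanh(x @ W_a), "comp_b": np.tanh(x @ W_b)}
    if patch is not None:
        name, value = patch
        acts[name] = value
    out = (acts["comp_a"] + acts["comp_b"]) @ W_out
    if cache is not None:
        cache.update(acts)
    return out

metric = lambda out: out[0]            # stand-in for a logit difference
clean_x, corrupt_x = rng.normal(size=d), rng.normal(size=d)

clean_cache = {}
clean = metric(run(clean_x, cache=clean_cache))
corrupt = metric(run(corrupt_x))

# Patch each component's clean activation into the corrupted run; the fraction
# of the clean-vs-corrupt gap recovered estimates that component's causal role.
for name in ("comp_a", "comp_b"):
    patched = metric(run(corrupt_x, patch=(name, clean_cache[name])))
    print(name, "recovers", (patched - corrupt) / (clean - corrupt))
```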

The significance of these results for governance is substantial — and so is the caveat. These are findings about GPT-2 Small and similarly sized toy models. Whether the specific circuits identified in two-layer attention-only models or GPT-2's 12 layers persist, in recognizable form, in Claude 4 or GPT-5 is an active research question. The mechanisms may be universal. They may not. The work of scaling mechanistic interpretability to frontier models is unfinished.


The Superposition Hypothesis and the Sparse Autoencoder Response

What makes the superposition hypothesis so unsettling is that it suggests the entire frame of "what does this neuron represent" is wrong — not because neurons don't represent things, but because they represent too many things simultaneously, and disentangling them requires techniques that operate at a different level of analysis entirely.

Elhage et al. hypothesize that polysemanticity results from models learning more distinct features than there are dimensions in the layer. Since a vector space can only have as many orthogonal vectors as it has dimensions, the network learns an overcomplete basis of non-orthogonal features. If the model needs to track ten thousand features but only has a thousand dimensions to work with, it can do so — imperfectly, with interference — by representing features as directions in the high-dimensional space rather than as individual dimensions. The crucial enabling condition is sparsity: if most features are inactive on any given input, the interference between features remains acceptably small, and the model maintains its compressed representation without catastrophic confusion.

This creates a specific technical problem. If features are directions in activation space rather than individual neurons, and if those directions are overcomplete, how do you find them? The answer the field has converged on is sparse autoencoders (SAEs). The approach is dictionary learning: train an autoencoder that takes a model's internal activations as input, expands them into a much higher-dimensional latent space, enforces sparsity so that only a small number of latent dimensions are active at once, and then reconstructs the original activations from that sparse representation. SAEs attempt to identify those overcomplete directions in activation space by reconstructing the internal activations of a language model under a sparsity constraint.
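A minimal version of that architecture and training objective fits in a few lines. The sketch below is illustrative only: the layer sizes, sparsity penalty weight, and training loop are placeholder choices, the "activations" are random tensors rather than cached model activations, and production SAEs add refinements (decoder weight normalization, dead-feature resampling) omitted here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Dictionary-learning SAE: expand activations into an overcomplete,
    sparse latent basis, then reconstruct them from that basis."""
    def __init__(self, d_model=512, expansion=16):
        super().__init__()
        d_dict = d_model * expansion
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        latents = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, latents, acts, l1_coeff=1e-3):
    # Reconstruction fidelity plus an L1 penalty that pushes most latents to zero.
    return ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(200):
    acts = torch.randn(256, 512)        # placeholder for cached MLP/residual activations
    recon, latents = sae(acts)
    loss = sae_loss(recon, latents, acts)
    opt.zero_grad()
    loss.backward()
    opt.step()
```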

The Anthropic team's 2023 paper Towards Monosemanticity: Decomposing Language Models with Dictionary Learning was the pivotal demonstration. They used a sparse autoencoder to generate learned features from a trained model that offer a more monosemantic unit of analysis than the model's neurons themselves. The architecture was applied to a one-layer transformer, specifically to decompose the MLP activations. Their approach — dictionary learning with a 16× expansion trained on 8 billion MLP activations — extracted nearly 15,000 latent directions where human raters found 70% cleanly mapped to single concepts like Arabic script or DNA motifs. Seventy percent monosemanticity, by human evaluation, against features that were invisible and uninterpretable in the raw neuron basis.

The 2024 follow-up, Scaling Monosemanticity, applied the same method to Claude 3 Sonnet — a production system, not a toy model. The SAE they trained contained approximately 34 million features. Among those, they found a specific combination of neurons that activates when the model encounters a mention or image of the Golden Gate Bridge. They could identify these features, tune the strength of their activation up or down, and observe corresponding changes in Claude's behavior.

The demonstration that followed was both scientifically significant and strategically clarifying. The researchers explored feature steering using a technique called feature clamping, where the activations in the latent space of the SAE are hard-coded to a specific value. The activations from the middle layer of the LLM are then passed through the SAE before continuing through subsequent layers, so the decoder is influenced by the clamped feature rather than reconstructing the input activations as designed during training. This is the method used to create Golden Gate Claude — a version of the model so dominated by the Golden Gate Bridge feature that when asked to describe its physical form, it responded that it was the iconic suspension bridge spanning San Francisco Bay. The feature was not a label imposed from outside. It was a direction learned by the SAE that had causal influence on the model's outputs. Despite scaling SAEs to 34 million features, Anthropic estimates there likely remain orders of magnitude more features to be found.
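Mechanically, the clamp is a one-line intervention on the SAE's latent vector. The sketch below reuses the SparseAutoencoder class and instance from the earlier sketch; the feature index and clamp strength are placeholders, not the actual Golden Gate Bridge feature.

```python
import torch

@torch.no_grad()
def clamp_feature(sae, acts, feature_idx, value):
    """Run activations through the SAE, but hard-code one latent to `value`
    before decoding. The decoded vector is what continues through the model's
    later layers in place of the original activations."""
    latents = torch.relu(sae.encoder(acts))
    latents[..., feature_idx] = value          # the clamp
    return sae.decoder(latents)

# Example with placeholder index and strength: steer a hypothetical feature
# far above its usual activation range and hand the result back to the model.
acts = torch.randn(4, 512)                     # activations from some middle layer
steered = clamp_feature(sae, acts, feature_idx=1234, value=10.0)
```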

The scaling results also revealed something that should give pause to any confident narrative about SAEs as the complete solution. Ensuring that extracted features remained monosemantic became increasingly difficult, because feature superposition is more prevalent in larger models. More capable models compress more features more aggressively, which makes the SAE problem harder precisely where you most want it to work. And there is a deeper issue: the SAE features are learned, not derived from first principles. They are the latent dimensions that minimize reconstruction error under sparsity constraints. There is no guarantee they correspond to the features the model uses in any deep computational sense — they are the most parsimonious basis that fits the activations, which is suggestive but not definitive.


What This Field Has Proved, and What It Has Not

The circuits results and the SAE results are, by any reasonable standard, more genuine than anything the XAI toolkit produced for transformers. They identify specific computational structures, verify their causal roles through ablations and interventions, and demonstrate that artificially activating a feature produces predictable changes in model behavior. This is evidence of mechanism. It is not mechanism at the level of complete understanding.

The IOI circuit in GPT-2 Small accounts for the behavior of 26 attention heads on one specific linguistic task. GPT-2 Small has 144 attention heads across 12 layers and approximately 117 million parameters. The portion of the model's computational budget that has been given a precise mechanistic account remains, across all circuits work to date, a small fraction of even small models. At frontier scale, the gap is staggering. Claude 4 and GPT-5 operate at parameter counts orders of magnitude larger, with depths and widths where the composition of circuits into larger computations is genuinely unexplored territory.

Feature steering can lead to unpredictable changes across model outputs. When a single feature is steered, unexpected changes sometimes appear in model selections across domains not directly related to the steered feature. Steering can also compromise response quality and relevance at extreme steering values. This is Anthropic's own empirical feedback from their follow-up evaluation of the technique. Feature steering works. It is not a clean surgical instrument. Features at the scale SAEs identify are not independent modules — they are directions in an entangled representational space, and perturbing one direction ripples through the network in ways that are, at present, difficult to fully anticipate.

There is also a methodological tension in the field that serious practitioners should hold. The SAE training process optimizes reconstruction fidelity plus sparsity — it finds the most parsimonious sparse dictionary that approximates the activations. "Most parsimonious" and "causally fundamental" are not the same thing. The central claim of the Towards Monosemanticity paper — that dictionary learning can extract features significantly more monosemantic than neurons — is well-supported by the evidence. Whether those features are the actual atomic units of computation in the model, or whether they are a useful approximation of a more complex representational structure, is a harder question the research community is actively working to answer. One concrete indicator of this tension: when SAEs trained on different random seeds are compared, the learned feature dictionaries are similar but not identical, which suggests the decomposition is finding a good basis rather than the unique true basis implied by a fully causal account.
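That seed-sensitivity check is itself straightforward to express: match each decoder direction from one SAE to its most similar direction in the other and look at the distribution of best-match similarities. The sketch below runs on random placeholder dictionaries; with real decoder weights from two training seeds, the same computation is one way to quantify "similar but not identical".

```python
import numpy as np

def best_match_similarity(dict_a, dict_b):
    """For each unit-normalized feature direction in dict_a, the cosine
    similarity of its closest direction in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return np.abs(a @ b.T).max(axis=1)

rng = np.random.default_rng(0)
# Placeholders for two SAEs' decoder weight matrices [n_features, d_model];
# in practice these come from training runs differing only in random seed.
dict_a = rng.normal(size=(4096, 256))
dict_b = rng.normal(size=(4096, 256))
sims = best_match_similarity(dict_a, dict_b)
print(f"median best-match cosine: {np.median(sims):.2f}")
```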

The Transformer Circuits Thread at Anthropic, now producing results on Claude 3.5 Haiku (Anthropic's lightweight production model) and beginning to address full production models, has also developed attribution graphs that trace computation through models at a finer grain than feature activation alone. A method to explain attention patterns in terms of feature interactions has been developed and integrated into these attribution graphs. This is ongoing science, not a deployed audit capability. The toolchain for applying mechanistic interpretability to a frontier model at governance scale does not yet exist in productized form.


The Governance Gap

The picture that emerges from the transition from post-hoc XAI to mechanistic interpretability is neither the one XAI practitioners want to hear nor the one AI skeptics typically offer. It is stranger and more specific than either.

Post-hoc XAI methods are not fraudulent. For classical ML models — for the gradient boosted tree your fraud team uses, for the logistic regression embedded in a credit underwriting workflow — SHAP and LIME provide genuine information about model behavior that is relevant to governance. The mistake is assuming these tools transfer to language models. They do not, for structural reasons that are now well-understood, and continuing to use them as if they do produces regulatory filings, audit reports, and governance frameworks that rest on mischaracterized evidence.

Mechanistic interpretability is building the first genuine explanations of how neural networks compute. The induction head result is real. The IOI circuit is real. The SAE features in Claude 3 Sonnet are real and causally active. What sets this work apart is the emphasis on understanding the mechanistic building blocks underlying models — with the goal of eventually reverse-engineering them — rather than giving humans tools to predict model behavior without those tools having any correspondence to what the models are actually doing internally.

The maturity gap is also real. As of 2024, the mechanistic interpretability team at Anthropic had grown to 17 people, representing a significant fraction of the estimated 50 full-time mechanistic interpretability researchers worldwide. The community is small. The tools require significant ML expertise to apply. The results so far cover small models, specific tasks, and single circuits or feature sets — not the full computational fabric of a frontier model. Scaling SAEs to 34 million features on Claude 3 Sonnet was a major technical achievement that required months of compute. Running that process on every model version, for every deployment context, for governance purposes, is not yet feasible.

For decision-makers deploying or regulating frontier AI systems, this creates a specific epistemic situation. The tools available at scale — SHAP, attention visualization, behavioral evaluations — are genuine but limited, and their limitations are structural rather than incidental. The tools that would provide mechanistic accounts are real, early-stage, and require ML research capability to apply. There is no audit-ready mechanistic interpretability toolkit for Claude 4 or GPT-5. There may be within two or three years; the field is moving at a pace that would have seemed implausible in 2021.

The productive response is to be precise about what your evidence licenses. "Our SHAP analysis shows that model outputs are sensitive to input token X" is a true and useful statement. "Our SHAP analysis explains why the model produces output Y" is an overreach — and the gap between those two sentences is exactly where mechanistic interpretability is doing its work.

The strangest finding from this body of research is not that models are more opaque than we thought. It is that they may be more structured than we feared — that specific algorithms, implemented by specific circuits, underlie behaviors that previously appeared as undifferentiated emergence. That structure is becoming legible, slowly and with difficulty. The problem of interpretability is not solved. It is, for the first time, being addressed with tools adequate to the architecture. You now have enough understanding to know precisely how much you do not yet know.