M3E2: MoE Architecture: How GPT-4, Gemini, and Mixtral Actually Work
Specialization as a Theory of Intelligence
Every frontier model you use at scale today is not the model you think you're using. When you send a prompt to a system running GPT-4, Gemini 1.5 Pro, or Mixtral 8x22B, the computation that produces your response does not flow through the entire model. It flows through a selected fraction—a routing decision made at every layer for every token, thousands of times over the course of a single response, dispatching each piece of your input to a small subset of specialized sub-networks while the rest of the model sits idle. This is Mixture of Experts (MoE), and calling it an optimization technique fundamentally misunderstands what it is.
The core claim of this chapter is architectural and philosophical: MoE embodies a different theory of model capacity than the dense transformer. Where the dense model says every problem should be processed by the same universal machinery, MoE says different problems require different competencies, and spending full-model capacity on every token is incoherent. That shift—from uniformity to specialization, from full activation to conditional routing—has consequences extending far beyond performance benchmarks. It changes how you think about inference cost, how you interpret capability claims, and how regulatory frameworks based on compute metrics are systematically failing to account for the architecture of the systems they attempt to govern.
The strongest objection to this framing is that MoE is still a transformer, still trained by gradient descent on the same objectives, and still producing the same kinds of outputs—so calling it a "different theory" is overreach. That objection deserves a direct answer, and this chapter addresses it. First, though, you need to understand what is happening inside these models when they run, because the mechanism makes the philosophy concrete.
Dense Transformers and the Cost of Uniformity
To understand why MoE exists, you need to feel the weight of the alternative. A standard dense transformer—the architecture underlying GPT-3, the original LLaMA (Large Language Model Meta AI), Mistral 7B, and most of the open models that proliferated from 2020 to 2023—applies every parameter in every layer to every token in every forward pass. No exceptions. The token representing the word "the" at the start of a legal contract activates the same 7 billion, or 70 billion, or 175 billion parameters as the token representing a differential equation in a physics derivation. The model has no mechanism to say: this is a syntactically simple token; I need less machinery here.
This uniformity has a certain mathematical elegance. The model's representational capacity is always fully engaged, and gradient signal flows through all parameters on every step. Training is predictable. Inference is predictable. Memory is predictable. But uniformity becomes a liability at scale. The Chinchilla paper—a landmark 2022 DeepMind study on optimal compute allocation—established that the optimal strategy for a given compute budget involves training smaller models on more data rather than larger models on less. Dense scaling is expensive in part because adding parameters to a dense model adds them everywhere, to all paths, for all tokens, regardless of whether that capacity is ever needed. You cannot concentrate capacity on the hard cases.
The practical consequence is a brutal tradeoff at the scale of frontier systems. The real challenge at scale is inference: the goal is decoupling training compute from inference compute. A dense model with 1 trillion parameters costs approximately 2 trillion FLOPs (floating-point operations, the standard unit of AI compute) per token generated—because inference in a dense transformer requires roughly two floating-point operations (a multiply and an add) per parameter per forward pass. Run that model at OpenAI's scale, serving hundreds of millions of users, and the inference cost overwhelms any business model that isn't charging enterprise prices. A trillion-parameter dense model cannot achieve adequate inference throughput on even the newest H100 GPU servers due to memory bandwidth requirements. The model's parameters must be loaded from GPU memory for every single token generated, and the bandwidth of that memory transfer is the binding constraint, not the raw compute. Dense scaling hits a wall that isn't about training—it's about serving.
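To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per generated token, fp16 weights at 2 bytes per parameter, and an illustrative H100-class memory bandwidth figure; none of these numbers are vendor specifications, and the point is the shape of the constraint, not the exact throughput.

```python
# Back-of-the-envelope inference cost for a hypothetical dense 1T-parameter model.
# Rule of thumb: ~2 FLOPs per parameter per generated token; fp16 = 2 bytes/parameter.

params_dense = 1.0e12               # total (and active) parameters in a dense model
flops_per_token = 2 * params_dense  # ~2e12 FLOPs per generated token

bytes_per_param = 2                 # fp16
weight_bytes = params_dense * bytes_per_param  # ~2 TB of weights to stream per token

# Illustrative H100-class HBM bandwidth (~3.35 TB/s per GPU, approximate figure).
hbm_bandwidth = 3.35e12             # bytes/second
gpus = 8
tokens_per_sec_bandwidth_bound = (hbm_bandwidth * gpus) / weight_bytes

print(f"FLOPs per token:        {flops_per_token:.2e}")
print(f"Weight bytes to stream: {weight_bytes:.2e}")
print(f"Bandwidth-bound decode: ~{tokens_per_sec_bandwidth_bound:.0f} tokens/s at batch size 1")
```

Even with eight accelerators streaming weights in parallel, a batch-one decode rate in the low tens of tokens per second is nowhere near the throughput needed to serve a consumer product economically.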
MoE is an answer to this wall. If you can route each token to only a small subset of parameters, and if that routing is intelligent enough that the right parameters handle the right tokens, then you can have a model with enormous total parameter count—and therefore enormous capacity for rare or difficult inputs—while spending only a fraction of that total on any given token. The active parameter count, not the total parameter count, governs inference cost. This is the central accounting identity of MoE.
How Routing Works: The Gating Network's Decision
Inside a standard transformer, each layer contains two main components: a self-attention mechanism that allows tokens to communicate with each other, and a feed-forward network (FFN) that processes each token independently after the attention step. The MoE transformation leaves attention largely unchanged. It replaces the single FFN in each layer with a set of N FFN "experts" plus a learned routing function, called a gating network, that decides which experts to activate for each token.
The gating network is conceptually simple but consequential in practice. For every token, at each layer, a router selects two experts to process the current hidden state and combines their outputs. In the standard top-k formulation—with k=2, as in Mixtral—the router takes the token's hidden state h as input and computes a score vector g = Softmax(W_r · h), where W_r is a learned projection matrix of shape [d_model × N_experts]. The router selects the two highest-scoring experts i₁ and i₂, routes the token to those experts, and produces the output y = g_{i₁} · FFN_{i₁}(h) + g_{i₂} · FFN_{i₂}(h), where g_{i₁} and g_{i₂} are the softmax weights of the two selected experts. In the classic formulation the softmax is computed over all N experts and only the top-2 scores survive into the output; Mixtral's implementation instead renormalizes the softmax over just the two selected logits so that the two weights sum to one. Either way, the remaining experts' scores are discarded along with their corresponding computations. The resulting combined representation continues to the next layer. The other six experts, in an eight-expert model, receive nothing from that token on that pass.
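A minimal sketch of that routing step, in PyTorch-style Python, may help. The `FFN` expert modules, tensor shapes, and variable names here are illustrative assumptions rather than Mixtral's actual implementation; the sketch follows the Mixtral-style convention of renormalizing the softmax over the two selected experts.

```python
import torch.nn.functional as F

def moe_layer(h, W_r, experts, k=2):
    """Route one token's hidden state h [d_model] to its top-k experts.

    W_r:     router projection, shape [d_model, n_experts]
    experts: list of n_experts feed-forward modules (each maps [d_model] -> [d_model])
    """
    logits = h @ W_r                         # one score per expert
    topk_logits, topk_idx = logits.topk(k)   # keep only the k best-scoring experts
    # Mixtral-style: renormalize over the selected experts so the weights sum to 1.
    weights = F.softmax(topk_logits, dim=-1)
    # Weighted combination of the selected experts' outputs; the other experts do no work.
    out = sum(w * experts[i](h) for w, i in zip(weights, topk_idx.tolist()))
    return out
```

In a real implementation the same logic is vectorized over whole batches of tokens, with the load-balancing machinery discussed later in this chapter layered on top.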
Even though each token only sees two experts, the selected experts can be different at each layer and position. Across a sequence of tokens, across a document, across a dialogue, different tokens route to different combinations of experts. The model develops a dynamic activation pattern where the computation graph is literally different for each token—not different in value, but different in which weights participate at all.
Within a Mixtral-style transformer block, MoE simply replaces the single feed-forward layer with a set of N feed-forward layers and adds a router in front of them. The attention mechanism, layer normalization, and embeddings are all shared and dense. Only the FFN component is sparsified. MoE is therefore not a radical departure from the transformer; it is a targeted modification at precisely the component most responsible for storing factual knowledge and performing position-wise transformations. Attention handles routing information across the sequence. The FFN in a dense model is where the model retrieves and applies learned patterns. MoE turns that single monolithic retrieval into a conditional lookup: given this token in this context, which specialized retrieval mechanisms are most relevant?
The k=2 choice is not arbitrary. Using a single expert (k=1), as in Google's Switch Transformer, gives maximum sparsity but sacrifices stability—the model becomes sensitive to routing errors because there is no fallback. Using k=2 provides redundancy while retaining most of the compute savings. Using k=4 or higher starts to converge back toward dense behavior, where most experts are active for most tokens. The k value is where the core tradeoff—scaling model capacity linearly while scaling computation sub-linearly—gets set.
An alternative to top-k routing worth understanding is Expert Choice routing, introduced by Zhou et al. (2022), which inverts the routing direction: instead of each token selecting its top-k experts, each expert selects its top-k tokens from the batch. This eliminates the load imbalance problem at its source—every expert processes exactly the same number of tokens by construction—but introduces a different complication: some tokens may be selected by zero experts and receive no FFN processing in that layer. Expert Choice routing is therefore better suited to encoder architectures where dropped tokens can be tolerated, and is less natural for autoregressive generation where every token must produce an output at every layer. The choice between token-chooses-expert and expert-chooses-token is one of the first architectural decisions an MoE designer must make, and it shapes every downstream training and serving consideration.
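The contrast between the two routing directions can be shown in a few lines, under the same assumed shapes as the sketch above; real Expert Choice implementations add capacity and masking bookkeeping that this omits.

```python
import torch

def token_chooses_expert(scores, k=2):
    """scores: [n_tokens, n_experts] router logits.
    Each token picks its top-k experts; expert load can be arbitrarily unbalanced."""
    return scores.topk(k, dim=-1).indices        # [n_tokens, k] expert ids chosen per token

def expert_chooses_token(scores, capacity=4):
    """Each expert picks its top-`capacity` tokens; load is balanced by construction,
    but a token picked by no expert gets no FFN update at this layer."""
    return scores.topk(capacity, dim=0).indices  # [capacity, n_experts] token ids per expert

scores = torch.randn(16, 8)                      # 16 tokens, 8 experts (illustrative)
print(token_chooses_expert(scores).shape)        # torch.Size([16, 2])
print(expert_chooses_token(scores).shape)        # torch.Size([4, 8])
```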
What Experts Actually Learn: The Specialization Question
The intuitive appeal of MoE is the story of expert specialization: one expert handles code, another handles mathematics, another handles languages, and the router has learned to dispatch tokens to the appropriate specialist. This story is clean and compelling. It is also incomplete in ways that matter for understanding what these models are doing.
The empirical picture is more granular and more interesting than the domain-specialist narrative. While some work suggests experts specialize in broad domains, other work suggests experts respond primarily to token-level or syntactic features. Both views are incomplete. Experts often behave as fine-grained task specialists. An expert may be domain-restricted—say, LaTeX—but its role is better described as a concrete computational operation, such as closing brackets in LaTeX, rather than representing the domain as a whole. The unit of specialization is not "biology expert" or "code expert." It is something more like "expert at performing this class of transformation on this class of token pattern."
This distinction changes the model of what routing is doing. The router is not running a domain classifier and forwarding tokens to the relevant subject-matter expert. It is running a pattern-matching operation against the hidden state—a continuous vector encoding not just the token's identity but its position in the sentence, its grammatical role, its semantic context—and selecting the processing pathways most suited to the transformation that token needs to undergo in this context. Empirical analysis across models and layers suggests that early experts bind morphology, mid-layer experts stabilize syntax, and deeper experts retrieve domain knowledge. The specialization is hierarchical and functional, not categorical.
The Mixtral paper examined this directly, studying expert selection patterns across different subsets of The Pile validation dataset (a large open-source corpus used for training and evaluating language models). The results were telling: at the first and last layers of the model, routing distributions across domains were relatively similar—not flat, but not cleanly partitioned. Domain differentiation was more pronounced in middle layers, and even there it was statistical rather than exclusive. No expert was monopolized by any single domain. What the analysis did reveal, however, was consistent positional regularity: tokens at syntactically predictable positions—sentence-initial tokens, punctuation, closing delimiters—routed to a narrow, stable set of experts across domains, while semantically loaded tokens in the middle of phrases showed the most routing diversity. This suggests the router has learned that structural positions require consistent processing regardless of topic, while content words require domain-sensitive processing.
MoE experts' neurons are consistently less polysemantic than those of dense FFNs, and monosemanticity increases as the degree of sparse routing increases. Dense FFN neurons are famously polysemantic—they respond to multiple unrelated concepts, because the model needs to pack many features into a limited parameter budget. MoE's enforced specialization produces neurons that are less overloaded. The routing pressure—the need to route each token to its two best experts rather than always using the same weights—induces a division of computational labor that results in cleaner internal representations. This is one mechanism by which MoE achieves quality above what its active parameter count would predict: the parameters it activates are doing more focused, less conflicted work.
For organizations thinking about AI audit and interpretability, this is an architecturally significant property. MoE models have a natural decomposition that dense models lack—the expert routing structure provides a handle on which components of the model were responsible for a given output, offering a clearer path toward large-scale model interpretability.
Parameter Accounting: The Number That Matters
Mixtral 8x7B is the canonical public case study for MoE parameter accounting, and the numbers are worth internalizing precisely. The name tells you the shape: 8 experts, each with 7 billion parameters. But it is not 56 billion parameters total, because the attention layers and embedding layers are shared across all experts. The total is approximately 47 billion parameters. Each token has access to 47B parameters but uses only 13B active parameters during inference.
That gap—47B total versus 13B active—is the central fact of MoE economics. Mixtral outperforms or matches Llama 2 70B on almost all popular benchmarks while using 5x fewer active parameters during inference. With only 19% of the FLOPs needed per token (13B versus 70B), this translates to up to 5x faster training or inference times, excluding routing overhead. The gap between these numbers—performance parity with 70B, cost profile closer to 13B—is the entire economic case for MoE at the frontier.
The follow-on model, Mixtral 8x22B, extends this logic further: 141B total parameters, but only 39B active. The active parameter count governs latency and throughput; the total parameter count governs memory—every expert must be loaded into GPU memory to be available for routing, even if it processes no tokens on a given request. Memory requirements scale with total parameters; compute requirements scale with active parameters. The two costs decouple. You need large, expensive GPU clusters to host the model—but once hosted, inference is fast. A concrete illustration: running Mixtral 8x22B in fp16 requires loading all 141B parameters into GPU memory, demanding roughly 280GB of GPU VRAM, which requires at minimum four A100-80GB GPUs just to hold the weights. But once loaded, generating each token costs roughly the same FLOPs as a dedicated 39B dense model—meaning throughput per dollar on hosted inference is competitive with models less than a third its total parameter size.
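The accounting in the last few paragraphs can be reproduced in a few lines. The parameter counts are the approximate public figures quoted above, and fp16 at 2 bytes per parameter is an assumption.

```python
# MoE parameter accounting: total parameters govern memory, active parameters govern compute.
mixtral_8x7b  = {"total": 47e9,  "active": 13e9}
mixtral_8x22b = {"total": 141e9, "active": 39e9}
llama2_70b    = {"total": 70e9,  "active": 70e9}   # dense: total == active

bytes_per_param = 2  # fp16

for name, m in [("Mixtral 8x7B", mixtral_8x7b),
                ("Mixtral 8x22B", mixtral_8x22b),
                ("Llama 2 70B", llama2_70b)]:
    vram_gb = m["total"] * bytes_per_param / 1e9
    rel_flops = m["active"] / llama2_70b["active"]   # per-token compute vs. a dense 70B
    print(f"{name:14s} weights ~{vram_gb:4.0f} GB | per-token compute ~{rel_flops:.0%} of dense 70B")
```

The output makes the decoupling visible: Mixtral 8x22B needs roughly four times the memory of a dense 70B model but only a bit more than half its per-token compute, while Mixtral 8x7B sits below 20%.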
For anyone managing AI infrastructure or vendor contracts, this is the accounting to internalize. When a vendor quotes inference pricing, the relevant technical basis is active parameters and the resulting FLOPs per token—not total parameters, which is a marketing metric. When a benchmark report shows Mixtral 8x22B achieving strong MMLU (Massive Multitask Language Understanding, a standard AI benchmark) scores, the relevant cost comparison is against models with roughly 39B active parameters, not models with 141B total parameters.
The implications sharpen further when you consider the GPT-4 case, which is architecturally unconfirmed but consistent with what has been reported. GPT-4 reportedly has approximately 1.8 trillion parameters across 120 layers, over 10 times larger than GPT-3. OpenAI reportedly uses 16 experts within the model, each with approximately 111B parameters for the MLP (multi-layer perceptron, the feed-forward sub-network) component, with two experts routed per forward pass. These figures come from SemiAnalysis's widely-cited architectural reconstruction—a technical newsletter that reverse-engineers AI infrastructure—not from OpenAI directly; OpenAI has not publicly commented on GPT-4's technical specifications, but the model is widely believed to deploy a Mixture of Experts architecture.
In each forward propagation, GPT-4 reportedly uses only about 280 billion parameters and 560 TFLOPs (teraFLOPs, meaning trillions of floating-point operations), whereas a pure dense equivalent would require approximately 3,700 TFLOPs. The active/total split enables OpenAI to serve a model with enormous total capacity at per-token costs that would be impossible with a dense architecture of equivalent knowledge.
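Treating per-token inference compute as proportional to active parameters (a standard approximation), the reported figures imply the following active fraction, roughly consistent with the 560 versus 3,700 TFLOPs numbers quoted above. All GPT-4 figures here are unconfirmed reporting, not official specifications.

```python
# Ratio of MoE per-token compute to a dense model of the same total size,
# using the (unconfirmed) reported GPT-4 figures from the text.
total_params  = 1.8e12   # reported total parameters
active_params = 280e9    # reported active parameters per forward pass (2 of 16 experts)

active_fraction = active_params / total_params
dense_cost_multiple = 1 / active_fraction

print(f"Active fraction per token:     {active_fraction:.1%}")        # ~15.6%
print(f"Dense equivalent would cost: ~{dense_cost_multiple:.1f}x more per token")
```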
The inference routing problem introduces its own complications at GPT-4's scale. With 16 experts and 2 active per token, a batch of 8 tokens spreads unevenly across the experts: one expert may receive all 8 tokens while others receive 4, 1, or none, so the expensive read of an expert's parameters from memory may be amortized over only a single token, or wasted on an expert that processes nothing. Load imbalance at inference time—where some experts are hot and others cold—creates throughput variability that requires careful infrastructure management. OpenAI's choice of 16 experts rather than a larger number was reportedly deliberate: configurations with more experts generalize less well across tasks and are harder to train to convergence, so 16 represents a conservative balance.
Expert Collapse and the Load Balancing Problem
There is a failure mode lurking in MoE's elegance that, if unaddressed, makes the architecture nearly useless. It is called expert collapse, and understanding it clarifies why MoE systems require careful training dynamics that dense models do not.
The mechanism is simple. The router is a learned function trained end-to-end with the rest of the model. If one expert happens to be slightly better than others early in training—by chance, by initialization—the router will route more tokens to it. That expert then receives more gradient signal, improves faster, becomes even better relative to the others, and draws even more routing attention. The other experts, starved of signal, fail to develop. The model converges to a state where one or two experts process nearly all tokens while the rest are essentially unused. You have paid the memory cost of N experts and received the compute benefits of one.
The routing mechanism, trained end-to-end with the task objective, develops a strong bias toward high-performing experts, creating a feedback loop that exacerbates load imbalance. The standard solution is an auxiliary loss added during training—a term in the loss function that penalizes routing imbalance. The auxiliary-loss idea goes back to Shazeer et al.'s original 2017 Sparsely-Gated MoE paper; the most common formulation, from Fedus et al.'s 2022 Switch Transformer, penalizes the product of the fraction of tokens dispatched to each expert and the mean routing probability assigned to each expert, summed across experts. Formally, the load balancing loss takes the form L_balance = α · N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i, Pᵢ is the mean softmax probability assigned to expert i across the batch, N is the number of experts, and α is a scalar coefficient that controls the strength of the penalty. The Mixtral paper uses α = 0.02; the Switch Transformer paper found values between 0.01 and 0.05 to be effective across different model scales. Both are intentionally small relative to the primary cross-entropy loss—large enough to prevent collapse, small enough not to override the task objective.
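A minimal sketch of that balancing term, assuming a top-1 dispatch view for simplicity (with k=2, each token would contribute two dispatch counts); variable names are illustrative.

```python
import torch

def load_balancing_loss(router_probs, expert_indices, n_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs:   [n_tokens, n_experts] softmax over all experts, per token
    expert_indices: [n_tokens] expert each token was dispatched to (top-1 view)
    """
    # f_i: fraction of tokens actually dispatched to expert i
    f = torch.bincount(expert_indices, minlength=n_experts).float() / expert_indices.numel()
    # P_i: mean routing probability the router assigned to expert i across the batch
    P = router_probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * P)
```

Under perfectly uniform routing the term evaluates to α; routing that concentrates both probability mass and dispatches on a few experts pushes it higher, which is exactly the corrective pressure described above.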
But this creates a fundamental contradiction: while the core principle of MoE emphasizes specialized experts through conditional computation, load balancing forces indiscriminate token distribution, inevitably undermining expert specialization.
This is the real tension at the heart of MoE training. You want routers to develop genuine specialization—sending the right tokens to the right experts. Naive routing produces collapse. The auxiliary loss prevents collapse by pushing toward uniformity. Too much uniformity pressure undoes specialization. The practitioner's task is calibrating the auxiliary loss coefficient carefully: strong enough to prevent collapse, weak enough to allow specialization to develop.
The DeepSeekMoE architecture addressed this tension with a more fine-grained expert decomposition: it splits what would be standard experts into many smaller routed experts and isolates a few "shared experts" that are always active, with the routed experts selectively dispatched per token. Some computations are universally needed (handled by shared experts) while others are genuinely domain-specific (handled by routed experts). By separating these roles architecturally rather than fighting over them through a single routing mechanism, training becomes more stable and specialization cleaner. DeepSeek-V3, released in late 2024, extends this to 256 routed experts per layer with only 8 active, alongside 1 always-active shared expert per layer—a configuration that pushes sparsity far beyond Mixtral's 2-of-8 while relying on the shared expert mechanism to maintain the baseline coherence that prevents collapse at such extreme routing ratios.
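A structural sketch of the shared-plus-routed decomposition is shown below. The names, gating function, and shapes are illustrative simplifications, not DeepSeek's actual implementation, which uses its own gating and normalization details.

```python
import torch.nn.functional as F

def shared_plus_routed_layer(h, shared_experts, routed_experts, W_r, k=8):
    """Shared experts always run; routed experts are selected per token.

    h:              [d_model] hidden state for one token
    shared_experts: small list of always-active FFN modules
    routed_experts: larger list of selectively-routed FFN modules
    W_r:            router projection, shape [d_model, n_routed_experts]
    """
    out = sum(e(h) for e in shared_experts)       # always-active baseline computation
    logits = h @ W_r                              # score the routed experts only
    topk_logits, topk_idx = logits.topk(k)
    weights = F.softmax(topk_logits, dim=-1)      # illustrative gating; DeepSeek differs in detail
    out = out + sum(w * routed_experts[i](h) for w, i in zip(weights, topk_idx.tolist()))
    return out
```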
Fine-tuning MoE models is harder than fine-tuning dense ones. MoE models are prone to overfitting on small datasets, partly because the routing patterns can shift when you update weights on limited data. For organizations considering fine-tuning a commercial MoE-based model for a specialized task, the routing dynamics mean you need more data, more careful regularization, and more evaluation to confirm that specialization has been preserved rather than disrupted.
Gemini 1.5 Pro and the Long-Context Case
The MoE compute tradeoffs become especially clear in the case of Gemini 1.5 Pro, where the architectural choice directly enabled a capability jump that would have been economically impossible with a dense architecture.
Gemini 1.5 Pro incorporates a MoE architecture alongside major advances in training and serving infrastructure, enabling a push at the boundary of efficiency, reasoning, and long-context performance. The long-context achievement is the headline: Gemini 1.5 Pro can recall and reason over fine-grained information from up to at least 10 million tokens, enabling the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days of audio.
Why does MoE matter for long context? Processing a one-million-token context with a dense transformer requires running every parameter in every layer across all one million token positions—memory and compute requirements scale directly with context length. With MoE, the active parameter count per token remains constant regardless of context length. The number of experts does not grow with the sequence; routing decisions are made token by token. Total model parameters can grow while the number of parameters activated per token stays constant. The economics of long-context inference become far more tractable: you pay the routing overhead once per token, and the per-token compute cost does not increase just because the sequence is long. The attention mechanism does still scale quadratically with sequence length in the naive case, which is why Gemini 1.5 also incorporates efficient attention approximations—but the FFN cost, which is the dominant parameter-count contributor in large models, remains flat per token regardless of context window, and that is precisely the component MoE sparsifies.
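A rough per-token cost comparison illustrates the split: expert (FFN) compute stays flat as the context grows, while naive attention does not, which is why long-context models pair MoE with efficient attention. The model dimensions below are illustrative and the constants are simplified.

```python
# Per-token cost as context grows: FFN/expert FLOPs are constant,
# naive attention FLOPs grow with the number of positions attended to.
d_model, n_layers = 8192, 60          # illustrative model dimensions
active_ffn_params = 39e9              # illustrative active expert parameters per token

for context_len in [8_000, 128_000, 1_000_000]:
    ffn = 2 * active_ffn_params                       # flat: independent of context length
    attn = 4 * n_layers * context_len * d_model       # naive attention cost per new token
    print(f"context {context_len:>9,}: FFN ~{ffn:.1e} FLOPs/token, attention ~{attn:.1e} FLOPs/token")
```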
Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks—greater than 99% up to at least 10 million tokens—a generational leap over models such as Claude 2.1 (200k token context) and GPT-4 Turbo (128k token context) available at the time.
Different positions in a long document may need different processing pathways. MoE accommodates this naturally; a dense model has no mechanism to allocate different computational resources to different parts of the context. The connection between MoE and context scaling points toward a design principle the frontier labs have internalized: when the challenge is maintaining quality across very long sequences rather than raw per-token reasoning, MoE provides the architectural flexibility to grow total model capacity without proportionally growing inference cost.
Google has been an early adopter and pioneer of MoE through research including the Sparsely-Gated MoE paper, the GShard distributed training framework, Switch Transformer, and M4 (a multilingual MoE model). Gemini 1.5's MoE architecture is a deployment of that research tradition at production scale, with long-context capability as proof of the approach's payoff.
The Governance Problem: When FLOP Thresholds Break Down
Understanding MoE architecture is not merely a technical exercise for system architects. It has direct implications for how AI policy instruments work—or fail to work.
The dominant governance mechanism for frontier AI models is the compute threshold, expressed in floating-point operations. Under the EU AI Act, general-purpose AI (GPAI) models are presumed to present systemic risks when the cumulative amount of compute used for training exceeds 10²⁵ FLOPs. President Biden's Executive Order 14110 specified 10²⁶ FLOPs as the training-compute threshold that triggers reporting obligations for potential dual-use foundation models. These thresholds use total training compute as a proxy for model capability—more compute implies more capability, implies more risk.
MoE breaks this proxy in both directions simultaneously.
Consider a hypothetical MoE model trained with 10²⁶ FLOPs. A dense model trained with the same compute budget would have dramatically fewer total parameters. The MoE model, because it activates only a fraction of its parameters per token, could be trained to process far more tokens for the same budget, developing broader knowledge and handling more domains. The emergence of models like DeepSeek R1 challenges the core assumption behind compute-based regulation: through architectural innovation, DeepSeek achieves performance comparable to much larger models while using significantly less compute, undermining the notion that FLOP thresholds reliably indicate a system's potential impact.
Now consider a dense model and an MoE model with equivalent active parameter counts per forward pass. The MoE model may have trained on the same or even fewer FLOPs—training cost scales with active parameters per token, not total parameters—but its total parameter count and total knowledge capacity are far larger. A regulation looking only at training FLOPs might classify the MoE model as no riskier than the dense model, even though it draws on a much larger knowledge base.
To make this concrete: DeepSeek-V3 reports approximately 2.79 × 10²⁴ FLOPs for its full pre-training run—well below both the EU AI Act's 10²⁵ threshold and the Biden Executive Order's 10²⁶ threshold—yet the model achieves benchmark performance competitive with GPT-4-class systems and has 671B total parameters with 37B active. Under both regulatory frameworks, DeepSeek-V3 would be classified as a lower-risk system than a hypothetical dense model trained with 10²⁵ FLOPs, even though the MoE model has roughly ten times more total parameters and competitive capability on the tasks those thresholds were designed to gatekeep. The classification error is not marginal—it is structural.
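For orientation, the training-compute figure quoted above can be compared directly against the two thresholds; the figure is as reported, not independently verified.

```python
# How the reported DeepSeek-V3 pre-training compute compares to the two FLOP thresholds.
reported_training_flops = 2.79e24     # figure quoted above for the full pre-training run
eu_ai_act_threshold     = 1e25        # EU AI Act systemic-risk presumption
eo_14110_threshold      = 1e26        # US Executive Order 14110 reporting threshold

print(f"Fraction of EU AI Act threshold: {reported_training_flops / eu_ai_act_threshold:.0%}")   # ~28%
print(f"Fraction of EO 14110 threshold:  {reported_training_flops / eo_14110_threshold:.1%}")    # ~2.8%
```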
Current governance frameworks refer to the amount of compute used during the final pre-training run and do not include post-training enhancements such as fine-tuning. Post-training RLHF (reinforcement learning from human feedback) and instruction tuning can dramatically alter the behavior of an MoE model—potentially activating specific expert pathways that were dormant or underweighted during pretraining. A model below a training-compute threshold could be post-trained into significantly different capability territory. The threshold-based system has no mechanism to detect this.
Model distillation—a process in which a larger teacher model trains a smaller student model while sacrificing only modest performance—compounds the problem further. MoE makes the distinction between "training compute" and "inference compute" actively misleading as a policy instrument. You can have a model with 10²⁶ training FLOPs that runs at 13B-active-parameter inference cost. You can have a model with 10²⁵ training FLOPs that has vastly more total parameters because of sparse training. The threshold framework conflates these entirely different situations.
The EU AI Act acknowledges this limitation, stating that the Commission shall adopt delegated acts to amend the compute thresholds in light of evolving technological developments such as algorithmic improvements or increased hardware efficiency. That is a provision for updating thresholds. It does not solve the architectural ambiguity. The question of which compute metric to measure—training FLOPs, active inference FLOPs, total parameter FLOPs, weighted by routing distribution—is not resolved by threshold adjustment. It requires a more sophisticated conceptual framework for what compute means in a conditional computation architecture.
For the CAIO (Chief AI Officer) or policy professional advising on AI governance, the practical implication is this: any regulatory instrument that uses a single compute number to characterize a model should be interrogated for whether that number is training compute, active inference compute, or total parameter compute. For dense models, these three quantities are closely correlated. For MoE models, they can differ by an order of magnitude. A system that looks small by active inference cost can have the knowledge base of a much larger system. A system that looks large by total parameters can be served at costs that make it economically trivial to deploy at massive scale. Governance frameworks that do not distinguish between these quantities will systematically misclassify MoE models—sometimes in the direction of over-regulation, more often in the direction of under-regulation.
The Design Philosophy and Its Permanence
The objection raised at the outset—that MoE is still just a transformer variant, that calling it a "different theory" is overreach—deserves its full answer here.
The objection is technically correct and strategically misleading. Yes, MoE models use self-attention. Yes, they are trained by gradient descent. Yes, they produce the same token prediction outputs as dense models. The design philosophy embedded in MoE's architecture, however—the deliberate choice to make capacity conditional on the input rather than universal—is not a detail. It is a statement about where intelligence comes from.
The dense transformer's implicit theory is that intelligence is a function of having sufficiently large universal machinery: build enough parameters, train them on enough data, and the emergent representations will handle any input. The MoE theory is different: intelligence requires having the right specialized machinery available and routing effectively to it. These are not the same claim. The empirical evidence increasingly favors the MoE view. Mixtral 8×7B outperforms dense models far above its inference budget—each token uses only roughly 13B parameters, yet Mixtral surpassed Meta's 70B dense LLaMA-2 on tasks such as math, code generation, and multilingual understanding. Routing to specialized sub-networks, even without discrete domain-level experts, produces better performance than uniform processing at equivalent cost.
That the specialization is functional rather than semantic—that experts are specialists in computational operations, not subject matter—does not weaken this claim. It strengthens it. The model has discovered, through training, that the right decomposition of language processing is not into "code" and "math" and "biology" but into finer-grained computational roles that cut across domains. The routing network has learned something real about the structure of language processing by exploiting it to allocate capacity efficiently.
The core principle—conditional computation, scaling model capacity linearly while scaling computation sub-linearly—is not going away. Every major frontier lab's architecture that has been reported or confirmed uses it. GPT-4, Gemini 1.5, Mixtral's entire family, DeepSeek-V3, and their successors all rely on the same fundamental insight: most tokens do not need the whole model, and a routing mechanism sophisticated enough to exploit this fact is worth the engineering complexity.
The implication you should carry forward is not merely that "MoE is efficient." The models you are governing, deploying, and building policy around are conditional computation systems whose effective capability depends on which experts are activated, which inputs trigger which routing patterns, and whether your evaluation scenarios are representative of the routing distributions that will arise in actual deployment. A model that scores at the 90th percentile on MMLU may be routing through one set of expert pathways; the same model processing code generation tasks in production may be routing through an entirely different set. Those pathways are specialized—even at the functional rather than semantic level—which means capability profiles are input-dependent in ways that single-number benchmarks do not capture.
When your organization's risk assessment asks "what can this model do?", the architecturally honest answer is: it depends which experts you're activating, and your benchmark suite may not be activating the ones that matter for your deployment. Designing evaluations that deliberately probe diverse input distributions—not just the canonical academic benchmarks—is, for MoE-based systems, not a methodological nicety but a technical necessity for any assessment that aspires to be accurate.