M11E1: Benchmarks, Saturation, and Evaluation Incommensurability

Module 11, Episode 1: Benchmarks, Saturation, and Evaluation Incommensurability

The Measurement Crisis as a Capability Signal

The benchmarks that shaped five years of AI investment decisions, regulatory proposals, and executive strategy are no longer measuring what matters. That is the blunt claim, and it has a precise corollary: the decision-makers who are still reading MMLU (Massive Multitask Language Understanding, a standard AI benchmark) and HumanEval scores as capability signals are systematically misreading the frontier in ways that produce wrong conclusions about deployment risk, competitive positioning, and the actual distance between current models and the tasks those models are being asked to perform.

Saturation itself is a capability signal, not a measurement failure. A benchmark saturates when frontier models all score above approximately 90%, making it unable to discriminate between them. That is a success story: the field has genuinely mastered the capability the benchmark was designed to measure. The confusion arises in what comes next: the continued citation of saturated benchmarks as if they still provided meaningful signal. When that confusion enters investment theses, regulatory impact assessments, and product roadmaps, the consequences are not abstract.

This confusion is structural, not incidental. The benchmarks that saturated were designed by researchers to measure what was hard for models in 2020 and 2021. They were adopted by journalists, analysts, and policymakers because they produced legible numbers. The legibility outlasted the validity. And the replacements — GPQA Diamond (Graduate-Level Google-Proof Q&A, a benchmark of expert-designed questions resistant to web retrieval), Humanity's Last Exam, SWE-bench (a software engineering benchmark that tests real GitHub issue resolution) — reveal a capability profile that is simultaneously more impressive and more limited than the saturated benchmarks suggested, in ways that matter enormously for anyone responsible for governing or deploying these systems.

What MMLU and HumanEval Actually Measured, and Why They Stopped Working

MMLU, introduced by Hendrycks et al. in 2020, was a genuine contribution. It assembled 57 academic subjects into a 16,000-question multiple-choice battery spanning elementary mathematics, professional law, medical licensing content, and everything between. For 2020 and 2021 models, it was hard. GPT-3 didn't reliably beat random chance on the harder subsets. The benchmark provided meaningful gradient: it could distinguish models that had developed broad knowledge representations from those that hadn't.

The gradient disappeared. By the time frontier discourse matured in 2023, GPT-4 had scored 86.4%. GPT-4o scored 88.7% in 2024, and GPT-4.1 scored 90.2% in 2025. In the last year, AI companies have stopped reporting MMLU scores — presumably because scores have stopped improving. The ceiling is not 100%. Systematic manual review of 5,700 questions revealed a 6.49% error rate, with categories including parsing mistakes, multiple correct answers, and missing context, which means the hard ceiling sits at roughly 93–94%, baked in by flawed questions in the dataset itself.

The implications of that error rate compound when you understand what it means for model comparison. Since 2024, frontier models have all scored between 88% and 93% — a range narrow enough that differences could be random noise. According to IBM Research's NeurIPS 2024 paper, MMLU scores shift by 4–5 percentage points under prompt variations alone. When the score spread between leading models is 2–4 percentage points and prompt sensitivity can swing results by 4–5 points, the comparison is measuring methodological noise, not capability. GPT-4o has shown a 13 percentage point spread in reported MMLU-Pro scores across different measurement sources, which says less about the model than about how little any single reported score means when the reporting methodology isn't controlled.

HumanEval, the coding benchmark from OpenAI's 2021 Codex paper, followed an identical trajectory. It consists of 164 Python programming tasks, each graded by running unit tests against generated code. When introduced, it established that models could generate functionally correct code at all — a non-trivial result at the time. Both MMLU and HumanEval saturated in 2024. The saturation on HumanEval was arguably more corrosive because code generation benchmarks had become the primary currency for enterprise AI purchasing decisions. A CISO (Chief Information Security Officer) reading that Model A scored 87% on HumanEval and Model B scored 84% was reading, effectively, nothing. Compounding this, HumanEval's 164-task dataset is small enough that individual problem variance dominates statistical comparison — a model's score can shift by 2–3 percentage points based solely on which random seed initializes the generation process, a fact rarely disclosed in vendor reporting.
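To put a number on that noise floor, treat a benchmark score as a binomial proportion and compute the sampling error that dataset size alone contributes. The sketch below uses illustrative accuracy values and is not tied to any particular leaderboard's methodology:

```python
import math

def sampling_stderr(accuracy: float, n_items: int) -> float:
    """Standard error of a benchmark score modeled as a binomial proportion."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

# Illustrative values: a ~90% score on MMLU's ~14,000 test questions
# versus a ~85% score on HumanEval's 164 tasks.
print(f"MMLU-scale:      +/- {100 * sampling_stderr(0.90, 14_000):.2f} points")
print(f"HumanEval-scale: +/- {100 * sampling_stderr(0.85, 164):.2f} points")
```

On 164 tasks the sampling term alone is close to plus-or-minus three points, which is why single-run HumanEval comparisons between models a point or two apart carry essentially no information. On MMLU the sampling term is roughly a quarter of a point, so the dominant uncertainty is the 4–5 point prompt-sensitivity spread rather than dataset size.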

The replacement of MMLU with MMLU-Pro — which expands from four answer options to ten, adds chain-of-thought requirements, and draws from graduate-level content across 12,000 questions — bought some time. Recent leaderboard results show top models nearing 90% accuracy on MMLU-Pro (Google's Gemini 3 Pro at approximately 90.1%, Anthropic's Claude Opus 4.5 with reasoning at approximately 89.5%, DeepSeek-V3.2 at approximately 85.0%), suggesting the benchmark itself is approaching saturation for frontier models. The same lifecycle is playing out, compressed. Even though chain-of-thought reasoning has significantly improved LLM performance, these systems still cannot reliably solve problems for which provably correct solutions can be found using logical reasoning — a gap with direct implications for high-risk deployment contexts.

The deeper problem with MMLU was never just difficulty. Academic knowledge recall, even at graduate level, is not the capability profile that matters for AI deployment decisions: complex reasoning under novel conditions, sustained multi-step problem-solving, and the ability to act correctly in contexts that don't resemble anything in training data. The new generation of benchmarks was designed to get at those things directly.

The Hard Benchmarks: What GPQA Diamond and HLE Reveal

GPQA Diamond was introduced in November 2023 by David Rein and colleagues at NYU and Anthropic. The design philosophy was explicitly adversarial against the kind of benchmark gaming that had rendered MMLU meaningless. GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Experts who hold or are pursuing PhDs in the relevant domains reach 65% accuracy (74% when discounting clear mistakes the experts themselves identified in retrospect), while highly skilled non-expert validators reach only 34%, despite spending on average over 30 minutes with unrestricted web access. The questions are designed to be Google-proof.

That design decision — making questions resistant to retrieval rather than just difficult — was the key structural innovation. A model that memorized everything in its training corpus still can't search its way to the answer on a GPQA Diamond question, because the question was specifically built to resist that strategy. The Diamond subset sharpens this further: it contains 198 questions for which both domain expert annotators got the correct answers, but which the majority of non-domain experts answered incorrectly. A concrete example of the design philosophy: rather than asking which enzyme catalyzes a known reaction — answerable by retrieval — GPQA Diamond questions present novel experimental setups where correct reasoning requires synthesizing multiple mechanistic principles that don't appear in combination anywhere in published literature.

When OpenAI's o1 was evaluated against this benchmark in September 2024, the result marked a genuine threshold crossing. OpenAI recruited PhD-level experts to answer questions in GPQA Diamond and found they scored 69.7%. O1 exceeded that — the first model to do so on this benchmark. O3 subsequently reached 87.7% on GPQA Diamond, and the progression continued. By early 2026, Google Gemini 3.1 Pro Preview pushed the frontier to approximately 94.1%, with GPT-5.2 at approximately 92.4%, Gemini 3 Pro at approximately 91.9%, and Claude Opus 4.6 at approximately 91.3%, all clustered closely.

Two things are simultaneously true about those numbers. First, they represent extraordinary capability: human PhD experts score around 69.7% on GPQA Diamond, and scores above 90% indicate superhuman scientific reasoning on these structured problems. Any organization deploying AI in scientific research, drug discovery, or advanced engineering should treat this as a real capability signal. Second, the benchmark is already running into the same compression problem that killed MMLU. GPQA Diamond had substantial headroom in 2024; the leading models now cluster within roughly three points of one another above 91%, a spread narrower than the methodological noise discussed earlier. At current rates, whatever discriminating power remains will be gone within 18–24 months.

Humanity's Last Exam, or HLE, was introduced in January 2025 by the Center for AI Safety and Scale AI as an explicit response to the saturation dynamic. HLE is a multi-modal benchmark at the frontier of human knowledge, consisting of 2,500 questions across dozens of subjects including mathematics, humanities, and the natural sciences. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered via internet retrieval. During curation, submitted questions were first filtered by frontier LLMs — only questions that stumped those models advanced through two subsequent rounds of human expert review before final inclusion. Over 70,000 submissions were received; 13,000 passed the difficulty bar and reached human review. The result is a benchmark where failure is the expected baseline for frontier models.

HLE launched with a top system score of just 8.80%. Scores have risen dramatically since: as of the most recent leaderboard data, Gemini 3.1 Pro Preview leads at 44.7%, followed by GPT-5.5 at 44.3%. The published HLE paper in Nature noted that given the rapid pace of AI development, models could plausibly exceed 50% accuracy by the end of 2025. The subject distribution of HLE failures is itself informative: models perform weakest on questions requiring multi-step quantitative reasoning embedded in domain-specific context — for instance, questions in theoretical physics that require both correct physical intuition and non-trivial mathematical derivation — rather than on pure recall or single-domain questions, which points to a specific architectural limitation rather than a general knowledge gap.

Full saturation of HLE would mean models can answer structured expert-knowledge questions at expert level. It would not mean models can do science. The gap between those two things — between Q&A performance and autonomous research capability — is precisely the gap the benchmarks cannot currently measure, and that gap is where most of the consequential deployment decisions are being made.

SWE-bench: When Evaluation Becomes Agentic

SWE-bench, introduced at ICLR 2024 (the International Conference on Learning Representations) by Carlos Jimenez, John Yang, and colleagues at Princeton and the University of Chicago, represented a qualitative shift in benchmark design philosophy. Rather than asking models questions — even extremely hard ones — it asked them to do a job. Models are tasked to resolve issues, typically a bug report or feature request, submitted to popular GitHub repositories. Each task requires generating a patch describing changes to apply to the existing codebase. The revised codebase is then evaluated using the repository's testing framework.

The original dataset contains 2,294 tasks drawn from 12 open-source Python repositories including Django, scikit-learn, Flask, and Sphinx. The evaluation criterion is binary and unambiguous: does the generated patch make the failing tests pass without breaking the passing tests? Either the model fixed the bug or it didn't.
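The pass criterion itself is simple enough to sketch. The snippet below is a simplified illustration of the grading rule just described; the real harness applies the patch and runs the repository's own test suite in a containerized environment, and the function and field names here are illustrative rather than the benchmark's actual API:

```python
def resolves_issue(fail_to_pass: dict[str, bool],
                   pass_to_pass: dict[str, bool]) -> bool:
    """A patch resolves a SWE-bench task only if every previously failing test
    now passes (the bug is fixed) and every previously passing test still
    passes (nothing else broke)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())
```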

When SWE-bench launched, the best systems solved around 4% of tasks. By late 2024, that number had climbed past 50%, prompting two important follow-ups: SWE-bench Verified and SWE-bench Pro. SWE-bench Verified, released by OpenAI in August 2024, is a human-validated section in which each task has been carefully reviewed and validated by human experts, producing a curated set of 500 high-quality test cases from the original benchmark.

Scores on SWE-bench Verified have continued their steep ascent, with current leaderboard results showing 80–93% for top systems. Those numbers require careful parsing, because the contamination and methodology problems that plagued MMLU are present here too — in different form. The benchmark is composed of scrapes of public GitHub repositories, meaning large foundation models pre-trained on internet text are likely contaminated on the tasks. Research has found that a disproportionate fraction of issues were created well before LLM knowledge cutoffs, and direct solution leakage — copying solution code from issue text or comments — persists in 30% or more of successful passes without further filtering. Diagnostic subtasks demonstrate that LLMs can achieve up to 76% accuracy by memorization, not reasoning.

OpenAI found that 59.4% of the 138 problems in an internal sample contained material issues in test design or problem description, rendering them extremely difficult or impossible even for the most capable model or human to solve. That finding led OpenAI to formally announce it had stopped evaluating against SWE-bench Verified as a frontier signal — a leading AI company publicly walking away from the benchmark it had helped validate eighteen months earlier.

The response has been SWE-bench Pro, developed by Scale AI. Frontier models were run on SWE-bench Pro using the SWE-Agent scaffold, and the results were stark: a significant drop in performance for all models when moving from SWE-bench Verified to the more challenging SWE-bench Pro. While most top models score over 70% on the verified version, the best-performing models — OpenAI GPT-5 and Claude Opus 4.1 — score only 23.3% and 23.1% respectively on SWE-bench Pro. The contamination resistance comes partly from deliberate curation strategy: SWE-bench Pro includes a commercial set using 18 repositories, reducing contamination risks by leveraging both legal protections and restricted data access — the first systematic application of such methodology in the research community.

The private subset of SWE-bench Pro reveals a further drop: Claude Opus 4.1 decreases from 22.7% to 17.8% resolution, and OpenAI GPT-5 falls from 23.1% to 14.9%. Evaluation on private, previously unseen codebases provides a more realistic measure of generalization.

That collapse — from 70%-plus on Verified to 14.9% on private, unseen code — is not a minor adjustment. A substantial portion of what looked like software engineering capability was sophisticated pattern-matching on previously encountered code. The "71% on SWE-bench" claim that circulated through enterprise sales decks in 2025 was measuring something categorically different from what enterprise buyers thought they were buying. SWE-bench Pro also introduced longer-horizon tasks — issues requiring coordinated changes across multiple files and modules rather than single-function patches — and the performance gap between these multi-file tasks and single-file tasks is itself a useful signal: current models' resolution rates on multi-file issues run roughly 40% lower than on comparable single-file issues, indicating that context management across a large codebase remains a genuine bottleneck even for the strongest systems.

ARC-AGI and the Test-Time Compute Problem

The most technically thorny issue in current benchmark interpretation is not saturation or contamination. It's incommensurability. Two models can report scores on the same benchmark that are not, in any meaningful sense, comparable numbers. Understanding why requires a precise account of test-time compute.

Standard benchmark evaluation assumes each model generates one answer per question. This is called pass@1: the fraction of problems solved on a single attempt. Contemporary reasoning models don't work that way. They generate extended chains of reasoning, sometimes exploring multiple solution paths, using considerably more compute per token of final output than standard generation. Pass@k — the probability of getting at least one correct answer in k samples — captures a different quantity entirely: how likely the model is to find the answer given multiple tries. Best-of-N sampling, where a model generates N candidate solutions and selects the best, and majority voting, where the most common answer across many samples is taken as the final answer, are inference-time strategies that can dramatically improve accuracy at proportionally increased cost.
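The distinction is easiest to see in the estimator itself. The standard unbiased pass@k estimate, introduced alongside HumanEval in the Codex paper, generates n ≥ k samples per problem, counts the c that pass the tests, and computes the probability that a random subset of k samples contains at least one pass. A minimal version:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable form."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 40 of them correct.
print(round(pass_at_k(200, 40, 1), 3))   # ~0.20  (the raw per-sample rate)
print(round(pass_at_k(200, 40, 10), 3))  # ~0.90  (chance that at least one of 10 tries passes)
```

The same model, the same problems, and the same test suite yield 20% or 90% depending on which k is reported, which is exactly why a score without its inference strategy attached is not interpretable.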

Test-time compute dramatically improved performance on mathematical reasoning: o1 scored 74.4% on an International Mathematical Olympiad qualifying exam, compared to GPT-4o's 9.3%. O1 is nearly six times more expensive and 30 times slower than GPT-4o. That is not a minor implementation detail. A pass@1 score reported by o1 and a pass@1 score reported by GPT-4o represent fundamentally different quantities — different amounts of computation, different inference strategies, different cost structures — collapsed into a single number then reported as if they sit on the same axis.

ARC-AGI (the Abstraction and Reasoning Corpus for Artificial General Intelligence), designed by François Chollet, was specifically constructed to resist compute-scaling gaming. The benchmark consists of novel visual reasoning tasks: colored grid patterns that follow rules the model must infer from a small number of examples, then apply to a new input. The tasks are easy for humans — average adults score around 85% — and require exactly the kind of novel rule induction that memorization cannot help with. The priors built into each task's design are deliberately minimal and universal: the rules governing each puzzle are drawn from a constrained set of basic concepts — object persistence, symmetry, counting, simple transformations — that any adult human implicitly possesses, ensuring that benchmark performance reflects reasoning rather than domain knowledge.
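For readers who have not seen the format: an ARC task is a handful of demonstration input/output grids plus a test input, and the solver must induce the transformation and apply it. The toy example below is constructed purely for illustration (it is not an actual ARC item, and real tasks use rules far less trivial than a mirror):

```python
# Grids are small 2-D arrays of color indices (0-9). The solver sees the
# "train" pairs, infers the rule, and must produce the output for "test".
toy_task = {
    "train": [
        {"input":  [[1, 0, 0],
                    [2, 0, 0]],
         "output": [[0, 0, 1],
                    [0, 0, 2]]},
        {"input":  [[0, 3, 0],
                    [0, 0, 4]],
         "output": [[0, 3, 0],
                    [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0, 0],
                        [0, 6, 0]]}],
}

def apply_inferred_rule(grid: list[list[int]]) -> list[list[int]]:
    """The rule a person induces in seconds here: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

print(apply_inferred_rule(toy_task["test"][0]["input"]))  # [[0, 0, 5], [0, 6, 0]]
```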

The ARC-AGI-1 score that generated the most discussion about AGI in late 2024 and early 2025 demands precise interpretation. O3 achieved 87.5% on ARC-AGI-1, exceeding average human performance. O3 did not demonstrate general intelligence. This result came with high compute settings, applying extended test-time search to a benchmark explicitly designed to thwart such strategies.

The ARC Prize team's response was immediate: they released ARC-AGI-2. OpenAI's o3 scores 2.9% on ARC-AGI-2, compared to 60% for average humans. From 87.5% on ARC-AGI-1 to 2.9% on ARC-AGI-2: the same model family, the same reasoning architecture, a different set of novel rules to induce, and performance collapses by roughly 85 percentage points. That is the signature of a system that improved at ARC-AGI-1 specifically, not at the underlying capability ARC-AGI-1 was designed to measure.

One finding with frontier commercial models released in late 2025 — Gemini 3 and Claude Opus 4.5 — is that refinement loops at the application layer can meaningfully improve task reliability independent of the provider's own reasoning systems. A refinement loop wrapped around Gemini 3 Pro lifts ARC-AGI-2 performance from a 31% baseline at $0.81 per task to 54% at $31 per task. That 23-point improvement comes at roughly 38 times the per-task cost. The benchmark score went up; the compute cost exploded. Whether the deployment is useful depends entirely on the economics of the specific application. A score divorced from its compute context is a number without a denominator.
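The arithmetic behind "a number without a denominator" is worth doing explicitly. A short sketch using the figures above (prices and accuracies as quoted; a real deployment would also price latency and the cost of handling failures):

```python
baseline_acc, baseline_cost = 0.31, 0.81    # fraction of tasks solved, $ per attempt
refined_acc,  refined_cost  = 0.54, 31.00

# Expected spend per *solved* task under each configuration
print(f"baseline: ${baseline_cost / baseline_acc:.2f} per solved task")   # ~$2.61
print(f"refined:  ${refined_cost / refined_acc:.2f} per solved task")     # ~$57.41

# Marginal cost of each additional solved task bought by the refinement loop
marginal = (refined_cost - baseline_cost) / (refined_acc - baseline_acc)
print(f"marginal: ${marginal:.0f} per additional solved task")            # ~$131
```

Whether roughly $131 per additional solved task is a bargain or an absurdity depends entirely on what a solved task is worth in the application, which is the sense in which the score and its denominator have to travel together.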

The incommensurability problem now pervades the leaderboards. When o3 at high-compute settings and a standard inference model are both reported by pass@1 on the same benchmark, you are looking at costs that may differ by two orders of magnitude, inference times that may differ by 30-fold, and results that are simply not combinable into a ranking. Anyone reading a benchmark table today without asking "what inference strategy, at what compute cost, under what temperature setting" is reading a table missing its most important columns.

Better Benchmarks Alone Will Not Resolve This

The natural objection to everything argued above is that the remedy is obvious: build harder benchmarks, build them faster, and the measurement problem resolves itself. The field is clearly trying. HLE was designed to last. GPQA Diamond was designed to be Google-proof. SWE-bench Pro was designed to be contamination-resistant. Doesn't the emergence of these better benchmarks vindicate the process?

The objection fails for two reasons, one structural and one strategic.

BIG-Bench Hard (a benchmark designed in 2022 specifically to challenge frontier models on tasks requiring multi-step reasoning) was hard for frontier models when introduced. By April 2026, state-of-the-art performance reaches 94.3%, past the threshold this episode treats as saturation. GPQA Diamond will follow. HLE's creators built HLE-Rolling — a dynamically updated fork — precisely to give researchers a migration path once frontier models begin hitting the noise ceiling on the original HLE dataset. Every benchmark has an expected useful lifespan, and that lifespan is shrinking. The process of creating hard benchmarks and watching them saturate is not a solution; it is the phenomenon that needs explaining. The shrinking lifespan itself carries information: the gap between a benchmark's introduction and its saturation compressed from roughly three years (MMLU) to roughly eighteen months (BIG-Bench Hard), and is projected at twelve months or fewer for the current generation. The rate of capability advancement is itself a variable that benchmark-dependent governance frameworks are not designed to accommodate.

The strategic reason cuts deeper. All of these benchmarks, however hard, share a fundamental limitation: they are academic in structure. They ask questions that have single correct answers, graded automatically, on domains researchers can carefully scope. The tasks that matter most in actual AI deployment — sustaining a complex multi-step research workflow across an eight-hour session, writing production code that passes review from senior engineers without modification, providing accurate legal analysis that accounts for jurisdiction-specific nuance — are not the tasks any current benchmark cleanly evaluates.

A 2023 study showed that removing contaminated examples from the GSM8K test set (a benchmark of grade-school math word problems used to evaluate arithmetic reasoning) produced accuracy drops of up to 13% for some models, meaning a meaningful portion of high scores were driven by training set overlap rather than genuine mathematical reasoning. Strong benchmark performance, weaker generalization performance: this pattern is visible enough in research that it should be the prior assumption for any published score, until evidence argues against it in a specific case.
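The contamination checks behind findings like the GSM8K result are conceptually simple even when the engineering at scale is not. A minimal sketch of the n-gram overlap approach; the n-gram length and threshold here are illustrative choices, and published studies also use embedding similarity and longest-common-substring variants:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_item: str, training_docs: list[str],
                       n: int = 13, overlap_threshold: float = 0.3) -> bool:
    """Flag a test item when a large fraction of its n-grams appear verbatim
    in some training document -- a common proxy for train/test leakage."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False
    return any(
        len(item_grams & ngrams(doc, n)) / len(item_grams) >= overlap_threshold
        for doc in training_docs
    )
```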

A model's published benchmark score predicts production performance only when three conditions hold: the benchmark tests tasks similar to your use case, the test set is clean of training data contamination, and the benchmark hasn't saturated to the point where score differences are statistically meaningless. Those three conditions are rarely all true simultaneously for any current benchmark relative to any specific enterprise use case.

What Benchmark Scores Should Tell You

The population of people who need to make accurate inferences from AI benchmark data in 2026 includes not just ML researchers but CAIOs (Chief Artificial Intelligence Officers), procurement teams, regulators, and policy professionals trying to determine whether a given system can safely and reliably perform a given function. The benchmarks discussed in this episode were designed for the first population. They are being used, often uncritically, by the second.

Treat general benchmarks as elimination criteria, not selection criteria. The benchmarks that provide real signal in April 2026 are SWE-bench Verified for software engineering tasks, measuring real GitHub issue resolution rather than synthetic code completion; GPQA Diamond for scientific reasoning at doctoral level; and AIME 2025 (the American Invitational Mathematics Examination, used here as a rigorous test of advanced mathematical reasoning) for mathematical reasoning. If a model scores significantly below the frontier on these, that's informative — it marks a capability ceiling likely to matter. When models cluster within the frontier band on all of them, as they increasingly do, general benchmarks can no longer tell you which model to deploy in your specific context.

No single model dominates every task. Claude Opus 4.6 leads on coding and nuanced writing, GPT-5.4 excels at structured reasoning and computer use, Gemini 3.1 Pro leads on abstract reasoning and scientific benchmarks, and new open-source entrants now rival frontier proprietary models on SWE-bench. This differentiation emerges only when you evaluate against the specific task distribution of your deployment context. No general leaderboard captures it.

The implication for organizations with AI governance responsibilities is direct: demand task-specific evaluations, not general leaderboard positions. If you are deploying AI for legal contract review, build or commission an evaluation set of contracts your organization processes, annotated by senior attorneys, with clearly defined success criteria. If you are deploying AI for biomedical literature synthesis, build an evaluation on the literature your scientists read. Before comparing two scores, verify that they used the same N-shot setting, the same chain-of-thought setting, the same temperature, the same max tokens, the same test set version, and the same success criterion. If any of these differ, the comparison is invalid. This methodology demand is not perfectionism — it is the minimum evidentiary standard that any serious procurement process in a regulated industry already applies to vendor claims in other domains, from financial risk models to clinical diagnostic software, and there is no principled reason AI capability claims should receive less scrutiny.
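That checklist is mechanical enough to encode, and encoding it is a reasonable procurement artifact in its own right. A sketch, with field names that are illustrative rather than any vendor's actual reporting schema:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EvalConfig:
    benchmark: str          # e.g. "GPQA Diamond"
    test_set_version: str   # exact dataset revision or commit
    n_shot: int             # number of in-context examples
    chain_of_thought: bool  # extended reasoning on or off
    temperature: float
    max_tokens: int
    metric: str             # e.g. "pass@1"

def comparability_gaps(a: EvalConfig, b: EvalConfig) -> list[str]:
    """Return the settings on which two reported scores differ; an empty list
    is the minimum condition for treating the scores as comparable."""
    return [f.name for f in fields(EvalConfig)
            if getattr(a, f.name) != getattr(b, f.name)]
```

If the returned list is non-empty, the two scores are answers to different questions, and ranking models by them is exactly the kind of comparison this episode has been arguing against.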

This is the minimum due diligence for a consequential deployment decision. And it requires understanding what the existing benchmarks can and cannot tell you — which is precisely what benchmark saturation reveals: not that models have stopped improving, but that the instruments built to measure improvement in 2020 and 2021 are no longer calibrated to the range where the actual action is happening.

The organizations still citing MMLU scores in their AI strategy documents are revealing that they haven't yet grappled with what the measurement shift means: the capability question and the deployment question have diverged. Frontier models are extraordinary on the tasks these benchmarks were designed to probe. The critical unknowns are the gaps — the specific production tasks these systems will face, the failure modes no academic benchmark was designed to reveal, and the reliability under distribution shift that won't appear until the system is operating at scale on your data. That is the evaluation challenge that matters, and it doesn't have a leaderboard.