Test-Time Compute Scaling: The New Axis of AI Capability
The Assumption That Ran the Field
Until September 2024, the operating premise of the entire AI industry was deceptively simple: a model's capability is determined at training time. You pour compute into pretraining, you run reinforcement learning from human feedback (RLHF) to align it, you deploy it, and what you get at inference is a fixed function of those upstream choices. More compute, more data, more parameters — that was the recipe. The Chinchilla paper (Hoffmann et al., 2022) formalized this intuition into a scaling law, telling you exactly how to allocate training budget between model size and token count for optimal performance. Infrastructure teams built around it. Benchmark leaderboards assumed it. Every apples-to-apples comparison between models rested on it.
The intuition was incomplete in a way that took years to notice, because the gap it left was invisible under normal conditions.
The Bitter Lesson — Richard Sutton's influential argument that general methods leveraging computation are ultimately the most effective — was understood to be about learning. The field forgot it is also about search. Pretraining scales learning. But search — the process of exploring possibility space to find solutions — was largely absent from how frontier language models operated at inference time. Each query got a single forward pass, or something close to it. The model generated tokens autoregressively, committed to each one, and delivered whatever came out. It could not pause, reconsider, backtrack, or deliberately explore alternative reasoning paths. Its "intelligence" was crystallized at training time and retrieved at inference time, like reading from a lookup table built at enormous expense.
Chain-of-thought prompting, formalized in Wei et al.'s 2022 paper and made famous by the zero-shot "Let's think step by step" variant (Kojima et al., 2022), cracked this open slightly. Simply asking GPT-3 to explain its reasoning step-by-step dramatically improved its performance — and this trick was so successful that frontier labs began explicitly selecting for chain-of-thought reasoning via system prompts, prompt distillation, or instruction fine-tuning. But this was still surface-level. The model was producing more tokens before its answer, and those tokens helped, but the compute budget per problem was still roughly fixed. You were not allocating more inference compute to harder problems; you were just changing the token format. The model had no mechanism to spend more time on a difficult question than an easy one, no way to verify its own intermediate steps, and no training signal that rewarded the quality of reasoning as a process rather than the correctness of outputs as outcomes.
The gap between "generating a chain of thought" and "training a model to reason well" turned out to be enormous. That gap is what o1 crossed.
What o1 Does
In September 2024, OpenAI released o1, its first "reasoning model" — a system that exhibits test-time scaling behavior, completing a missing piece of the Bitter Lesson and opening a new axis for scaling compute. The framing matters. This was not a larger model. It was not trained on more data in any conventional sense. It was a model trained to do something qualitatively different at inference time: think.
o1 learned to scale search during inference — not through explicit search algorithms, but by being trained via reinforcement learning to improve implicit search through chain of thought. The key phrase is "trained via RL." This is not prompting. This is not instruction fine-tuning. This is a model shaped by a training process to generate long internal reasoning traces — to work through problems, notice errors, reconsider approaches, and eventually arrive at an answer that its own reasoning process endorses. The reasoning trace is not cosmetic output. It is computation happening in token space.
OpenAI's large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a data-efficient training process. The performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute) — and the constraints on scaling this approach differ substantially from those of large language model pretraining.
That last sentence deserves a moment. Training-time scaling is bounded by data quality, compute availability, and the increasingly murky question of whether enough high-quality internet text exists to feed the next model generation. Test-time scaling is bounded by something different: your willingness to spend compute on inference, and the existence of verifiable problems where more thinking helps. For a large and important class of problems — mathematics, formal verification, code synthesis, scientific reasoning — those problems exist in abundance.
The concrete performance numbers make the mechanism's power undeniable. On the 2024 AIME (American Invitational Mathematics Examination) exams — designed to challenge the brightest high school math students in America — GPT-4o solved only 12% of problems on average. o1 averaged 74% with a single sample per problem, 83% with consensus among 64 samples, and 93% when re-ranking 1,000 samples with a learned scoring function. That 93% corresponds to 13.9 of the exam's 15 problems, a score that places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad. That progression — 74%, 83%, 93% — is not different models. It is the same model with different inference budgets. The capability is not fixed. It scales with compute spent at runtime.
This is genuinely new. Not new in the sense of "nobody had thought about it." New in the sense of "it now works at frontier scale and changes what the benchmarks mean."
The Machinery of Correct Reasoning: Process vs. Outcome Reward Models
To understand why o1-class models can reason better with more compute, rather than just producing more confident-sounding wrong answers, you need to understand the signal that shapes them during training. The critical distinction is between process reward models (PRMs) and outcome reward models (ORMs) — a distinction that sounds technical but has direct implications for what a model learns.
An outcome reward model evaluates a completed solution and asks a single question: is the final answer correct? It receives a binary signal. An ORM outputs a probability of correctness at every token in the sequence, but it is judged only by the final answer — reasoning errors are not captured in the ORM training process. The consequences are subtle but serious. A model trained against an ORM learns to produce outputs that look like they will be correct. It does not learn that each step of reasoning must be valid. It learns the statistical signatures of correct-looking solutions. Because an ORM has no step-level training signal, when it scores a solution highly it is saying "this kind of solution tends to be correct," not "each step in this specific solution is valid" — so a model optimized against one can learn to produce chains of thought that are superficially convincing while containing substantive errors that happen not to affect the final numerical answer.
A process reward model operates differently. A PRM scores each step of a reasoning chain, providing dense supervision: a score for every step. This signal is more informative and allows the training or search process to understand where in a chain of thought things go right or wrong. OpenAI's 2023 paper "Let's Verify Step by Step" (Lightman et al.) formalized this approach and demonstrated its superiority empirically. In those experiments, a PRM selecting the best of 1,860 generated solutions achieved 78.2% accuracy on MATH benchmark problems, compared to 72.4% for an ORM with the same selection budget — and the gap widens at intermediate selection budgets like best-of-100, where the PRM's step-level discrimination provides the most benefit.
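To make the distinction concrete, here is a minimal Python sketch of how the two kinds of reward model are used to rank candidate solutions at selection time. The function names and interfaces are illustrative assumptions, not OpenAI's implementation; the aggregation rule for the PRM (scoring a solution by the product of its per-step correctness probabilities) follows the approach described in Lightman et al.

```python
from typing import Callable, List

def orm_select(solutions: List[str], orm_score: Callable[[str], float]) -> str:
    """Outcome supervision: one score per finished solution.

    orm_score(solution) estimates the probability that the FINAL ANSWER is
    correct. Nothing in this signal identifies which reasoning step failed.
    """
    return max(solutions, key=orm_score)

def prm_select(solutions: List[List[str]],
               prm_step_score: Callable[[List[str]], float]) -> List[str]:
    """Process supervision: one score per reasoning step.

    prm_step_score(steps_so_far) estimates the probability that the latest step
    is correct given the problem and all prior steps. A solution is ranked by
    the product of its step scores, so a single bad step drags the whole chain
    down -- the aggregation used in "Let's Verify Step by Step".
    """
    def solution_score(steps: List[str]) -> float:
        score = 1.0
        for i in range(len(steps)):
            score *= prm_step_score(steps[: i + 1])
        return score

    return max(solutions, key=solution_score)
```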
The structural difference matters for a non-obvious reason. The ORM's automatic grading is not perfectly reliable: solutions that reach the correct answer through incorrect reasoning will be misgraded as successes. A model trained on those false positives learns to replicate reasoning patterns that happened to stumble onto correct answers — not because the reasoning was valid, but because it resembled the kind of reasoning that tends to produce correct answers. This is reward hacking at the level of logical structure. The model learns to seem right rather than be right. A concrete instance: a model solving a combinatorics problem might cancel terms incorrectly in an intermediate step, accidentally arrive at the right numerical answer, and receive full ORM credit — teaching the model that that particular cancellation move is valid when it is not.
PRMs close this loophole at a cost. Human annotators must read each reasoning step and judge whether it is correct given the problem context and prior steps — requiring domain expertise, particularly for mathematical reasoning, which means annotation is expensive and slow. Building the training data for a high-quality PRM requires humans or trusted automated systems to evaluate individual reasoning steps, not just final answers. This is feasible in mathematics and code, where ground truth can be mechanically verified. It becomes much harder in open-ended domains where "this step of reasoning is valid" is itself a contested judgment. The PRM vs. ORM choice is therefore not purely technical — it shapes which domains can be effectively targeted with reasoning models and which remain out of reach.
DeepSeek-R1 and the Geopolitical Significance of Open Reasoning
For roughly four months after o1's September 2024 release, inference-time reasoning at this scale was OpenAI's exclusive capability. That changed in January 2025.
DeepSeek-R1 (DeepSeek-AI et al., 2025) is an open-weight replication of o1-level reasoning capabilities, built on a different training approach and released with full technical transparency about its methods. The paper's central claim is striking: the reasoning abilities of large language models can be incentivized through pure reinforcement learning, without requiring human-labeled reasoning trajectories. The proposed RL framework facilitates emergent development of advanced reasoning patterns — self-reflection, verification, and dynamic strategy adaptation — achieving strong performance on verifiable tasks including mathematics, coding competitions, and STEM fields.
The RL algorithm they used is Group Relative Policy Optimization, or GRPO. Where PPO (the standard RLHF algorithm) requires training a separate value network roughly the size of the policy itself — doubling training compute — GRPO samples a group of outputs for each problem, ranks them by reward, and uses within-group relative performance as the training signal. GRPO computes the relative advantages of each group member based on their assigned rewards, estimates those advantages directly from the intra-group reward distribution rather than from an explicit value function, and updates policy parameters to maximize expected reward while minimizing divergence from a reference policy through KL divergence (a standard measure of how far two probability distributions have drifted from each other). By eliminating the need for a separate value network, GRPO offers a simplified alternative to actor-critic methods such as PPO. In practice, this means DeepSeek could run more RL training iterations within the same hardware budget — a meaningful efficiency advantage when the training runs are themselves extremely expensive.
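The following is a minimal sketch of the group-relative advantage computation at the core of GRPO, assuming one scalar reward per sampled output. The clipped policy-gradient update and the KL penalty against the reference policy are omitted; the normalization (reward minus group mean, divided by group standard deviation) follows the published GRPO formulation.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO advantages for one group of outputs sampled from the same prompt.

    Each output's advantage is its reward standardized against its own group:
        A_i = (r_i - mean(r)) / std(r)
    The group itself is the baseline -- no learned value network is needed.
    """
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards)
    if std_r == 0.0:  # every output scored the same: no learning signal from this group
        return [0.0 for _ in rewards]
    return [(r - mean_r) / std_r for r in rewards]

# Example: four sampled solutions to one math problem, rewarded 1.0 if the final
# answer verified as correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```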
The results were striking even to the researchers. For DeepSeek-R1-Zero — the variant trained with pure RL and no supervised fine-tuning — the pass@1 score on AIME 2024, the percentage of problems solved correctly on a single attempt, increased from 15.6% to 71.0% over the course of training, and with majority voting improved further to 86.7%, matching the performance of OpenAI's o1. More interesting than the final number was what happened during training. Through reinforcement learning, the model naturally learned to allocate more thinking time when solving reasoning tasks — without any external adjustments. The model discovered, on its own, that longer reasoning traces produced better rewards. It was not told to think more carefully. It learned that careful thinking was instrumentally useful.
DeepSeek also observed what they called the "Aha moment": with increased test-time compute, the model exhibits sophisticated behaviors such as reflection, where it revisits and reevaluates previous steps, and exploration of alternative problem-solving approaches. The model figured out on its own that rethinking its approach leads to better answers. This emergence of self-correction without explicit supervision for self-correction is philosophically interesting and practically significant: you are not engineering a reasoning process, you are creating conditions under which a reasoning process emerges.
The geopolitical implications of DeepSeek-R1's release deserve direct attention. Reasoning model training is no longer OpenAI-exclusive. A Chinese AI lab produced a comparable capability in January 2025, open-sourced it under a permissive license, and published the recipe. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL, and achieves performance comparable to OpenAI-o1 on reasoning tasks. Any well-resourced lab in any country can now read that paper, reproduce the training pipeline, and deploy reasoning-capable models without OpenAI's involvement. The strategic moat around inference-time reasoning is not proprietary architecture — it is scale, data quality, and the engineering sophistication to execute the training pipeline reliably. Those are significant barriers, but they are not the same as exclusive access to a secret technique.
o3, ARC-AGI, and the Limits of Benchmark Comparison
December 2024 brought o3, and with it a result that forced a rethinking of what AI benchmark scores measure.
o3 scored an unprecedented 75.7% on ARC-AGI (the Abstraction and Reasoning Corpus benchmark, designed by François Chollet to test fluid intelligence — the ability to recognize novel patterns and apply abstract rules to problems that cannot be solved by pattern-matching against training data) under standard compute conditions, with a high-compute version reaching 87.5%. Previous-generation models could not reach these results regardless of how much inference compute was applied to them. It took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. Claude 3.5 Sonnet peaked at 14%. o1-preview reached 18%. Then o3 jumped to 75.7% at standard compute and 87.5% at high compute — a step-function increase in novel task adaptation ability not previously seen in GPT-family models.
The compute configuration behind the 87.5% score requires precise understanding. OpenAI tested at two levels of compute with variable sample sizes: 6 samples (high-efficiency) and 1,024 samples (low-efficiency, 172 times more compute). The 87.5% score required that 172x configuration. Each ARC-AGI puzzle at high compute cost the equivalent of thousands of dollars in inference spend. This is not a casual deployment mode. It is an existence proof: given sufficient inference compute, this model can solve problems that were previously beyond any AI system.
Here is where the benchmark comparison problem becomes acute. o3 at high compute is not meaningfully comparable to GPT-4o at standard inference. It is not even meaningfully comparable to o3 at low compute. You are looking at the same underlying model, evaluated at different points along an inference-compute axis that did not exist for the models that established prior benchmarks. When a leaderboard shows GPT-4o at 5% ARC-AGI and o3 at 87.5% ARC-AGI, those numbers are not describing two different models' fixed capabilities — they are describing two different model-plus-compute configurations. The compute is part of the result.
The same problem applies to simpler inference-time strategies. Best-of-N sampling — generating N candidate solutions and selecting the best using a verifier or majority vote — is a technique any model can use, regardless of whether it was trained for extended reasoning. On AIME, o1 averaged 74% with a single sample but 83% with consensus among 64 samples and 93% when re-ranking 1,000 samples with a learned scoring function. Majority voting among N completions is also a general-purpose technique: generate many solutions, take the plurality answer. Both approaches trade compute for accuracy without any architectural changes. A GPT-4o running best-of-100 sampling with a strong verifier is doing something qualitatively different from a GPT-4o generating a single completion, even though the model weights are identical.
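Both strategies are simple enough to sketch. What follows is an illustrative Python sketch rather than any vendor's implementation; generate and verifier_score are placeholders for whatever sampling API and learned scoring function are actually available.

```python
from collections import Counter
from typing import Callable, List

def majority_vote(answers: List[str]) -> str:
    """Parallel scaling without a verifier: sample N final answers and return
    the plurality answer (consensus / self-consistency)."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates: List[str], verifier_score: Callable[[str], float]) -> str:
    """Parallel scaling with a verifier: sample N candidate solutions and return
    the one a learned scorer (e.g. a PRM or re-ranker) rates highest."""
    return max(candidates, key=verifier_score)

# Usage sketch -- generate(problem) stands in for one sampled completion from any
# model; neither strategy requires the model to have been trained for reasoning.
#   answers    = [generate(problem) for _ in range(64)]    # consensus among 64 samples
#   answer     = majority_vote(answers)
#   candidates = [generate(problem) for _ in range(1000)]  # re-rank 1,000 samples
#   answer     = best_of_n(candidates, verifier_score)
```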
This creates a measurement problem the field has not fully resolved. Test-time scaling has two dimensions — sequential and parallel. Sequential scaling increases test-time compute by extending chain-of-thought length, while parallel scaling samples multiple solutions and picks the best one. Both dimensions shift reported benchmark scores without changing model weights. Comparing pass@1 at standard compute across models is coherent. Comparing o3 at 172x compute with majority voting against GPT-4's single-pass answer is not coherent — it is comparing different budget configurations of different models and calling the result a capability comparison.
This is not merely an academic concern. Procurement decisions, policy evaluations, and deployment choices are being made based on benchmark numbers that mix these evaluation modes. A Chief AI Officer reading that o3 scored 87.5% on ARC-AGI needs to understand that this score was achieved at a specific inference budget that would be economically prohibitive at most realistic scales, and that the same model at standard compute scored 75.7% — still remarkable, but a different number with a different cost structure.
Where the Clean Story Breaks
The o1/o3 narrative is compelling and, in its broad outlines, correct. Several complications deserve careful attention from anyone making decisions based on reasoning model capabilities.
The first complication is about what "thinking more" means for these models. Research examining whether o1-class models exhibit true sequential test-time scaling — whether longer reasoning traces reliably produce better answers — has found the picture murkier than the original results suggested. For R1 and QwQ (an open-source reasoning model from Alibaba's Qwen team), extending solution length does not necessarily yield better performance due to the models' limited self-revision capabilities. The models may generate longer reasoning without effectively revising their approach. Parallel findings attribute this to model "underthinking," where models initially reach correct intermediate solutions but subsequently deviate toward incorrect conclusions during extended reasoning. More tokens of thought is not the same as better thought. The scaling is real, but it is not monotone, and it does not simply mean "let the model run longer on every problem."
The second complication concerns GRPO itself. Research has identified an optimization bias in Group Relative Policy Optimization that artificially increases response length — especially for incorrect outputs — during training. A model trained with GRPO may learn to generate longer responses not because longer responses are better reasoned, but because the optimizer rewards length as a proxy for quality under certain data distributions. This is reward hacking of a subtler kind than the ORM false-positive problem: the model learns that certain structural features of "effortful" responses — length, use of reflective language like "wait" or "let me reconsider" — correlate with reward, and it learns to produce those features without necessarily producing better reasoning. The "Aha moment" behavior observed in DeepSeek-R1-Zero may be genuine emergent reasoning, or it may partly be a learned performance of reasoning. The model discovered that saying "I should reconsider this" is rewarded, independent of whether the reconsideration is substantive.
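One way to see how such a bias can arise, as a numerical illustration rather than a claim about any specific training run: when the sequence loss is averaged over each response's own length, as common GRPO implementations do, a wrong answer stretched across many tokens receives a smaller per-token penalty than the same wrong answer stated briefly, so incorrect responses drift longer under optimization. All numbers below are invented for illustration.

```python
def per_token_penalty(advantage: float, response_length: int) -> float:
    """Per-token gradient weight when the loss is averaged over the response's
    own length. For a negative advantage (a wrong answer), a longer response
    spreads the same penalty over more tokens, so each token is punished less."""
    return advantage / response_length

wrong_short = per_token_penalty(advantage=-1.0, response_length=100)
wrong_long = per_token_penalty(advantage=-1.0, response_length=1_000)
print(wrong_short, wrong_long)  # -0.01 vs -0.001: the long wrong answer is
                                # penalized an order of magnitude less per token
```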
The third complication is domain specificity. Reasoning models perform dramatically better on problems with verifiable, objective answers — mathematics, formal logic, code execution results. The reward used for R1-Zero relies on verifiable correctness checks: Does the generated code compile? Does the mathematical expression give the correct result? Does the generated answer follow the specified format? These checks are automatable and reliable. But the reasoning model paradigm depends on training against correct reasoning at scale, which requires that "correct" be mechanically determinable. For open-ended policy analysis, strategic judgment, risk assessment, or any domain where the quality of reasoning is itself contested, the PRM/ORM training infrastructure does not apply. o3 at 87.5% on ARC-AGI tells you nothing direct about whether the same model can reason more carefully about geopolitical risk or regulatory strategy — domains without ground-truth verification.
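Here is a sketch of what a rule-based, verifiable reward looks like in the style of R1-Zero's accuracy and format rewards. The specific tags, regular expressions, and the 0.1 format bonus are illustrative choices, not DeepSeek's exact implementation; the point is that every check is mechanical.

```python
import re

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Rule-based reward in the spirit of R1-Zero: no learned reward model, only
    mechanical checks -- which is exactly why the recipe works for math and code
    and not for open-ended judgment."""
    reward = 0.0

    # Format check: did the model separate its reasoning from its final answer?
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.DOTALL):
        reward += 0.1

    # Accuracy check: extract the final answer and compare against ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward
```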
The fourth complication is the ARC-AGI result itself. Chollet noted that ARC-AGI v1 is nearing saturation, with ensemble approaches already scoring above 81%, and that the forthcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3 — potentially reducing its score to under 30% at high compute, while a human with no training would still score over 95%. A model that reaches 87.5% on the saturating first benchmark at maximum compute, yet is projected to fall below 30% on its successor while untrained humans remain above 95%, is not demonstrating the same capability in both settings. ARC-AGI was designed to resist pattern-matching from training data, yet o3 was tuned on 75% of the ARC-AGI public training set. The extent to which the high-compute score reflects genuine fluid reasoning versus very expensive near-memorization remains unresolved. The sharp projected drop to under 30% on ARC-AGI-2 — a benchmark specifically redesigned to close the loopholes that o3's training exposure may have exploited — would, if confirmed, suggest the 87.5% score overstates generalizable reasoning ability by a substantial margin.
One further wrinkle. DeepSeek-R1-Zero showed that reinforcement learning at scale can directly enhance reasoning capabilities without supervised fine-tuning — but critical examination of R1-Zero-like training raises questions about both the base models and the RL process. The base model's pretraining characteristics matter enormously: DeepSeek-V3-Base already exhibited "Aha moment" behaviors before RL training, while Qwen2.5 base models demonstrated strong reasoning capabilities without prompt templates, suggesting pretraining biases were already present. The emergent reasoning behaviors may be less "emerged from RL" and more "unlocked from pretraining" — a subtler and less generalizable phenomenon.
None of this negates the core finding. Test-time compute scaling is real. The same trained model produces genuinely different capability levels depending on inference budget. This is a new axis of AI capability that breaks prior assumptions and requires new evaluation frameworks. But the mechanisms are messier than the clean story, the domain specificity is real, and the benchmark interpretation problem is serious enough to distort decisions if readers take headline numbers at face value.
The Strategic Implication You Have to Live With
The old question — "what can this model do?" — assumed a static answer. Deploy it, run it, measure it. The answer was a number on a leaderboard, comparable across models because everyone was spending roughly the same compute per query.
That question no longer has a static answer for o1-class systems. The correct question is: "what can this model do at what inference cost?" That requires organizations to make a choice that was previously invisible — how much inference compute are you willing to spend per problem, and on which problems? A reasoning model running at low compute is a different product than the same model at high compute. The capability difference is not marginal. It is the difference between GPT-4o's 5% on ARC-AGI and o3's 87.5% — a gap that accumulated over four years of pretraining scaling and that inference-time compute then compressed dramatically for specific problem types.
Consider what this means structurally. One AI system's inference time is a future AI system's training time. The reasoning traces generated by o1 and its successors are being used to create training data for subsequent model generations — distilling extended chain-of-thought back into models that can reason more efficiently without burning as much inference compute. The training/inference boundary is not fixed. It is a loop. The compute you spend on inference today trains the model that tomorrow's deployment runs more cheaply. DeepSeek's own pipeline illustrates this concretely: R1's long reasoning traces were used to fine-tune smaller "distilled" versions — including a 7-billion-parameter model — that retain meaningful reasoning capability at a fraction of the inference cost, by learning directly from the larger model's extended thought process rather than from scratch via RL.
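A sketch of the distillation step follows, turning the teacher's expensive reasoning traces into ordinary supervised fine-tuning data for a smaller student model. The field names and the <think> formatting are assumptions for illustration; the substance is that the student learns by plain next-token prediction on text the teacher spent inference compute to produce, with no RL required on the student side.

```python
import json
from typing import Dict, List

def build_distillation_example(problem: str, teacher_trace: str, teacher_answer: str) -> Dict[str, str]:
    """Package one (problem, long reasoning trace, final answer) triple -- produced
    by running the large reasoning model at a generous inference budget -- into a
    plain SFT example for the smaller model."""
    return {
        "prompt": problem,
        "target": f"<think>\n{teacher_trace}\n</think>\n{teacher_answer}",
    }

def write_sft_dataset(triples: List[Dict[str, str]], path: str) -> None:
    """Write the distillation corpus as JSONL, one training example per line."""
    with open(path, "w") as f:
        for t in triples:
            example = build_distillation_example(t["problem"], t["trace"], t["answer"])
            f.write(json.dumps(example) + "\n")
```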
There is ongoing debate about whether traditional AI scaling laws — more compute, more data, larger models — still hold, or whether test-time scaling with different inference architectures represents the next path forward. That debate is not settled. Both axes appear real. The question is how they interact, what their respective ceilings are, and whether reasoning models will remain expensive-per-query systems or whether the distillation loop will eventually produce models that think carefully without thinking slowly.
The practitioner's challenge is immediate: current evaluation infrastructure — benchmarks, leaderboards, vendor scorecards — was built for a world where models had fixed capability. Comparing o3 at 172x compute against GPT-4 at standard inference is not a fair fight, but it is routinely presented as one. The number you see on a reasoning model benchmark is always a specific point on a capability-cost curve, not a description of the model.
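A back-of-the-envelope way to translate a benchmark configuration into a production cost estimate before comparing headline numbers: multiply samples by tokens by unit price. Every value here (sample counts, tokens per sample, price per token) is a placeholder to be replaced with figures from your own vendor contract and traffic profile.

```python
def cost_per_query(samples: int, avg_tokens_per_sample: int, usd_per_million_tokens: float) -> float:
    """Rough inference cost of reproducing one benchmark-style query in production:
    (sampled completions) x (tokens per completion) x (unit token price)."""
    return samples * avg_tokens_per_sample * usd_per_million_tokens / 1_000_000

# Same model, two points on its capability-cost curve (all numbers illustrative):
low_budget = cost_per_query(samples=6, avg_tokens_per_sample=30_000, usd_per_million_tokens=60.0)
high_budget = cost_per_query(samples=1024, avg_tokens_per_sample=30_000, usd_per_million_tokens=60.0)
print(f"low:  ${low_budget:,.2f} per query")   # $10.80
print(f"high: ${high_budget:,.2f} per query")  # $1,843.20
```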
Before that number enters any procurement or deployment decision, ask one question: what compute budget was that score achieved at, and what would it cost to achieve that score in your actual production environment? The gap between the benchmark compute and your production compute may be precisely where the capability you were promised disappears.