M4E1: Chinchilla Scaling Laws: Why Frontier Models Are Deliberately Overtrained
The Consensus That Was Wrong
By the time OpenAI released GPT-3 in May 2020, a fairly stable set of beliefs had settled across the major AI labs about how to build the most capable language models. The logic was nearly self-evident: more parameters capture more structure in language, more structure means better generalization, better generalization means better performance. Scale the model. The data would follow. This wasn't arbitrary intuition — it was grounded in the Kaplan et al. scaling laws published earlier that year, which suggested that under a fixed compute budget, model size should be prioritized over dataset size. The implication, widely adopted, was that the correct move was to train progressively larger models on a roughly fixed corpus of text.
Before the Chinchilla research, the dominant approach had been to increase model size, often at the expense of training data. GPT-3 was the purest expression of this philosophy. Trained as an autoregressive language model with 175 billion parameters — ten times more than any previous non-sparse language model — it was simultaneously the most impressive and, as would become clear two years later, one of the most underutilized training investments in the field's history. It was trained on approximately 300 billion tokens, a ratio of roughly 1.7 tokens per parameter.
That ratio — 1.7 tokens per parameter — seems almost absurd in retrospect. GPT-3's 175 billion parameters saw, on average, fewer than two training examples per weight. The model was enormously expressive in its capacity, yet chronically deprived of the signal needed to fully exercise that capacity. The largest dense transformer at the time, MT-NLG 530B, was over three times larger than GPT-3 yet had been trained on a comparable number of tokens — around 300 billion. The field was running a race by adding legs to the runner while keeping the distance fixed. Nobody had rigorously asked whether the race distance itself was the problem.
The reason this went unchallenged for so long involves a structural incentive misalignment at the institutional level. Training large models was expensive, novel, and headline-generating. The question of how many tokens to train on was treated as secondary — a logistical detail rather than a design variable. When GPT-3 achieved remarkable few-shot performance on a wide range of tasks, there was no obvious mechanism to ask whether an even smaller model trained on far more data might have matched or exceeded it. You don't run the counterfactual you don't know to run.
What Hoffmann et al. Proved
In March 2022, a team at DeepMind led by Jordan Hoffmann published a paper that would become one of the most consequential empirical results in the history of language model training. The paper's title was deliberately plain: "Training Compute-Optimal Large Language Models." The result it contained was not plain at all.
The paper investigated the optimal model size and number of tokens for training a transformer language model under a given compute budget. It found that the large language models of the day were significantly undertrained, a consequence of the field's focus on scaling model size while keeping the amount of training data roughly constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, the authors found that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also be doubled.
The methodology matters. The authors used three complementary methods: training many models at fixed compute budgets while varying the model size, then fitting a power law to find optimal model size; fitting parametric loss functions to training runs and analytically minimizing under a compute constraint; and fitting individual loss curves to extrapolate optimal pairs at many compute levels. All three methods converged on the same relationship, with approximately 20 tokens per parameter. This wasn't a single-experiment result — it was convergent evidence from three independent analytic approaches, each returning the same number.
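That 20-tokens-per-parameter heuristic turns directly into a back-of-the-envelope calculator. The sketch below is not the paper's code; it combines the rule of thumb with the standard approximation that training a dense transformer costs about 6 FLOPs per parameter per token (C ≈ 6ND), so that equal scaling with D ≈ 20N implies N ≈ sqrt(C / 120). The 5.76e23 FLOP example budget is roughly Gopher's reported training compute.

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb recovered by all three methods
FLOPS_PER_PARAM_TOKEN = 6  # standard dense-transformer approximation: C ~ 6 * N * D

def compute_optimal(budget_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that exhaust the budget at ~20 tokens per parameter.

    With C = 6 * N * D and D = 20 * N, C = 120 * N**2, so N = sqrt(C / 120).
    """
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Gopher's training budget was roughly 5.76e23 FLOPs; the same budget, reallocated
# this way, lands close to Chinchilla's 70B parameters trained on 1.4T tokens.
for budget in (1e21, 5.76e23, 1e25):
    n, d = compute_optimal(budget)
    print(f"C = {budget:.2e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.2f}T tokens")
```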
To prove the hypothesis in practice, Hoffmann et al. built Chinchilla itself — a 70-billion-parameter model trained on 1.4 trillion tokens. The comparison was clean by design: Gopher and Chinchilla had the same training compute budget, but Chinchilla used roughly four times fewer parameters and four times more data. The same total FLOPs, allocated differently.
The results were not marginal. Chinchilla uniformly and significantly outperformed Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across a large range of downstream evaluation tasks. On MMLU — the Massive Multitask Language Understanding benchmark covering 57 academic subjects — Chinchilla reached a state-of-the-art average accuracy of 67.5%, more than 7 percentage points above Gopher. A model roughly a quarter the size had just beaten it decisively — not by architectural innovation, not by better fine-tuning, but simply by being trained the way it should have been trained all along.
The secondary implication of the paper, buried in the discussion but analytically significant: the energy cost of a large language model is not concentrated in training alone. It is amortized through usage for inference and fine-tuning, which means the benefits of a more optimally trained smaller model extend beyond improved benchmark performance. The paper planted a seed that the field would spend the next two years excavating.
What the Chinchilla result revealed, beneath its empirical surface, was a systemic misallocation that had persisted across all the major frontier labs simultaneously. GPT-3 was undertrained. Gopher was undertrained. A model like GPT-3 with 175 billion parameters should have been trained on roughly 3.5 trillion tokens to be compute-optimal — or, by the inverse argument, the 300 billion tokens it did see would have been better matched to a model more than ten times smaller, closer to 15 billion parameters. The organizations spending the most on frontier AI had all been making the same mistake, in the same direction, for the same reason. Kaplan et al.'s original scaling law paper had included a dataset size term in its loss formula, but the fitted exponents led to the conclusion that model size should scale faster than data. The Hoffmann analysis directly contradicted that conclusion, attributing the discrepancy in part to the earlier study's reliance on intermediate points of training runs whose learning-rate schedules were set for much longer training, which overestimated the loss of models trained on fewer tokens and so understated the value of additional data.
LLaMA and the First Correction
The immediate practical uptake of Chinchilla's findings was fastest not at OpenAI or Google, but at Meta. In February 2023, Meta's FAIR (Fundamental AI Research) team released LLaMA — the Large Language Model Meta AI — which became the most important open-weights model release in the history of the field to that point, and the paper that precipitated the open-source LLM movement.
The design philosophy behind LLaMA was explicit: build for inference efficiency. If you're going to deploy a model at scale, the cost you pay repeatedly is inference cost, not training cost. Training is a one-time event. Inference runs continuously for years. Building a smaller, better-trained model means every query is cheaper to serve, every deployment is faster to load, and every fine-tuning experiment requires fewer resources. The LLaMA models were explicitly designed following compute-optimal principles, with model sizes ranging from 7 billion to 65 billion parameters, each trained on significantly more data than previous models of similar size.
The headline result confirmed the Chinchilla thesis dramatically. Despite having a significantly smaller parameter count than GPT-3, the LLaMA 13B model outperformed GPT-3 on most benchmarks, while the 65B model achieved performance comparable to top models like Chinchilla-70B and PaLM-540B. A 13-billion-parameter model — one that, with quantization, fits on a single consumer GPU — beating the 175-billion-parameter model that had defined the frontier two years earlier. The difference wasn't magic. It was training tokens. The 65B and 33B models used approximately 1.4 trillion tokens, while the 7B and 13B models used about 1 trillion tokens.
This was the first correction the field needed to make, and LLaMA made it cleanly. The 65B model at roughly the Chinchilla-optimal 20:1 ratio demonstrated that compute-optimal training worked exactly as advertised. But LLaMA also, somewhat inadvertently, revealed something Chinchilla hadn't directly addressed: the 7B model, trained on 1 trillion tokens, was running at approximately 143 tokens per parameter — far past the Chinchilla optimum — and it was still improving at the end of training. The loss curves hadn't flattened. There was signal left on the floor.
Strict adherence to Chinchilla scaling leads to what researchers have called the "Chinchilla Trap" — you end up with a model that is too large and therefore expensive to run at large scale during inference. As Touvron et al. noted in the LLaMA paper, loss continues to decrease beyond the Chinchilla-optimal point. The Chinchilla laws told you how to maximize model quality for a given training compute budget. They said nothing about what happened if you had more data available and were willing to spend extra training compute to get a smaller model. They optimized training efficiency. The industry was beginning to ask a different question.
The Decision to Overtrain
LLaMA-2, released in July 2023, made the strategic intention explicit. Where LLaMA's 65B model trained at approximately the Chinchilla-optimal ratio, the LLaMA 2 family trained every size, from 7B to 70B, on 2 trillion tokens, far past the optimum for its smaller members. LLaMA 3, released in 2024, went further still. LLaMA 7B trained on 1 trillion tokens, LLaMA 2 7B trained on 2 trillion, and LLaMA 3 8B trained on 15 trillion tokens. That final number — 15 trillion tokens for an 8-billion-parameter model — represents a token-to-parameter ratio of roughly 1,875:1. The Chinchilla-optimal ratio is 20:1. LLaMA-3's 8B model was trained at nearly one hundred times the Chinchilla-optimal volume.
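Laid out generation by generation, the drift away from 20:1 is stark. The short script below simply recomputes the ratios from the token and parameter counts already cited above.

```python
# Token-to-parameter ratios for the small LLaMA-family models cited above,
# compared against the Chinchilla rule of thumb of ~20 tokens per parameter.
models = [
    ("LLaMA 7B",   7e9,  1e12),   # ~1T training tokens
    ("LLaMA 2 7B", 7e9,  2e12),   # ~2T training tokens
    ("LLaMA 3 8B", 8e9,  15e12),  # ~15T training tokens
]

for name, params, tokens in models:
    ratio = tokens / params
    print(f"{name:<12} {ratio:>7.0f} tokens/param "
          f"({ratio / 20:.0f}x the Chinchilla-optimal 20:1)")
```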
Each successive generation deliberately overtrained past the compute-optimal point, not because Meta's researchers forgot what Chinchilla said, but because they understood it — and chose to optimize for something different.
The underlying logic is a straightforward economic calculation, worth working through precisely. Chinchilla defines "compute-optimal" as minimizing training compute for a given target loss. Under that definition, training beyond the optimal point is waste — you're spending FLOPs on training that you could have spent on a larger model. But "compute-optimal training" is only optimal if training compute is the thing you care about minimizing. If your actual objective is total compute over the model's operational lifetime, the calculation changes entirely.
Inference costs depend on model size and the volume of user queries over the model's lifetime. That volume can be significant: demand for popular models can exceed billions of tokens per day. Training runs once. Inference runs millions of times, potentially for years. A model that required twice as much compute to train but is half the size will, over any sufficiently large inference volume, pay back that extra training cost many times over. Every training token is roughly three times more expensive than every inference token, since a forward-and-backward pass costs about 6N FLOPs per token for an N-parameter dense model while inference's forward pass alone costs about 2N; as a rough rule of thumb, each extra training token therefore needs to be amortized across at least three tokens' worth of cheaper inference before the compute overhead is worth it. Overtraining only makes sense for models that will receive high usage.
That condition — high usage — is precisely the situation every frontier lab faces when deploying a consumer-facing API. GPT-4 served hundreds of millions of users. LLaMA-3-8B was downloaded and deployed millions of times across organizations running it continuously on their own infrastructure. In these regimes, inference volume isn't measured in billions of tokens — it's measured in trillions over the model's service lifetime. Researchers at MosaicML (now Databricks) formalized this tradeoff in "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws," finding that model quality continues to improve as token-to-parameter ratios increase to extreme ranges — up to 10,000 tokens per parameter, or 500 times beyond Chinchilla-optimal — though further testing is needed at such extreme scales. The MosaicML paper introduced a combined cost function that adds training FLOPs and inference FLOPs together before optimizing, and showed that the optimal model size under this joint objective shrinks substantially as projected query volume increases — meaning that the higher the anticipated deployment traffic, the smaller the model you should train, and the longer you should train it.
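A minimal sketch of that joint objective follows. It is not the MosaicML code: it reuses the Chinchilla parametric loss, with the constants Hoffmann et al. report (approximately E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28), purely as a stand-in quality model, charges roughly 6N FLOPs per training token and 2N per inference token, and asks which model size minimizes the combined bill at a fixed target loss. The absolute numbers it prints should not be over-read; the point is the direction of the shift as projected inference volume grows.

```python
import numpy as np

# Chinchilla parametric loss L(N, D) = E + A / N**alpha + B / D**beta, using the
# approximate constants reported by Hoffmann et al. as a stand-in quality model.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_to_reach(loss_target: float, n_params: np.ndarray) -> np.ndarray:
    """Training tokens a model of size N needs to reach loss_target (inf if it can't)."""
    slack = loss_target - E - A / n_params**alpha
    d = np.full_like(n_params, np.inf)
    ok = slack > 0                    # models too small can never reach the target loss
    d[ok] = (B / slack[ok]) ** (1 / beta)
    return d

def best_size(loss_target: float, inference_tokens: float) -> tuple[float, float]:
    """Model size minimizing training + lifetime inference FLOPs at a target loss."""
    n = np.logspace(9, 12, 4000)                         # candidate sizes: 1B .. 1T params
    d_train = tokens_to_reach(loss_target, n)
    total = 6 * n * d_train + 2 * n * inference_tokens   # train fwd+bwd, inference fwd only
    i = np.argmin(total)
    return n[i], d_train[i]

# The higher the projected inference volume, the smaller (and longer-trained)
# the cost-minimizing model becomes.
for d_inf in (0.0, 1e12, 1e14):
    n_opt, d_opt = best_size(loss_target=2.0, inference_tokens=d_inf)
    print(f"lifetime inference {d_inf:.0e} tokens -> "
          f"~{n_opt / 1e9:.0f}B params, ~{d_opt / 1e12:.1f}T training tokens")
```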
This trend is driven in part by the popularity of smaller models in the 1B–70B parameter range that are easier and cheaper to fine-tune and deploy. The LLaMA-3-8B model, trained on 15 trillion tokens, can run on a single A100 GPU. The Chinchilla-optimal model with equivalent loss — roughly 500 billion parameters — cannot. The choice to overtrain the 8B model didn't just improve its benchmark scores; it put the model inside the hardware envelope of every serious organization that wanted to self-host.
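The hardware-envelope claim reduces to arithmetic on weight storage alone. The toy calculation below assumes 2 bytes per parameter (fp16/bf16 weights) and ignores KV cache and activation memory, which would only widen the gap.

```python
A100_MEMORY_GB = 80          # a single 80GB A100
BYTES_PER_PARAM_FP16 = 2     # fp16/bf16 weights; KV cache and activations excluded

for name, params in [("LLaMA-3-8B", 8e9), ("~Chinchilla-optimal equivalent", 500e9)]:
    weights_gb = params * BYTES_PER_PARAM_FP16 / 1e9
    verdict = "fits" if weights_gb <= A100_MEMORY_GB else "does not fit"
    print(f"{name}: ~{weights_gb:.0f} GB of weights -> {verdict} on one A100")
```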
The entire open-weight ecosystem followed. Mistral-7B, released in September 2023, pushed the same recipe far enough that a 7-billion-parameter model outperformed LLaMA-2's 13B model on most benchmarks. Qwen 3 from Alibaba followed a similar pattern. The Gemma models from Google DeepMind, the Phi series from Microsoft Research, and virtually every serious open-weight release from 2023 onward made the same choice: accept higher training compute, achieve a smaller and more capable model that costs less to serve continuously. The Chinchilla paper had defined the compute-optimal frontier. The industry collectively decided that compute-optimal training was not what they were optimizing for.
Data Curation as the Second Lever
The overtraining story only makes sense alongside a parallel development that received far less public attention: the industrialization of data curation. You cannot overtrain a model to 1,875 tokens per parameter if you don't have 1,875 times as many tokens as parameters, at sufficient quality to extract signal from each pass. LLaMA-3's 8B model needed 15 trillion tokens of usable training data. Assembling that corpus was, in many respects, as significant an engineering challenge as training the model itself.
The naive approach to data collection is simply to scrape Common Crawl — the non-profit web archive that provides periodic snapshots of the indexed internet — and tokenize whatever you find. The problem is that the raw web is overwhelmingly composed of low-quality text: duplicate content, spam, SEO-optimized boilerplate, machine-generated filler, and documents with no syntactic coherence. Training on this material at scale doesn't just fail to improve a model — it actively degrades it. The model learns patterns that shouldn't be learned, and the noise competes with genuine signal in ways that are difficult to diagnose downstream.
The research community responded by treating data curation as a first-class technical problem. FineWeb, introduced by Hugging Face, is a 15-trillion-token dataset derived from 96 Common Crawl snapshots that produces better-performing language models than other open pretraining datasets. The authors carefully documented and ablated all design choices, including investigations of deduplication and filtering strategies.
FineWeb's core contribution wasn't scale — it was the systematic study of which filtering decisions moved the needle. Filtering choices were validated by training proxy language models on candidate data slices and measuring downstream benchmark performance, not by heuristics alone. Effective deduplication removes near-identical content while preserving diversity, preventing the model from overfitting to repeated patterns in web data. Earlier pipeline designs had relied on hand-crafted rules about document length, language detection scores, and character n-gram frequencies. FineWeb ran ablations — trained actual models on data slices with and without each filtering step — and measured the downstream effect. Every processing decision had to earn its place. One concrete finding from those ablations: applying MinHash deduplication at the document level, rather than at the line or paragraph level, produced measurably better downstream benchmark scores, because paragraph-level deduplication was too aggressive and removed structurally similar but semantically distinct educational passages that the model benefited from seeing.
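The document-level MinHash step is conceptually simple, even though running it across 96 Common Crawl snapshots is an engineering project in its own right. The toy sketch below uses the datasketch library with word 5-gram shingles and an illustrative similarity threshold; these are stand-in parameters, not FineWeb's production settings.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # hash permutations per signature
THRESHOLD = 0.8   # approximate Jaccard similarity above which documents count as duplicates

def signature(text: str) -> MinHash:
    """MinHash signature over word 5-gram shingles of a single document."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 5]) for i in range(max(1, len(words) - 4))}
    m = MinHash(num_perm=NUM_PERM)
    for shingle in shingles:
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster, at the document level."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for i, doc in enumerate(docs):
        sig = signature(doc)
        if lsh.query(sig):            # a near-identical document was already kept
            continue
        lsh.insert(f"doc-{i}", sig)
        kept.append(doc)
    return kept
```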
FineWeb-Edu, a 1.3-trillion-token collection of educational text filtered from FineWeb, produced language models with markedly better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC (AI2 Reasoning Challenge, a benchmark testing grade-school science question answering). The mechanism is particularly instructive. To build the synthetic annotations, the Llama-3-70B-Instruct model was used to score 460,000 randomly sampled webpages from FineWeb's Common Crawl snapshot for educational quality on a scale from 0 to 5. Those scored examples then trained a lightweight classifier, which was applied to the entire corpus to identify and retain highly educational documents — discarding the vast majority that scored below the threshold.
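The pattern generalizes beyond FineWeb-Edu's specific components: an expensive LLM labels a small sample, a cheap model learns to imitate those labels, and the cheap model filters the full corpus. The sketch below is a deliberately simplified stand-in, using TF-IDF features and ridge regression in place of the embedding-based regressor actually released, with a cutoff around 3 on the 0-to-5 scale in the spirit of the educational filter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

def train_quality_scorer(sample_texts: list[str], llm_scores: list[float]):
    """Fit a lightweight regressor to LLM-assigned educational-quality scores (0-5)."""
    scorer = make_pipeline(TfidfVectorizer(max_features=50_000), Ridge())
    scorer.fit(sample_texts, llm_scores)
    return scorer

def filter_corpus(scorer, corpus: list[str], threshold: float = 3.0) -> list[str]:
    """Keep only documents whose predicted educational score clears the threshold."""
    predicted = scorer.predict(corpus)
    return [doc for doc, score in zip(corpus, predicted) if score >= threshold]
```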
DCLM — DataComp for Language Models — takes a different approach. It extracts a standardized corpus of 240 trillion tokens from Common Crawl dumps, then curates down to a 3.8 trillion token dataset using heuristic rules, MinHash and Bloom filter deduplication (both are probabilistic algorithms for identifying and removing near-duplicate documents at scale), and a fastText classifier (a lightweight text classification tool) trained on high-quality instruction-formatted data to keep the top 10% of documents. What DCLM demonstrated was that aggressive filtering — discarding ninety percent of available data — produces a corpus that, token for token, is far more informative for model training than an unfiltered or lightly filtered alternative. You can train a better model on 3.8 trillion carefully selected tokens than on 30 trillion tokens of average-quality text. The DCLM benchmark also established a controlled evaluation framework in which any research group could test a new filtering method against a fixed compute budget and a standardized held-out test set, making it possible to compare data curation strategies the same way benchmark leaderboards compare model outputs.
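A rough sketch of that final filtering stage is below, assuming a hypothetical training file of fastText-formatted lines ("__label__hq ..." for the high-quality positives). It scores every document and keeps the top decile, which is the shape of the DCLM-Baseline step, even though the real pipeline applies heuristic filtering and deduplication first and runs at a vastly larger scale.

```python
import fasttext
import numpy as np

def hq_probability(model, text: str) -> float:
    """Classifier probability that a document belongs to the high-quality class."""
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

def keep_top_decile(model, corpus: list[str]) -> list[str]:
    """Retain the top 10% of documents by predicted quality score."""
    scores = np.array([hq_probability(model, doc) for doc in corpus])
    cutoff = np.percentile(scores, 90)
    return [doc for doc, score in zip(corpus, scores) if score >= cutoff]

if __name__ == "__main__":
    # "quality_train.txt" is a hypothetical file of labeled examples, one per line.
    quality_model = fasttext.train_supervised(input="quality_train.txt")
    # filtered = keep_top_decile(quality_model, web_documents)
```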
The general principle, confirmed across multiple independent research efforts, is that data quality and data quantity interact in a non-linear way. Past a certain threshold of total tokens, the marginal value of adding more tokens depends almost entirely on the quality of those tokens. Projects like FineWeb-Edu and DCLM chose to drop the lowest-quality 90% of data to maximize average quality. The art — and it remains more art than science — is maintaining domain diversity while enforcing quality standards, so the model sees enough variation to generalize without being polluted by noise.
High-quality data also shifts the optimal compute allocation: with cleaner signal, a larger fraction of the training budget should go toward model capacity rather than additional token volume. This creates a coupling between data curation and architectural decisions that makes the two inseparable. When a lab announces a new model's parameter count and token count, the number that typically goes unstated — the quality distribution of those tokens — may have as much influence on final benchmark performance as either headline figure.
The move toward overtrained smaller models both requires and rewards aggressive data curation. If you're going to train an 8-billion-parameter model on 15 trillion tokens — a ratio nearly two orders of magnitude past Chinchilla-optimal — those 15 trillion tokens need to be worth training on. Low-quality training signal at massive scale doesn't just plateau; it actively harms the model in ways that compound across training. Duplicated documents appearing thousands of times in the training corpus cause the model to memorize patterns that hurt generalization, inflating apparent confidence without actual capability. Dataset diversity is equally critical: models trained longer on large data mixtures benefit from exposure to more concepts and contexts, but vulnerability to overfitting and bias makes high-quality curation non-negotiable.
The Inference Economics of the Frontier
The full arc from GPT-3 to LLaMA-3 to today's frontier models — GPT-5, Claude 4, Gemini 2.5, DeepSeek R2, Qwen 3 — makes sense only when read through the lens of inference economics, not training economics. These are models whose architecture, size, and training data volume have been selected with explicit attention to what they cost to run at scale, not just what they cost to build.
The Chinchilla paper's deepest contribution was not the 20-token-per-parameter rule. That rule was immediately superseded by the industry's move toward overtraining. The deeper contribution was making explicit that training cost and inference cost are separable objectives — that the optimal allocation between model capacity and training data depends on what you intend to do with the model after training, and that the field had been conflating the two. After training, large language models are put into production and served to users. Minimizing inference costs generally favors a smaller model trained for longer over a larger model trained to the Chinchilla optimum. The parametric scaling law helps determine the optimal tradeoff between model size and training duration, given a budget for total inference and training costs.
This reframing has direct strategic implications for how frontier models are evaluated and how deployment decisions are made. A model's inference cost per token is determined primarily by its active parameter count at inference time. For dense models, that's the full parameter count. For mixture-of-experts architectures — like Mixtral, or variants appearing in GPT-5 and Gemini 2.5 Ultra — it's the number of parameters in the active expert pathways per forward pass, typically a small fraction of total parameters. Both architectural approaches attempt to decouple capability (which scales with total parameters and training data) from inference cost (which scales with active parameters per token). Mixtral 8x7B, for instance, routes each token through 2 of its 8 expert feed-forward layers per forward pass, meaning that although the model has roughly 46 billion total parameters, each token activates only around 13 billion — achieving inference costs closer to a 13B dense model while retaining the representational capacity of a much larger one.
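The arithmetic behind those Mixtral figures is worth seeing once. The sketch below uses configuration values roughly matching the published Mixtral 8x7B architecture (treat them as illustrative rather than authoritative): shared parameters, attention and embeddings, are counted once, while expert parameters are counted either in full (total) or only for the two experts each token is routed to per layer (active).

```python
# Rough parameter accounting for a Mixtral-8x7B-style mixture-of-experts model.
D_MODEL, N_LAYERS, FFN_DIM, VOCAB = 4096, 32, 14336, 32000
KV_DIM = 1024                      # grouped-query attention: 8 kv heads of dim 128
N_EXPERTS, EXPERTS_PER_TOKEN = 8, 2

expert_params_per_layer = 3 * D_MODEL * FFN_DIM                       # SwiGLU FFN: 3 matrices
attn_params_per_layer = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * KV_DIM  # q/o plus k/v projections
shared = N_LAYERS * attn_params_per_layer + 2 * VOCAB * D_MODEL       # attention + embeddings

total = shared + N_LAYERS * N_EXPERTS * expert_params_per_layer
active = shared + N_LAYERS * EXPERTS_PER_TOKEN * expert_params_per_layer
print(f"total ~{total / 1e9:.1f}B params, active per token ~{active / 1e9:.1f}B")
```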
The LLaMA-3 8B model, running at billions of tokens per day across thousands of enterprise deployments, generates more total inference cost reduction from its smaller footprint than any equivalent training savings would have produced. The training investment was made once. The inference dividend is paid continuously.
Benchmark comparisons and parameter counts — the two numbers most commonly used to assess model quality — are necessary but radically insufficient. A model's inference cost profile depends on its size, architecture, quantization options, and increasingly on the hardware stack it's been optimized for. Two models with identical benchmark performance may have inference costs that differ by a factor of five or ten. That difference is invisible on any standard leaderboard. It only becomes visible when an organization is running queries at production volume.
The question of which model to use is properly a function of two independent variables: how much does this model cost to run per token at our expected query volume, and does the quality of outputs at that volume meet our threshold? For low-volume, high-stakes tasks, a frontier dense model at high per-token cost may be correct. For high-volume, routine-classification tasks, an aggressively overtrained 8B model at a fraction of the per-token cost may deliver equivalent task quality at dramatically lower total cost. Neither choice is uniformly correct. They are made correctly only by organizations that have separated training quality from inference economics in their evaluation frameworks.
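A toy cost comparison makes the volume dependence concrete. The per-million-token prices below are invented for illustration, not quotes for any real model; the point is that a price ratio invisible on a quality leaderboard dominates the annual bill at production volume.

```python
# Hypothetical per-token prices, in dollars per million tokens (illustrative only).
PRICE_PER_MTOK = {"frontier-dense": 10.00, "overtrained-8b": 0.20}

def annual_cost(model: str, tokens_per_day: float) -> float:
    """Yearly serving cost at a steady daily token volume."""
    return PRICE_PER_MTOK[model] / 1e6 * tokens_per_day * 365

for volume in (1e6, 1e9):   # tokens per day
    for model in PRICE_PER_MTOK:
        print(f"{model:>15} @ {volume:.0e} tok/day: ${annual_cost(model, volume):,.0f}/yr")
```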
The Question You Are Now Equipped to Ask
The laboratories building today's frontier models — Anthropic with Claude 4, Google with Gemini 2.5, OpenAI with GPT-5, Meta with LLaMA-3 and its successors — are not primarily competing on training cost. Training cost is a one-time expenditure the largest organizations can absorb. They are competing on inference efficiency: the quality of output per token of inference compute, at the scale of millions of users generating billions of tokens daily. Every architectural decision, every data curation choice, every decision about when to stop training a particular model size and start a new run at a different size, is being made against that objective function.
Chinchilla revealed that the field had been asking the wrong question for years. "How large should the model be?" dominated from 2018 through 2022. The question the industry is now organized around is: "What inference cost per token are we designing for, and what is the minimum model size that achieves target quality at that cost?" The answer determines the training data volume, which determines the data curation pipeline, which determines the entire infrastructure and investment pattern surrounding the training run.
Training smaller models for longer makes them high-performing and cheap to run at inference time. That finding, derived from the trajectory of the LLaMA series, is now the operating thesis of the entire open-weight ecosystem and an increasing fraction of the proprietary one. The labs that internalized this earliest — and built data pipelines and training infrastructure capable of executing on it — hold a structural advantage that is not about raw compute. It is about how intelligently that compute is allocated, and against which objective.
The organization still selecting AI systems primarily on training-time benchmarks and headline parameter counts in 2026 is making the same category error that GPT-3 represented: optimizing for the wrong variable, in the wrong direction, with high confidence. Chinchilla didn't just correct a training recipe. It revealed the question that should have been at the center all along — and the field's subsequent four years of deliberate overtraining have been a massive empirical proof that the answer is almost never "build a bigger model."