Module 2, Episode 1: The Training Loop — Loss, Gradients, and Why Adam Replaced SGD


A model learns by repeatedly measuring how wrong it is and adjusting in the direction that reduces that error — and the choice of how you measure "wrong" and how you adjust determines almost everything about what the model learns to do. This is a literal description of the mechanism, not a metaphor or a simplification. Every consequential decision in model development — from architecture to deployment behavior — flows from choices made inside this loop. Understanding the loop in precise mechanical terms is the foundation on which every meaningful question about model capability, failure mode, and alignment rests. For anyone who oversees AI systems, it is prerequisite knowledge.

This chapter works through that loop in full: what happens at each step, why the mathematics of each step were designed the way they were, where the design choices introduce irreducible tradeoffs, and what those tradeoffs mean for the systems being deployed in 2025 and 2026. The story begins with a simple sequence of operations and ends with a challenge that no optimizer has yet solved.


The Four-Step Loop: Forward, Loss, Backward, Update

Strip away every complexity and the training process is a loop that executes the same four operations billions of times. Call it a training step. In each step, the model makes a prediction, the prediction is compared to the correct answer, that comparison produces a signal, and the model's parameters are adjusted in response to that signal. Four steps. Repeated until the model is either good enough or you run out of compute budget.

The first operation, the forward pass, takes a batch of input data and runs it through the model's current parameters to produce an output. For a language model, this means taking a sequence of tokens — say, the first forty words of a Wikipedia paragraph — and producing a probability distribution over the entire vocabulary for what token should come next. The model is not retrieving the answer; it is computing it fresh, from the current state of all its weights, on every single training step. A model with 70 billion parameters applies those parameters to every input in every batch throughout the training run. Training is expensive not because any single forward pass is complicated, but because the number of token-level predictions required to move a model from random initialization to usable capability is measured in the trillions.

The second operation is loss computation. Once the model has produced its prediction, you compare it to the ground truth and quantify the discrepancy with a scalar number — the loss. The loss answers the question "how wrong is the model right now, in a form that calculus can use?" The choice of loss function is among the most consequential modeling decisions made before training begins, because it determines what the model is optimizing for. The field converged on cross-entropy loss for language modeling, and the reasons for that convergence are worth understanding precisely.

Cross-entropy measures how surprising the true token is under the model's predicted distribution; computed with the natural logarithm, as deep learning frameworks do it, the units are nats. Minimizing cross-entropy means the model is learning to assign high probability to the tokens that appear in the training corpus. For a vocabulary of 100,000 tokens, if the model's distribution is uniform — a completely random guess — the loss is the natural log of 100,000, roughly 11.5 nats. A well-trained model on fluent English text will push that value well below 2 nats per token. The distance between those two numbers represents the aggregate learning that occurred across the entire training run.
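Those two endpoints are easy to verify directly. A minimal sketch using only the Python standard library (the 30% probability assigned to the true token is an illustrative assumption, not a measured value):

```python
import math

VOCAB_SIZE = 100_000

# Cross-entropy at a single position is -log p(true token).
# A uniform distribution assigns 1/VOCAB_SIZE to every token:
uniform_loss = -math.log(1 / VOCAB_SIZE)
print(f"uniform guess: {uniform_loss:.2f} nats")           # ~11.51

# A well-trained model might assign, say, 30% to the true token:
confident_loss = -math.log(0.30)
print(f"p=0.30 on true token: {confident_loss:.2f} nats")  # ~1.20
```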

Next-token prediction trains a large language model (LLM) by turning ordinary text into a supervised learning problem where the label for each position is simply the token that comes next. This is why LLM pretraining is often called self-supervised: the targets come directly from the text itself rather than from manual annotations. Every sentence ever written is simultaneously an input and a set of labeled examples, without any human annotation work. The entire training corpus of a frontier model — the trillions of tokens drawn from web crawls, books, code repositories, and scientific papers — serves as its own ground truth. The sheer scale of that implicit label set is what makes the next-token objective capable of producing general-purpose capabilities.
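The labels-for-free structure is easiest to see in code. A toy illustration, assuming a whitespace tokenizer for readability (real pipelines use subword tokenizers such as BPE):

```python
# Every prefix of a sentence is an input; the next token is its label.
text = "the cat sat on the mat"
tokens = text.split()  # toy tokenizer; real systems use subword vocabularies

pairs = [(tokens[:i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
for context, label in pairs:
    print(context, "->", label)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... one training example per position, no human annotation required
```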

Cross-entropy is not the only possible loss function, and the choice matters. Mean squared error (MSE) measures the average squared distance between prediction and target — a natural fit for regression problems where outputs are continuous values, such as predicting housing prices or sensor readings. MSE penalizes large errors quadratically, making it sensitive to outliers; a single large prediction error contributes disproportionately to the total loss. For classification tasks with discrete outputs, MSE is a poor fit: the geometry of probability distributions over categories doesn't align with the geometry of Euclidean distance. Cross-entropy is information-theoretically grounded — directly connected to maximum likelihood estimation of the underlying data distribution — which is why it dominates classification and language modeling. The point is not that cross-entropy is always correct but that loss function selection is a modeling decision about what constitutes success, and different choices produce models with different learned priorities.
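A quick numeric illustration of that quadratic penalty, with made-up values:

```python
# One large error dominates MSE because errors are squared before averaging.
preds   = [2.0, 3.1, 2.9, 10.0]   # the last prediction is a large outlier
targets = [2.1, 3.0, 3.0, 3.0]

squared_errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
mse = sum(squared_errors) / len(squared_errors)
print(squared_errors)  # roughly [0.01, 0.01, 0.01, 49.0]
print(mse)             # ~12.26: the single outlier contributes ~99% of the loss
```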

The third operation is the backward pass. Once you have a scalar loss value, you need to know how each of the model's parameters contributed to that loss — specifically, how changing each parameter would change the loss. This is what backpropagation computes. Using the chain rule of calculus, it propagates the error signal backwards through every layer of the network, computing the gradient of the loss with respect to each parameter. The gradient is a vector — one value per parameter — that points in the direction of steepest increase in the loss. You want to decrease the loss, so you move in the opposite direction.

The fourth operation is the parameter update: take the current parameter values, subtract some fraction of the gradient, and write the result back as the new parameter values. The fraction is controlled by the learning rate, a scalar that determines how large a step you take on each update. Too large, and you overshoot minima, oscillating or diverging. Too small, and training becomes impractically slow. This four-step loop — forward, loss, backward, update — is the entire mechanism. Everything else is engineering built on top of it.
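The loop is compact enough to show in full. A minimal sketch in PyTorch, with a toy linear model and random data standing in for a real network and corpus (every size and hyperparameter below is an illustrative placeholder):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 1000)    # toy stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(100):
    inputs = torch.randn(32, 512)                # a batch of 32 examples
    targets = torch.randint(0, 1000, (32,))      # the "correct next tokens"

    logits = model(inputs)                       # 1. forward pass
    loss = F.cross_entropy(logits, targets)      # 2. loss computation

    optimizer.zero_grad()                        # clear the previous gradients
    loss.backward()                              # 3. backward pass
    optimizer.step()                             # 4. parameter update
```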


The Geometry of Learning: Why High-Dimensional Loss Landscapes Are Navigable

Before examining why specific optimizers work or fail, it helps to develop the geometric intuition for what optimization means for a model with billions of parameters.

The loss landscape is a mathematical surface. Each point on the surface corresponds to a specific configuration of all the model's weights — a specific value for every one of, say, 70 billion parameters — and the height of that point is the loss that configuration produces on the training data. Training is a search through this surface for low points. In two dimensions, this is easy to visualize: a bowl-shaped valley has one minimum, and gradient descent rolls downhill to find it. Real neural networks are nothing like this.

A 70-billion-parameter model lives in a 70-billion-dimensional space. The loss landscape has 70 billion axes, and intuitions from low-dimensional geometry systematically mislead. The most important of these misleading intuitions involves saddle points. In two dimensions, a saddle point — a point where the gradient is zero but you're at neither a maximum nor a minimum — is a rare special case. In high dimensions, saddle points are ubiquitous. Almost every point where the gradient approaches zero has some directions of increasing loss and some directions of decreasing loss. This was a serious theoretical concern for early deep learning: the gradient approaches zero, the optimizer stalls, training fails.

The empirical resolution to this concern is surprising. In practice, the high dimensionality that creates abundant saddle points also makes bad ones hard to get stuck in: to be trapped at a saddle point while the loss is still high, you would need almost all directions to curve upward simultaneously, a geometrically improbable configuration. The saddle points that actually slow training tend to sit in regions of already-low loss, where lingering near them causes minimal harm to final performance. For years, theoretical understanding of why this holds lagged behind the empirical practice that relies on it.

Local minima pose a related concern. In low dimensions, gradient descent can get trapped in a local minimum — a valley that is not the globally lowest point. In high-dimensional spaces, true local minima substantially worse than the global optimum are rare for the same structural reasons. What you more commonly encounter are flat regions, sometimes called loss plateaus, where the gradient is small but nonzero. These are navigable with a capable optimizer; they are just slow. The geometry of high-dimensional loss landscapes is more benign than naive intuition suggests, which is part of why gradient descent-based methods work as well as they do on neural networks despite having no theoretical guarantee of finding the global optimum.

The Chinchilla paper — Hoffmann et al.'s 2022 work on compute-optimal training — revealed something structurally important about this landscape at scale. The paper investigated the optimal model size and number of training tokens for a transformer language model under a given compute budget and found that large language models prior to 2022 were significantly undertrained, a consequence of scaling model size while holding training data roughly constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, Hoffmann et al. found that for compute-optimal training, model size and training tokens should scale equally: for every doubling of model size, the number of training tokens should also double. The loss landscape is not fixed — it changes as you scale data and parameters together. More training tokens don't just push you further down the same slope; they change the topology of the surface you're navigating.
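The equal-scaling result is often summarized as a tokens-per-parameter ratio. A back-of-the-envelope sketch, assuming the commonly quoted figure of roughly 20 tokens per parameter implied by the paper's fits (the exact ratio depends on the estimation method, so treat this as a rule of thumb rather than a prescription):

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla-style heuristic, not an exact constant

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a given model size."""
    return TOKENS_PER_PARAM * n_params

for n in (1e9, 10e9, 70e9):
    print(f"{n / 1e9:>4.0f}B params -> ~{compute_optimal_tokens(n) / 1e12:.2f}T tokens")
# 1B -> ~0.02T, 10B -> ~0.20T, 70B -> ~1.40T
```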


SGD and Its Failure Modes: What the Classic Optimizer Gets Wrong

Stochastic Gradient Descent (SGD) is the progenitor of all modern optimizers. Its logic is simple: compute the gradient of the loss on a small random batch of training examples, take a step proportional to that gradient, repeat. The "stochastic" part distinguishes it from pure gradient descent on the full dataset — instead of computing the exact gradient across all training data on each step, you estimate it from a small sample. This approximation introduces noise that helps the optimizer escape sharp local minima. But SGD has several structural problems that become acute at the scale of modern language model training.

The first is the noise problem. Because each gradient update is estimated from a small batch, it's a noisy estimate of the true gradient direction. Some noise is useful. But when the loss landscape has regions of very different curvature — steep in some directions, flat in others — the noise can dominate the signal in the flat directions, causing slow, erratic progress. You can reduce this by increasing the batch size, averaging the gradient over more examples, but large-batch training introduces its own instabilities and requires careful tuning of the learning rate.

The second problem is uniform learning rates. Standard SGD applies the same learning rate to every parameter in the model. In a transformer with 70 billion parameters, some parameters receive dense gradient signals on every training step — the embedding weights, for instance, are touched by virtually every token in every batch. Other parameters receive sparse signals, activated only when particular patterns appear in the input. Applying the same learning rate to both means that densely-updated parameters oscillate while sparsely-updated parameters barely move — or that sparsely-updated parameters take reasonable steps while densely-updated parameters overshoot. Neither outcome is acceptable.

The third problem is sensitivity to the learning rate schedule. SGD requires the practitioner to specify not just a learning rate but a decay schedule: how much to reduce the learning rate over the course of training. Getting this wrong is catastrophic. Start with too high a learning rate and the model diverges immediately. Reduce it too aggressively early and the model converges to a suboptimal configuration it can never escape. At frontier-model training costs, the extensive tuning runs needed to get this right mean millions of dollars spent just learning what learning rate to use.

Momentum-based SGD addresses part of this by accumulating a velocity vector: instead of stepping in the direction of the current gradient, you step in a weighted combination of the current gradient and the historical direction of previous steps. This smooths out noise and accelerates convergence in consistent directions. SGD with momentum is still used for computer vision tasks — ResNets, ConvNets, and similar architectures often train well with it given careful tuning. But for language models with their sparse activations, heterogeneous parameter update frequencies, and sensitivity to early training dynamics, SGD with momentum proved insufficient. The field needed something structurally different.
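The velocity accumulation is compact enough to write down directly. A sketch in PyTorch of the common heavy-ball form (framework implementations differ slightly in where the learning rate enters):

```python
import torch

def sgd_momentum_step(params, velocities, lr=0.01, mu=0.9):
    """One SGD-with-momentum update across a list of parameter tensors."""
    with torch.no_grad():
        for p, v in zip(params, velocities):
            v.mul_(mu).add_(p.grad)   # v <- mu * v + gradient (smoothed direction)
            p.sub_(lr * v)            # step along the accumulated velocity

# Velocity buffers start at zero, one per parameter:
#   velocities = [torch.zeros_like(p) for p in model.parameters()]
```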


Adam: Adaptive Learning Rates and Why They Changed Everything

Adam (Adaptive Moment Estimation), introduced by Diederik Kingma and Jimmy Ba in 2014, is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. It is computationally efficient, has modest memory requirements, is invariant to diagonal rescaling of the gradients, and handles problems that are large in terms of data or parameters. It is also appropriate for non-stationary objectives and problems with noisy or sparse gradients.

The key word is "adaptive." Adam maintains two running statistics for each parameter: the first moment, which is the exponentially-weighted moving average of the gradient (essentially, what direction has this parameter been moving?), and the second moment, which is the exponentially-weighted moving average of the squared gradient (essentially, how large have this parameter's gradients been?). The actual update for each parameter is the first moment divided by the square root of the second moment — the recent average gradient direction, scaled by how volatile that gradient has been.

The consequence is per-parameter adaptive learning rates. A parameter that receives large, consistent gradients gets a relatively small effective learning rate — the signal is strong and reliable, so large steps aren't needed. A parameter that receives small, infrequent gradients gets a relatively large effective learning rate — it needs larger steps to make meaningful progress when its gradient signal finally arrives. Embedding lookup layers and other modules that activate infrequently can learn efficiently within the same optimizer and the same training run as modules that activate constantly. No separate tuning required.

The two exponential decay hyperparameters, conventionally called beta-1 and beta-2, control how much historical gradient information is retained. The defaults — 0.9 for beta-1 and 0.999 for beta-2 — are remarkably stable across different architectures and datasets. This stability is one reason Adam dominated: it transferred across problem domains without the per-task tuning that SGD demanded. A practitioner training a vision model, a language model, or a generative adversarial network could start with the same Adam configuration and have a reasonable chance of sensible training dynamics on the first attempt.

Adam also includes bias correction terms in its first and second moment estimates. In the early steps of training, both moment estimates are initialized at zero and are biased toward zero. Without correction, the early updates would be systematically too small — the running averages haven't had enough steps to reflect the true magnitude of recent gradients. Adam corrects for this by dividing each moment estimate by a factor that accounts for the number of steps elapsed, ensuring that early training proceeds at the correct effective learning rate rather than an artificially depressed one. For large models, this has real effects on training stability in the first few thousand steps, which can set the trajectory for the entire run.
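Putting the last few paragraphs together, here is a sketch of the Adam update for a single parameter tensor, bias correction included. Hyperparameter defaults follow the paper; the exact placement of epsilon varies slightly across implementations:

```python
import torch

def adam_step(p, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are per-parameter moment buffers
    (initialized to zeros); t is the step count, starting at 1."""
    g = p.grad
    m.mul_(beta1).add_((1 - beta1) * g)       # first moment: average direction
    v.mul_(beta2).add_((1 - beta2) * g * g)   # second moment: average magnitude
    m_hat = m / (1 - beta1 ** t)              # bias correction: undo the
    v_hat = v / (1 - beta2 ** t)              #   zero-initialization bias
    with torch.no_grad():
        p.sub_(lr * m_hat / (v_hat.sqrt() + eps))  # per-parameter adaptive step
```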

The paper is among the most cited in machine learning not because it introduced an optimizer but because it changed who could successfully train deep networks.


AdamW: The Fix That Matters for LLMs

Adam in its original form has a subtle but consequential flaw in how it interacts with regularization. Understanding the flaw requires understanding why regularization matters.

Overfitting occurs when a model learns to perform well on training data but fails on new data it hasn't seen before — the model has memorized the training set rather than learning generalizable patterns. Weight decay is a standard technique for combating this: you add a penalty to the loss proportional to the magnitude of the model's weights, encouraging the optimizer to keep weights small and thus discouraging over-precise fitting of the training data. The larger the weights, the more the penalty pushes them back toward zero.

Here is the problem. L₂ regularization and weight decay regularization are equivalent for standard SGD, but as Loshchilov and Hutter demonstrated, this equivalence breaks down for adaptive gradient algorithms such as Adam. When you implement weight decay in Adam by adding an L₂ regularization term to the loss, the weight decay gradient gets scaled by Adam's second moment estimate — exactly as every other gradient does. Weights that tend to have large gradients in the loss function do not get regularized as much as they would with decoupled weight decay, because the gradient of the regularizer gets scaled along with the gradient of the loss function.

The regularization is not applied uniformly across parameters. Parameters with large gradients — often the most important, most heavily-used parameters in the network — receive proportionally less regularization than parameters with small gradients. The intended regularization effect is distorted in direct proportion to the adaptivity that makes Adam useful. The two mechanisms work against each other.

AdamW, introduced by Ilya Loshchilov and Frank Hutter in their 2017 paper "Decoupled Weight Decay Regularization," fixes this with a straightforward modification: apply weight decay directly to the parameters after the gradient update, not as a term in the loss. Weight decay shrinks the weights by a fixed fraction at each step, independent of the gradient magnitude. It is now orthogonal to the gradient update rather than entangled with it.
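Relative to the Adam sketch above, the fix is essentially one extra line. A hedged sketch (PyTorch's own AdamW applies the decay multiplicatively before the gradient step; to first order the effect is the same):

```python
import torch

def adamw_step(p, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update: weight decay never passes through the moment buffers."""
    g = p.grad                                 # gradient of the loss ONLY,
    m.mul_(beta1).add_((1 - beta1) * g)        #   with no weight_decay * p term
    v.mul_(beta2).add_((1 - beta2) * g * g)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    with torch.no_grad():
        p.sub_(lr * m_hat / (v_hat.sqrt() + eps))  # adaptive gradient step
        p.sub_(lr * weight_decay * p)              # decoupled decay: the same
                                                   #   fraction for every parameter
```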

This decoupled modification substantially improves generalization performance, allowing AdamW to compete with SGD with momentum on image classification tasks where vanilla Adam had previously been outperformed. AdamW became the optimizer of choice for transformer training precisely because transformers have exactly the structure where this problem bites hardest: attention weights and embedding matrices receive enormous, dense gradient signals, while many feedforward components receive more modest signals. Vanilla Adam systematically under-regularizes the most important components of a transformer relative to intent. AdamW restores the intended regularization semantics.

Every major frontier model — GPT-4, the models trained at Anthropic, DeepSeek's R-series, Qwen 3 — uses AdamW or a variant of it. Both pre-training and mid-training phases employ AdamW with weight decay and gradient clipping as standard parts of contemporary training pipelines. The choice has become so normalized it rarely appears as a reported hyperparameter in technical papers. That normalization is the surest sign that the field has converged on something genuinely better than what came before.


Learning Rate Schedules and the Discipline of Warmup

Having chosen an optimizer, you must still specify how the learning rate changes over the course of training. A fixed learning rate is almost never optimal: it may be appropriate for the middle of training but wrong for both the beginning and the end.

The beginning of training requires special treatment because the model's parameters are initialized randomly. On the very first forward pass, the gradients computed by backpropagation reflect the loss of a random model on real data — a chaotic, high-magnitude signal that does not yet correspond to any meaningful structure in the input distribution. Applying a large learning rate at this moment causes the optimizer to take large, essentially random steps, potentially moving parameters into regions of the loss landscape that are difficult to recover from. This is the rationale for learning rate warmup: beginning training with a very small learning rate — sometimes orders of magnitude smaller than the target rate — and gradually increasing it over the first few thousand to tens of thousands of steps.

Warmup gives the model time to develop some initial structure in its representations before the optimizer starts taking aggressive steps. By the time the learning rate reaches its target value, the gradients are less noisy because the model has already organized itself enough to produce meaningful predictions. Training runs without warmup frequently diverge in the early steps or settle into trajectories that produce worse final performance, even if they appear to recover numerically from the initial instability.

After reaching the peak learning rate, the schedule must manage the decay toward the end of training. Cosine annealing has become the standard approach: the learning rate follows a smooth cosine curve from its peak value down toward zero as training concludes. Cosine annealing is preferred over step-wise decay — dropping the learning rate by a fixed factor at predetermined milestones — because it's smooth. Step decays cause visible discontinuities in the loss curve that can destabilize training if the step occurs at an inopportune moment. Cosine annealing provides a principled signal to the optimizer that the training run is concluding: take smaller and smaller steps as you converge, don't introduce large perturbations near the end.
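A minimal sketch of a warmup-then-cosine schedule; the step counts and peak rate below are illustrative placeholders, not values from any particular training run:

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps             # linear ramp from ~0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))  # smooth decay to 0
```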

The interaction between the cosine cycle length and the total number of training steps is non-trivial. Setting the cosine cycle length much longer than the actual number of training steps produces suboptimally trained models; an optimally trained model has the cosine cycle length calibrated to the maximum number of steps the compute budget allows. If you set the cycle length to twice the actual training duration, the model spends its final steps with a learning rate still halfway through its decay — not converged, not stable. The Chinchilla replication work found this produces measurably worse models, and it helps explain why intermediate checkpoints of a long-schedule run understate what a model trained with a correctly calibrated schedule could achieve at the same step count.

Batch size interacts with all of this in ways that aren't always intuitive. A larger batch size produces a more accurate gradient estimate on each step, which means you need fewer steps to traverse the same distance in parameter space — but each step costs proportionally more compute. The relationship between batch size and learning rate is empirically approximately linear: doubling the batch size allows roughly doubling the learning rate while maintaining comparable training dynamics. This is not a theorem, and it breaks down at very large batch sizes, but it is a useful working approximation. Frontier model training runs use enormous batches — often millions of tokens per step — not primarily for statistical reasons but because modern hardware (H100 and H200 GPU clusters, TPU pods) requires large batches to achieve high utilization. The optimization mathematics must be calibrated around that hardware reality.
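The linear scaling heuristic itself fits in a few lines. A sketch, with an illustrative reference batch size and learning rate:

```python
def scaled_lr(batch_size, ref_batch=512, ref_lr=3e-4):
    """Linear scaling rule: learning rate grows in proportion to batch size.
    An empirical heuristic that degrades at very large batch sizes."""
    return ref_lr * (batch_size / ref_batch)

print(scaled_lr(1024))  # 6e-4: double the batch, double the learning rate
```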


Overfitting, Underfitting, and the Epistemic Discipline of Train/Val/Test

Optimization metrics and deployment metrics are not the same thing. A model can minimize training loss — achieve its optimization objective fully — while simultaneously failing to do what you want it to do. The train/validation/test split is the formal structure that makes this distinction tractable.

Underfitting occurs when the model is insufficiently expressive to capture the patterns in the data, or when training is stopped too early. The loss on training data remains high and the model's predictions are simply inaccurate. For large language models, this is the less common failure mode — a 70-billion-parameter transformer applied to text generation has more than enough capacity. Overfitting is the more consequential concern.

Overfitting for language models manifests differently than in classical machine learning. A GPT-style model trained until memorization of its training corpus is unlikely to be deployed in that state — training runs are far too expensive to overextend on purpose. But subtler forms of distributional misfitting are common. A model trained heavily on web text will perform well on web-like inputs and poorly on legal documents or scientific prose that have different stylistic signatures. A model trained on data scraped in 2024 will have miscalibrated beliefs about 2025 events. These are both forms of train-test distribution shift, a generalization of the overfitting problem.

The validation set serves as an online measurement of generalization during training. By periodically computing loss on held-out data the optimizer has never touched, you get an estimate of whether improvements in training loss are translating to genuine capability improvements or merely to better memorization. When validation loss starts increasing while training loss continues decreasing — the classic overfitting signature — the model is fitting noise rather than signal. In LLM practice, this inflection point is rarely reached because modern training runs are generally compute-limited rather than data-limited; Chinchilla's finding that most pre-2022 models were significantly undertrained means the typical failure mode is insufficient training, not overfitting.

The test set is reserved for a single evaluation after all training decisions — including hyperparameter choices guided by validation performance — have been finalized. The moment you make any decision based on test set performance, the test set is contaminated: your choices have been influenced by that data, and its estimate of generalization is biased. The discipline of maintaining a clean test set is an epistemic commitment to honest measurement. For benchmark evaluations of frontier models, contamination is a persistent concern: MMLU (Massive Multitask Language Understanding, a standard AI benchmark), GPQA (Graduate-Level Google-Proof Q&A, a benchmark of expert-level reasoning), SWE-bench (a benchmark of real-world software engineering tasks), ARC-AGI (the Abstraction and Reasoning Corpus, a benchmark designed to test novel reasoning) — all of these exist in text that models may have encountered during pretraining, and performance on them may partially reflect memorization rather than generalization. This is an active methodological challenge with no clean resolution.


What the Model Optimizes For Is Not What You Care About

The training loop optimizes a loss function. That loss function is a proxy for what you want — it is never exactly what you want. The gap between proxy and goal is where the deepest problems in AI development live.

Next-token prediction cross-entropy is a powerful proxy for general language competence. Over many batches and many documents, the model repeatedly learns which continuations are more plausible than others given the preceding context. It gradually captures syntax, semantics, style, factual associations, and longer-range patterns in language. The breadth of what a model learns through next-token prediction is remarkable — grammar, world knowledge, reasoning patterns — because language encodes so much of human knowledge and inference.

But cross-entropy on training text is not truthfulness. If the training corpus contains many confident declarative statements and few expressions of uncertainty, the model may learn to be overconfident in its outputs regardless of the actual uncertainty of a given prediction. This is particularly problematic for factual questions: a model may output false facts with the same high confidence as true facts, because the training objective never required distinguishing confident knowledge from uncertain speculation. Techniques like temperature scaling and label smoothing address calibration to some degree, but the fundamental disconnect between cross-entropy optimization and calibrated uncertainty remains a challenge.

Cross-entropy is not helpfulness. A model that predicts what text continuation is statistically most likely given a human query may produce something very different from the most useful response to that query. Harmlessness is not guaranteed either: the most likely continuation for certain prompts, given the distribution of internet text, is content no one would want a deployed system to produce. The entire field of alignment — reinforcement learning from human feedback (RLHF), Direct Preference Optimization (DPO), Constitutional AI, Group Relative Policy Optimization (GRPO) — exists because the transition from "minimizes next-token prediction loss" to "behaves the way we want it to" requires additional training stages with additional objectives, and those additional objectives are themselves proxies with their own gaps.

InstructGPT, the 2022 OpenAI paper that demonstrated RLHF at scale, showed that a model trained to maximize human preference ratings on its outputs behaved measurably better on intended tasks than a base model with lower cross-entropy loss. The same phenomenon appears in modern training pipelines: Claude 4 is not simply the largest language model that minimizes next-token prediction loss on the training corpus. It is a model that has been subjected to multiple post-pretraining alignment stages designed to bridge the gap between what next-token prediction produces and what human users find useful, accurate, and safe.

The loss function is where alignment begins and fails. Every decision about what to measure — cross-entropy versus human preferences versus constitutional adherence — is a decision about what the model will become. The optimizer will find the minimum of whatever surface you construct. Your job, if you are responsible for deploying or overseeing these systems, is to ensure that the minimum of your measured surface is close to the thing you care about.

That gap is never zero.

It may be small enough to ignore, or it may be large enough to matter enormously — and the only way to know which is to look at model behavior in deployment, not at training metrics. Understanding the training loop in mechanical detail means understanding exactly where that imprecision enters the system, and what it would take to reduce it. Gradient descent finds the minimum you describe, not the minimum you intend.


Next: Episode 2 — Pretraining at Scale: Data Pipelines, Compute Tradeoffs, and the Chinchilla Regime