LoRA and QLoRA: How Fine-Tuning Works in 2025
The Wall That Was
There is a version of the fine-tuning story that circulates comfortably in AI briefings: pretraining is expensive and inaccessible, fine-tuning democratizes the technology, and parameter-efficient methods like LoRA (Low-Rank Adaptation) make it cheap enough for anyone. That story is not wrong, exactly. But it flatters itself by stopping before the harder question arrives.
Start with the wall it was describing. Full fine-tuning, updating every weight in a model, is, for frontier-scale models, categorically out of reach for most organizations. When you train a neural network, you need to store not just the model weights but the optimizer states and gradients for every parameter. For a model trained with Adam (the canonical optimizer), that means roughly eight bytes of optimizer state per parameter in full precision: two moment estimates at four bytes each. At 70 billion parameters, that obligation exceeds 560 gigabytes before you have processed a single training example. The H100s required to hold that in memory simultaneously, parallelized across enough nodes to make the job tractable within a reasonable time window, represent a hardware budget that only a handful of organizations in the world routinely deploy. The GPT-3 era made this concrete: deploying independent instances of fine-tuned models, each with 175 billion parameters, is prohibitively expensive.
The bookkeeping is the cruelest part of this constraint, because it means you cannot simply load the inference weights and train on top: every trainable parameter drags full-size companion buffers into memory, one for its gradient and two for its Adam moment estimates. This is the situation that made the question "can we fine-tune this model?" feel genuinely threatening in 2022. LoRA, published by Hu et al. in 2021, systematically dismantled it.
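To make the arithmetic inspectable, here is the budget as a back-of-the-envelope sketch. The byte counts are the conventional full-precision figures; activations, gradient checkpointing, and framework overhead are deliberately ignored, so real jobs need more than this floor.

```python
def full_finetune_memory_gb(params_billion: float) -> dict:
    """Rough memory floor for full fine-tuning with Adam in fp32 (decimal GB).

    Per parameter: 4 bytes of weights, 4 bytes of gradients, and
    8 bytes of Adam state (two moment estimates at 4 bytes each).
    Activations and framework overhead are ignored.
    """
    n = params_billion * 1e9
    gb = 1e9
    return {
        "weights_gb": n * 4 / gb,
        "gradients_gb": n * 4 / gb,
        "optimizer_gb": n * 8 / gb,
        "total_gb": n * 16 / gb,
    }

print(full_finetune_memory_gb(70))
# optimizer state alone: 560 GB; total floor: 1,120 GB
```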
The paper's central insight was geometric. Taking inspiration from earlier work showing that learned over-parametrized models reside on a low intrinsic dimension, the authors hypothesized that the change in weights during model adaptation also has a low intrinsic rank. If the weight updates that matter during fine-tuning live in a low-dimensional subspace of the full weight matrix, you do not need to compute or store the full gradient. You only need to find that subspace and optimize within it.
The mechanism that follows from this hypothesis is elegant. LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Concretely: instead of updating a weight matrix W of dimension d×d, you parameterize the update as the product of two smaller matrices: B of dimension d×r and A of dimension r×d, where r is the rank hyperparameter and r is much smaller than d. The pretrained W is frozen. Only A and B are trained. At inference time, you can merge W + BA back into a single matrix, incurring no additional latency. Earlier adapter-based approaches inserted new layers into the forward pass and therefore slowed every inference call. LoRA's learned weights merge directly with the main weights at inference, eliminating that cost entirely.
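A minimal sketch of the mechanism in PyTorch makes the shapes concrete. This is illustrative rather than the PEFT library's actual implementation; the alpha/r scaling, the Gaussian initialization of A, and the zero initialization of B follow the paper's conventions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # freeze W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zeros
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (BA) x, computed without materializing the d x d update
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    def merge(self) -> nn.Linear:
        """Fold BA into W so the fine-tuned model pays no extra latency."""
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None)
        merged.weight.data = self.base.weight.data + (self.B @ self.A) * self.scale
        if self.base.bias is not None:
            merged.bias.data = self.base.bias.data.clone()
        return merged
```

Because B starts at zero, the adapted model is exactly the base model at step zero, and training moves it only within the rank-r subspace.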
Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. The order-of-magnitude figures here reflect genuine structural reduction in the parameter space being optimized, not marketing rounding. A rank-16 LoRA applied to the query and value projection matrices of a 70B-parameter model might involve fewer than 100 million trainable parameters. The frozen base model stays in memory; the gradient computation operates only on A and B. LoRA performs on-par or better than full fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having far fewer trainable parameters, while also delivering higher training throughput.
The rank hyperparameter r is itself revealing. In practice, ranks between 4 and 64 cover the vast majority of fine-tuning use cases. The LoRA paper's own empirical investigation found that pushing to higher ranks did not consistently improve performance, which, if you accept the low intrinsic rank hypothesis, makes sense. The information content of task adaptation is compressible into a surprisingly small basis. This has a practical implication that practitioners often discover by accident: when a LoRA run with rank 64 fails to outperform one with rank 8, the remedy is rarely more rank. The data or objective is almost always the binding constraint.
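The parameter arithmetic behind the rank-16 figure is easy to reproduce. The shapes below are hypothetical Llama-70B-like values (80 layers, hidden size 8192, adapting the query and value projections), and the calculation ignores grouped-query attention, which shrinks the value projection in real architectures.

```python
def lora_param_count(d_in: int, d_out: int, r: int,
                     n_layers: int, matrices_per_layer: int) -> int:
    """Each adapted d_out x d_in matrix adds r * (d_in + d_out) parameters."""
    return r * (d_in + d_out) * n_layers * matrices_per_layer

# 80 layers, 8192 hidden, rank 16, q and v projections adapted:
print(lora_param_count(8192, 8192, r=16, n_layers=80, matrices_per_layer=2))
# ~42 million trainable parameters against 70 billion frozen ones
```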
PEFT: The Umbrella Category and Its Inhabitants
Parameter-Efficient Fine-Tuning — universally abbreviated as PEFT — refers to the family of methods that update only a small subset of a model's parameters, or add a small number of new parameters, while leaving the majority of pretrained weights frozen. The PEFT family exists because full fine-tuning is computationally prohibitive at scale and risks catastrophic forgetting of general capabilities. What unifies PEFT methods is their objective: achieve task-specific adaptation without paying the full cost of retraining.
The landscape of PEFT methods spans a spectrum of approaches, each making different tradeoffs between compute efficiency, inference latency, expressivity, and ease of implementation. Adapter layers insert small bottleneck modules between transformer sublayers. Prefix tuning and prompt tuning prepend trainable vectors to the input sequence. LoRA injects low-rank decomposition matrices into the weight updates of attention layers. QLoRA combines LoRA with aggressive base model quantization. These are not interchangeable. Each method's architectural choice determines both its ceiling and its limitations in ways that matter for practical deployment decisions.
The Hugging Face PEFT library (Hugging Face being the leading open-source AI platform, and PEFT being their unified library for parameter-efficient methods) has become the de facto standard implementation environment for these methods, providing unified access to most of them under a consistent API. That convenience has had a side effect: the methods appear more similar than they are.
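In practice, the whole LoRA setup reduces to a few lines against that API. A minimal sketch, assuming current PEFT conventions; the model name is illustrative.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, per the paper
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% trainable
```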
QLoRA: From Expensive to Accessible
LoRA moved the question from hyperscaler to anyone with a multi-GPU node. QLoRA, published by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer in 2023, moved it again — from multi-GPU to single-GPU, and in some configurations to consumer hardware. QLoRA reduces memory usage enough to fine-tune a 65B parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. Read that carefully. The A100-80G, which costs roughly $10,000–15,000 on the secondary market and rents for several dollars per hour, can now fine-tune models that previously required coordinated clusters.
The mechanism stacks three innovations. QLoRA introduces 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights; double quantization to reduce the average memory footprint by quantizing the quantization constants themselves; and paged optimizers to manage memory spikes. Each deserves examination, because together they represent a sophisticated systems argument, not just a compression hack.
NF4 is the most theoretically interesting piece. Neural network weights, after pretraining, are approximately normally distributed around zero. Standard integer quantization schemes distribute quantization bins uniformly across the range of values — which wastes bins on the tails of the distribution where few weights live. NF4 instead uses quantile quantization derived from the normal distribution, placing bins where the weight density is highest and thereby achieving better fidelity per bit for the values that matter most. To make this concrete: a standard 4-bit integer scheme allocates its 16 representable values at equal intervals across the numerical range. If most weights cluster between -0.5 and 0.5, many of those 16 bins are wasted on ranges like [-2.0, -1.5] where virtually no weights reside. NF4 eliminates that waste by mapping the 16 representable values to the quantiles of a standard normal distribution, ensuring each bin covers an equal share of the actual weight density rather than an equal share of the numerical range.
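The effect is easy to demonstrate numerically. The sketch below is not the actual NF4 construction, which normalizes per block and fixes an exact zero code, but it shows why quantile placement beats a uniform grid on normally distributed weights:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
weights = rng.standard_normal(100_000)       # pretrained weights: ~normal

# Uniform 4-bit grid: 16 evenly spaced levels across a fixed range.
uniform_levels = np.linspace(-2.0, 2.0, 16)

# Quantile grid: 16 levels at equal-probability-mass points of N(0, 1),
# so each bin covers the same share of weight density.
quantile_levels = norm.ppf((np.arange(16) + 0.5) / 16)

def quantize(w, levels):
    """Snap each weight to its nearest representable level."""
    return levels[np.abs(w[:, None] - levels[None, :]).argmin(axis=1)]

for name, levels in [("uniform", uniform_levels), ("quantile", quantile_levels)]:
    mse = np.mean((weights - quantize(weights, levels)) ** 2)
    print(f"{name:9s} MSE: {mse:.5f}")   # quantile placement wins
```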
Double quantization addresses a subtle cost. Any quantization scheme requires storing scaling constants — the factors that map quantized values back to their original range. Those constants themselves consume memory. QLoRA quantizes those constants as well, applying a second round of quantization to achieve roughly 0.127 bits of overhead per parameter rather than the 0.5 bits that 32-bit constants per 64-parameter block would require.
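The 0.127 figure follows directly from the paper's block sizes: 64 weights per first-level quantization block, and second-level blocks of 256 constants.

```python
# Overhead of quantization constants, in bits per model parameter.
single_quant = 32 / 64                      # one fp32 constant per 64 weights
double_quant = 8 / 64 + 32 / (64 * 256)     # 8-bit constants, themselves
                                            # quantized in blocks of 256
print(single_quant)                         # 0.5
print(round(double_quant, 3))               # 0.127
```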
Paged optimizers handle memory spikes during training. Optimizer state — particularly for Adam, which maintains first and second moment estimates for every trainable parameter — can temporarily exceed available GPU memory during certain gradient accumulation patterns. QLoRA uses NVIDIA's unified memory system to page optimizer states between GPU and CPU memory, transparently handling these spikes without causing out-of-memory failures. The base model, quantized to 4-bit NF4, is dequantized to bfloat16 for each forward pass computation — all actual arithmetic happens in high precision. Only storage is at 4-bit.
The best model family produced by QLoRA, named Guanaco, outperforms all previously released open models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level while requiring only 24 hours of fine-tuning on a single GPU. The Guanaco results were striking not because 99.3% of ChatGPT quality is an unambiguous standard, but because they demonstrated empirically that the NF4 quantization error introduced by freezing the base model was not accumulating into task degradation. NF4 with double quantization fully recovers the 16-bit LoRA performance on MMLU (Massive Multitask Language Understanding, a standard AI benchmark). The technique is not sacrificing measurable capability at the benchmark level for the memory savings it provides.
The downstream consequence is what practitioners mean when they speak about democratization in a non-trivial sense. A single 24GB graphics card can fine-tune a 33B LLaMA model — meaning a consumer-grade NVIDIA RTX 4090 or 3090 is sufficient. The Hugging Face PEFT library, which integrates QLoRA via the `bitsandbytes` backend, reduced the configuration to a handful of lines. LLaMA-3, Mistral, Qwen, DeepSeek — the open-weight ecosystem that has expanded dramatically since 2023 — can all be fine-tuned on hardware that university research groups and small companies can acquire. The question of whether your organization can technically fine-tune a large model collapsed, in the span of roughly two years, from a central strategic concern to a trivially answered prerequisite.
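What "a handful of lines" means concretely: the quantized loading path is a config object passed at load time, with the LoRA adapter attached on top. A minimal sketch, assuming recent `transformers` and `peft` versions and an illustrative model name:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 storage
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for each matmul
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
```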
To put the memory savings in concrete comparative terms: full fine-tuning of a 65B model in 16-bit precision requires approximately 780GB of GPU memory for weights plus optimizer states. Standard LoRA on the same model at 16-bit still requires roughly 120–150GB for the frozen weights alone, necessitating a multi-GPU setup. QLoRA compresses the frozen base to under 40GB through 4-bit quantization, then adds only the small adapter parameters on top — bringing the total well within a single 48GB GPU. From 780GB to 150GB to 48GB. That is the quantitative story of what happened to the hardware barrier between 2021 and 2023.
The Alternatives and Their Honest Tradeoffs
Positioning LoRA and QLoRA accurately requires a precise look at where they fit in the broader PEFT landscape, because the comparison reveals the tradeoffs practitioners keep encountering.
Adapter layers, the oldest of the PEFT approaches, insert small bottleneck modules between transformer sublayers — a down-projection to a lower dimension, a nonlinearity, and an up-projection back to full dimension. The original adapter paper by Houlsby et al. (2019) demonstrated that inserting adapters with as few as 3.6% of a BERT model's parameters could match full fine-tuning on GLUE benchmarks (the General Language Understanding Evaluation suite, a standard collection of NLP tasks). Adapters established the empirical case that task-relevant information could be captured in a tiny fraction of parameter space — the foundational premise LoRA later refined. But adapters work by inserting new computation into the forward pass. The down-projection and up-projection operations execute sequentially for every token at every adapted layer, and because these operations cannot be algebraically merged with the frozen weights they sit alongside, they add fixed latency per token that scales with the number of adapted layers and the adapter bottleneck dimension. At high-throughput serving scales, this latency accumulates. LoRA's key architectural advance over adapters was precisely the mergeable property: once training is complete, the BA product is added directly into W, and the forward pass sees a single matrix rather than a sequential chain of operations. The fine-tuned model runs at exactly the same speed as the base model.
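The structural difference is visible in a schematic Houlsby-style adapter. The nonlinearity between the two projections is exactly what prevents the algebraic merge that LoRA permits:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The GELU between down and up makes this block nonlinear in x,
        # so it cannot be folded into the neighboring frozen weights
        # and must execute sequentially on every forward pass.
        return x + self.up(self.act(self.down(x)))
```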
Prompt tuning and prefix tuning operate differently. Rather than modifying the model's weight structure, they prepend a small number of trainable continuous vectors to the input sequence. Prefix tuning is difficult to optimize and its performance changes non-monotonically with the number of trainable parameters. More fundamentally, reserving part of the sequence length for adaptation necessarily reduces the sequence length available to process the actual task. Every token consumed by a prefix is a token unavailable to the input. This becomes significant for long-context use cases. Prompt tuning also tends to underperform on tasks requiring significant behavioral modification rather than mild steering. It works best when the model already knows how to do the task and needs to do it in a particular style. A useful rule of thumb: prompt tuning is appropriate when you want to specialize a model that is already highly capable on a task type but needs to adopt a specific persona, format, or domain vocabulary. When the model lacks the underlying capability you need, no amount of soft prompt tokens will supply it — you will not teach a model arithmetic it did not learn during pretraining.
The decision logic among these approaches is cleaner than the literature sometimes makes it appear. Full fine-tuning is warranted when you have hyperscaler-grade compute, when the task is genuinely distant from the pretraining distribution in a way that requires deep weight modification across the entire network, or when you need maximum expressivity for a production model serving millions of requests against a single task. LoRA handles the broad middle ground: instruction tuning, domain adaptation, behavioral modification, style transfer. For almost all organizational fine-tuning efforts, it is the correct default. QLoRA extends LoRA into the memory-constrained regime — the same task, but when GPU count is low. The rank hyperparameter r is the primary dial: lower ranks (4–8) for simple behavioral changes, higher ranks (32–64) for substantial domain shifts or tasks requiring richer adaptation subspaces.
LoRA underperforms in settings that resemble pretraining — specifically those with very large datasets that exceed the storage limits of LoRA parameters. For dataset sizes typical in post-training, LoRA has sufficient capacity to store the essential information. LoRA is not a pretraining substitute. It assumes that the base model already contains most of the required knowledge and capability, and that fine-tuning is steering or sharpening, not teaching from scratch.
The following comparison summarizes where each approach belongs in a practical decision framework:
| Method | Trainable Parameters | Inference Latency vs. Base | Minimum GPU Memory (70B model) | Best For |
|---|---|---|---|---|
| Full fine-tuning | 100% | None | ~560GB+ | Maximum expressivity, large-scale task shift |
| Adapter layers | 1–5% | Added per layer | ~140GB (16-bit) | Multi-task serving with hot-swappable adapters |
| Prompt/prefix tuning | <1% | Context length cost | ~140GB (16-bit) | Style/format steering on capable base models |
| LoRA | 0.1–1% | None (post-merge) | ~140GB (16-bit) | General fine-tuning default |
| QLoRA | 0.1–1% | None (post-merge) | ~40GB (4-bit base) | Single-GPU or memory-constrained fine-tuning |
Instruction Tuning: The Data Side of the Equation
Understanding the fine-tuning landscape requires understanding instruction tuning as a category — not just as a technique but as a design problem with its own failure modes. Instruction tuning refers specifically to fine-tuning a language model on examples structured as instruction-input-output triples, with the objective of teaching the model to follow natural-language directives reliably. Pretraining on web text teaches a model to complete sequences, not to follow instructions. Instruction tuning bridges that gap by showing the model what "following an instruction" looks like in training data.
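The data structure itself is mundane, which is part of the point. A hypothetical Alpaca-style template, one of several conventions in use; chat models typically rely on the tokenizer's chat template instead:

```python
def format_example(instruction: str, input_text: str, output: str) -> str:
    """Render an instruction-input-output triple as one training string."""
    if input_text:
        prompt = (f"### Instruction:\n{instruction}\n\n"
                  f"### Input:\n{input_text}\n\n"
                  f"### Response:\n")
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt + output
```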
The earliest large-scale demonstration was FLAN (Fine-tuned LAnguage Net, Wei et al. 2021), which showed that fine-tuning a 137B model on 62 instruction-formatted NLP tasks produced a model that could generalize zero-shot to held-out tasks better than the base model. What made FLAN significant was not the compute involved but the insight that formatting matters: the same underlying capabilities packaged as instruction-following examples transferred more readily than raw pretraining text.
The Alpaca and Dolly datasets that followed democratized instruction tuning by providing open datasets, but they also introduced the quality problems that continue to plague organizational fine-tuning efforts. Alpaca's 52,000 GPT-3.5-generated examples contain a non-trivial rate of factual errors, formatting inconsistencies, and responses that optimize for length and hedging over accuracy — qualities that transfer directly to models trained on them. The OASST1 dataset used in Guanaco was more carefully filtered, which is part of why QLoRA's empirical results held up: the data quality, not just the method, was part of the experimental design.
In 2025, the standard for instruction tuning datasets has advanced considerably. ShareGPT-derived datasets, which capture real human-assistant conversations shared by users, have largely replaced synthetic self-instruct data for general instruction tuning because real conversational turns expose the model to a more diverse range of human intents and repair strategies. For domain-specific instruction tuning (legal, medical, financial, scientific), current practice involves constructing datasets from domain documents using structured extraction pipelines, then filtering on multiple quality signals including response length distribution, instruction diversity measured by embedding-space coverage, and consistency of annotation if human labelers are involved.
Speculative Decoding: The Inference Side of the Equation
The fine-tuning conversation typically stops at training, but for organizations deploying fine-tuned models at scale, inference costs are where the economics resolve. Speculative decoding is the most consequential inference efficiency technique of the current period, worth understanding alongside LoRA precisely because the two address different phases of the same operational problem.
The core constraint in autoregressive LLM inference is sequential: generating K tokens requires K full forward passes through the model, each depending on the token before it. For a 70B parameter model generating 500 tokens, that is 500 sequential full-model passes, and no amount of additional GPU compute can parallelize the chain. This is a memory bandwidth problem, not a compute problem: modern GPUs sit partially idle waiting for weights to load from DRAM during inference.
Speculative decoding breaks the bottleneck by using small, fast draft models to propose multiple tokens that a larger target model verifies in parallel, achieving 2–3x speedup without changing output quality. The output distribution guarantee is mathematically exact. The target model's verification step samples from an adjusted distribution that provably matches what the target model would have generated autoregressively. This is a guarantee, not an approximation.
The draft model proposes K tokens simultaneously. The target model then scores all K proposals in a single parallel forward pass, exploiting the GPU's ability to process a batch of tokens at once. If the draft proposes 8 tokens and the target accepts an average of 60% of them, each verification pass finalizes roughly 5 tokens versus 1 without speculation. The effective gain scales with that acceptance rate: how often does the draft model predict what the target would have predicted? The rate varies significantly by task. Predictable completions like code with clear patterns, formal writing, or structured data typically yield acceptance rates in the 0.75–0.85 range. Conversational generation is more variable. Creative generation, where the target model's predictions are themselves high-entropy, is where speculative decoding delivers the least gain.
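For a sharper version of that arithmetic, a common simplifying model treats each drafted token as accepted independently with probability p, which yields a closed-form expected yield per verification pass. This is an analytical idealization, stricter than the loose "fraction of drafts accepted" reading above, and it shows why the gain is so sensitive to acceptance behavior:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens finalized per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability p; on the first rejection the target substitutes its own
    token, and if all k drafts survive the target appends a bonus token.
    The expectation telescopes to (1 - p**(k + 1)) / (1 - p).
    """
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.60, 0.75, 0.85):
    print(p, round(expected_tokens_per_pass(p, k=8), 2))
# 0.6 -> 2.47, 0.75 -> 3.7, 0.85 -> 5.12
```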
The technique matured from research curiosity to production standard in 2025. Both vLLM and TensorRT-LLM (the two dominant open-source LLM serving frameworks) include native speculative decoding support, with NVIDIA demonstrating 3.6x throughput improvements on H200 GPUs. The EAGLE family of draft approaches, which trains lightweight heads that reuse the target model's own internal representations rather than running a fully separate model, achieves particularly high acceptance rates; EAGLE3 generally wins on acceptance rate and memory efficiency for this reason. An external draft model requires loading a second full model into GPU memory and sees lower acceptance rates on most tasks.
The connection to fine-tuning is more direct than it initially appears. When you fine-tune a model with LoRA on domain-specific data, you may also need a fine-tuned draft model to maintain high acceptance rates. A draft model trained on generic internet text will align poorly with a target model that has been heavily instruction-tuned for, say, legal document analysis — the draft will propose tokens that the specialized target consistently rejects, and the acceptance rate will drop below the threshold where speculative decoding provides net benefit. Architecture alignment matters: draft models from the same family as targets achieve higher acceptance. Llama 3.2-1B drafting for Llama 3.1-70B outperforms generic small models because training data and tokenization align. Fine-tuning the draft on the same domain distribution as the target, using QLoRA to keep the cost manageable, is increasingly standard practice in production serving pipelines. The economics are favorable: a QLoRA run on a 1B–3B draft model costs a fraction of the corresponding target fine-tune, and the acceptance rate improvement it produces can reduce serving costs by 30–50% at moderate traffic volumes — often recouping the fine-tuning cost within days of production deployment.
Where the Democratization Story Breaks Down
Here is where the clean narrative requires friction.
LoRA has made the question of whether you can fine-tune a large model nearly irrelevant. QLoRA combines LoRA's parameter efficiency with 4-bit quantization, enabling fine-tuning of 65B parameter models on a single 48GB GPU while preserving full 16-bit task performance, compressing memory requirements from more than 780GB to less than 48GB without degrading performance. The compute barrier is, for practical purposes, gone. And that is exactly where the harder problem becomes visible.
What almost no organization gets right on first attempt is the data. The quality, the composition, the labeling consistency, and above all the alignment between the fine-tuning objective and the actual deployed task. These are the dimensions that predict whether a fine-tuned model will outperform the base model in production, and they have nothing to do with whether you used rank 8 or rank 16.
Consider instruction tuning, the most common fine-tuning task for organizations deploying conversational AI. The Alpaca dataset — 52,000 examples generated by GPT-3.5 via self-instruct — was instrumental in demonstrating that small, high-quality instruction sets could meaningfully improve model behavior. The Guanaco model in the QLoRA paper used OASST1, a filtered subset of the Open Assistant dataset. QLoRA fine-tuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous best. But "high quality" is doing significant work in that sentence. OASST1 data was curated through multiple rounds of human annotation. Alpaca has well-documented quality issues — factual errors, inconsistent formatting, instruction-response pairs that reward verbosity over accuracy. Models fine-tuned on Alpaca often exhibit the behavior practitioners call "Alpacaism": verbose, hedged, somewhat pompous responses that sound like a language model doing its best impression of a helpful assistant rather than actually being one.
The catastrophic forgetting problem adds another layer of difficulty. When you fine-tune a model aggressively on a narrow domain, it can lose capabilities from the general pretraining distribution that you did not intend to modify. Many instruction-tuning efforts found that a few epochs with a low learning rate is enough to align a model; any more and you start degrading quality on other tasks. LoRA is somewhat protective against this, because the frozen base weights preserve pretraining knowledge and only the low-rank adapters are modified. But even within LoRA, overfitting on a small or homogeneous dataset can cause the adapter to overwhelm the base model's signal in the regions of weight space where the task data is concentrated.
A concrete example: organizations that fine-tune customer service models exclusively on support ticket transcripts — a genre characterized by specific vocabulary, short exchanges, and frequently repeated resolution patterns — often find that the resulting model handles on-script queries flawlessly while becoming notably worse at paraphrasing, summarization, or any task requiring the broader linguistic flexibility the base model had. The domain narrows the model in ways the fine-tuning evaluation, which tests the same support ticket distribution, never reveals.
The objective selection problem is arguably harder than the data selection problem. When an organization decides to fine-tune a model, it typically specifies a supervised objective — optimize the cross-entropy loss on demonstration examples of the desired behavior. The desired behavior in production is rarely fully captured by those demonstrations. A model fine-tuned to imitate the outputs of expert analysts might produce fluent, analyst-sounding text without having learned the judgment those analysts were applying. The form gets transferred; the reasoning does not.
This is why RLHF (reinforcement learning from human feedback) and its derivatives — DPO (Direct Preference Optimization), GRPO (Group Relative Policy Optimization), the training methods that power frontier models like Claude 4 and GPT-5 — add a preference signal on top of supervised fine-tuning. The preference signal is the mechanism by which the training objective gets reconnected to the actual human judgment you care about. LoRA and QLoRA are fully compatible with these objectives; GRPO can backpropagate through LoRA adapters just as easily as through supervised loss. But choosing the right objective requires understanding what you are trying to teach the model to do. That is a harder question than any infrastructure decision.
The Question That Replaced the Question
When picking the optimal learning rates for each setting, training progresses in an almost identical way for LoRAs with different sizes and full fine-tuning. That finding, from Thinking Machines Lab's 2025 analysis comparing LoRA and full fine-tuning across Llama-3 and Qwen3 models, captures something important about where the field has arrived. The parameter-efficiency methods work. They have been validated on MMLU, on coding benchmarks, on reasoning tasks. Both LoRA and full fine-tuning runs develop advanced reasoning behaviors such as backtracking, self-verification, and in-context exploration, visible in the lengthening of model chain-of-thought outputs. The methodological gap between compute-rich and compute-constrained fine-tuning has largely closed.
The judgment gap has not.
Every organization that deploys a fine-tuned model eventually confronts the same set of questions that no PEFT method can answer. What data did we fine-tune on, and is it representative of the production distribution? Did we measure the right things in evaluation? Did the fine-tuning improve task performance, or did it improve the model's ability to produce outputs that superficially resemble our training examples? Is the behavioral change we observed in development durable against the distribution shift that will occur when real users interact with it?
Current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots — this finding from the QLoRA paper itself, published in 2023, has only grown more relevant. Benchmark saturation has accelerated. MMLU, which was a challenging evaluation for GPT-3, is no longer meaningfully discriminative for frontier models. GPQA Diamond, SWE-bench Verified, ARC-AGI (the Abstraction and Reasoning Corpus, a benchmark designed to test novel reasoning) — the benchmarks keep advancing to stay ahead of the models, but organizational fine-tuning tasks rarely align cleanly with any of them. You are often left building your own evaluation, which requires knowing what the right evaluation questions are, which requires understanding the task deeply enough that you might wonder whether you needed the model for it in the first place.
The parallel to speculative decoding is apt. In an environment where verification cost is constant, the primary bottleneck is no longer how many tokens systems can check, but how accurately they can predict them. The same logic applies to fine-tuning. The infrastructure for training — doing the compute, running the gradient descent, merging the adapters — is essentially a commodity. The bottleneck has moved upstream, to the quality of the signal you are training on and the clarity of the behavior you are trying to produce.
A Framework for the Strategic Question
The compute question has been answered. The field now needs frameworks for the questions that replaced it. What follows is not a recipe — fine-tuning decisions are too context-dependent for recipes — but a set of diagnostic questions organized in the order they need to be resolved.
The first question is capability versus alignment. Is the base model already capable of performing the task when given a perfect prompt, and you need it to do so reliably without that prompt? Or does it genuinely lack the underlying capability the task requires? This distinction determines whether fine-tuning is the right intervention at all. If a model can solve your problem with careful prompting, the case for fine-tuning rests on latency, cost, and consistency — not on capability addition. If the model cannot solve the problem regardless of prompting, fine-tuning on demonstrations of the correct behavior may or may not help, depending on whether the required capability is present in latent form in the pretraining distribution or genuinely absent.
A 7B model that has never encountered clinical pharmacology in pretraining will not acquire it through instruction tuning on 10,000 drug interaction examples. You will get a model that speaks confidently in clinical register without the underlying pharmacological reasoning. A 70B model that has seen extensive medical literature but does not spontaneously apply it in the right format can be reliably shaped with far fewer examples.
The second question is distribution coverage. Can you characterize the production input distribution precisely enough to construct a training set that covers it? The production distribution for a customer-facing model is not the distribution of examples your subject matter experts consider representative — it is the distribution of what users will type, which includes edge cases, adversarial phrasings, ambiguous requests, and task combinations that no expert would generate in a controlled dataset construction exercise. Organizations that construct fine-tuning datasets entirely from expert-generated examples, without sampling from real or simulated user interactions, routinely find that their fine-tuned model handles the expert-designed cases beautifully and degrades on the queries it receives in deployment. The mitigation is to build evaluation sets from production-representative data before fine-tuning begins, not after — so that the evaluation exists independently of the training data and can detect distribution mismatch rather than simply confirming that the model learned the training distribution.
The third question is objective validity. Is the loss function you are optimizing a valid proxy for the outcome you care about? Supervised cross-entropy loss rewards the model for producing the next token in your training examples. If your training examples are high-quality demonstrations of the desired behavior, this is a reasonable proxy. Cross-entropy loss does not penalize confident incorrectness, does not reward calibrated uncertainty, and does not capture the preference ordering among multiple plausible responses. These gaps are precisely what preference optimization methods — DPO, RLHF, GRPO — are designed to address.
The practical question for any fine-tuning project is whether the behavioral gap between supervised fine-tuning and the actual desired behavior is small enough to ignore, or large enough to warrant the additional complexity of preference data collection and preference optimization training. For tasks where the acceptable response space is narrow and well-defined — format conversion, classification, structured extraction — supervised fine-tuning typically suffices. For tasks where quality is a matter of judgment and multiple responses would be plausible but some are meaningfully better — summarization, reasoning, open-ended generation — supervised fine-tuning on demonstrations alone is frequently insufficient, and the gap becomes visible in production.
What 2025 Has Clarified
The 2025 picture of fine-tuning is one of mature infrastructure meeting immature practice.
The tools are stable. Hugging Face's PEFT library, Unsloth (a memory-optimized LoRA training library), Axolotl (a configuration-driven fine-tuning pipeline), and Modal or RunPod (cloud GPU platforms) have collectively reduced the time from "I want to fine-tune this model" to "I have a running training job" to under an hour for a practitioner with basic familiarity. The open-weight model ecosystem — Llama 3.3, Qwen 2.5, Mistral Small, DeepSeek-V3, Gemma 3 — provides base models of sufficient quality that fine-tuning on even modest domain-specific datasets produces measurable improvement for specialized tasks. Hardware costs have continued to fall in real terms as H100 and A100 cloud availability has expanded and consumer GPUs have increased in VRAM.
The evaluation problem remains unsolved. The proliferation of capable base models has made it easy to produce fine-tuned models that score well on internal evaluations and disappoint in deployment, because the internal evaluation was never validated as a proxy for deployment performance. The organizations that have developed durable fine-tuning pipelines — those where fine-tuned models consistently outperform base models on the actual task — share a common practice: they build the evaluation before they build the training set, they treat the evaluation as the primary engineering artifact, and they iterate on data and objective before they iterate on LoRA rank or learning rate.
The parameter choices are, at this point, nearly irrelevant in comparison. A rank-8 LoRA trained on carefully curated, production-representative data with a well-specified objective will outperform a rank-64 LoRA trained on carelessly assembled data with a misspecified objective in virtually every real-world case the industry has documented.
The organizations that understand this will treat data curation and objective design with the same rigor they once reserved for infrastructure decisions. The organizations that do not will run QLoRA jobs on increasingly large datasets, evaluate on held-out splits of those same datasets, declare success, and then encounter the production distribution and find that their fine-tuned model is confidently wrong in the specific ways their training data made possible.
The compute problem was solved. The judgment problem is still yours to solve.
The hardest insight to operationalize: a fine-tuned model that scores well on your internal evaluation but fails in deployment has not failed because of the fine-tuning method. It has failed because the evaluation was not a valid proxy for the deployment task. That gap — between what you measured during training and what you actually needed — is the real engineering problem that LoRA and QLoRA, for all their genuine power, leave entirely untouched.