Transfer Learning and Fine-Tuning
Training GPT-3 from scratch cost an estimated four to twelve million dollars in compute. Very few organizations can afford this. Yet GPT-3's descendants power applications at thousands of companies. How? Through transfer learning: the idea that knowledge learned while solving one task on one dataset can be repurposed for different tasks on different datasets — often with a fraction of the original effort. Transfer learning is the reason a small team with a modest GPU budget can build a state-of-the-art image classifier or chatbot today.
What Pretraining Learns
A model pretrained on a large general dataset develops internal representations that are broadly useful. Consider a CNN pretrained on ImageNet (1.2 million labeled photographs across 1,000 categories). The early layers learn general-purpose edge detectors and texture filters. Middle layers learn shapes and object parts. Late layers learn category-specific features. This hierarchy is not specific to ImageNet; it reflects the structure of natural images in general. When you take this pretrained CNN and apply it to a new task (say, classifying skin lesions in dermatology photographs), the early and middle layer representations are still relevant: skin lesions have edges, textures, and shapes. You do not need to relearn how to detect edges from scratch.
For language models, pretraining on large text corpora teaches the model grammar, facts about the world, reasoning patterns, and how language is used in context. A model pretrained on Wikipedia and books has implicitly learned that Paris is the capital of France, that arguments have premises and conclusions, that code has structure. These capabilities transfer to downstream tasks: summarization, question answering, classification.
The formal justification for transfer learning is that the source task and target task share underlying structure: features useful for one are useful for the other. When this assumption holds, transfer is powerful. When it does not, because the source and target domains are radically different, transfer may fail or even hurt performance (negative transfer).
Pretraining can be understood as learning a feature extractor: a function from raw input to a rich internal representation. In its simplest form, fine-tuning then trains a relatively simple head (often a single linear layer) on top of those features. The expensive, data-hungry part of learning is done once and shared; the cheap, task-specific part is done per application.
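The split between a shared feature extractor and a cheap task-specific head can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a hypothetical stand-in for a pretrained network (a fixed function with no trainable parameters), the target task is a synthetic "inside the unit circle" classification, and the head is a single linear layer trained by plain gradient descent. It is not a recipe for a real model, only a picture of which part gets trained.

```python
import math
import random

random.seed(0)

def frozen_backbone(x):
    """Hypothetical stand-in for a pretrained network: a fixed map from a
    raw 2-D input to a 5-D feature vector. It has no trainable parameters."""
    x0, x1 = x
    return [x0, x1, x0 * x0, x1 * x1, x0 * x1]

def make_dataset(n):
    """Toy target task: label 1.0 if the point lies inside the unit circle."""
    points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(n)]
    labels = [1.0 if x0 * x0 + x1 * x1 < 1.0 else 0.0 for x0, x1 in points]
    return points, labels

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp so math.exp stays in range
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, f):
    """Probability of class 1 from the linear head applied to features f."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)

def train_head(features, labels, epochs=300, lr=0.5):
    """Fit a single linear layer (logistic regression) by batch gradient descent.
    Only w and b change; the backbone is never touched."""
    w, b, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(features, labels):
            err = predict(w, b, f) - y
            grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

train_x, train_y = make_dataset(300)
test_x, test_y = make_dataset(200)

# Because the backbone is frozen, its features can be computed once and cached.
train_f = [frozen_backbone(x) for x in train_x]
test_f = [frozen_backbone(x) for x in test_x]

w, b = train_head(train_f, train_y)
correct = sum((predict(w, b, f) > 0.5) == (y == 1.0) for f, y in zip(test_f, test_y))
accuracy = correct / len(test_y)
print(f"held-out accuracy of the linear head: {accuracy:.2f}")
```

The raw inputs are not linearly separable, but the backbone's features are, so six trainable numbers (five weights and a bias) are enough. With a real pretrained network the same pattern applies: run the inputs through the frozen model once, cache the embeddings, and train only the head.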
Fine-tuning strategies span a spectrum from minimal to extensive parameter changes.
Feature extraction (frozen backbone): The pretrained model's weights are frozen entirely. The model acts as a fixed feature extractor: inputs are passed through it to produce embeddings, and only a new output head (a small classifier or regressor) is trained from scratch on the target task. This is appropriate when the target dataset is very small (hundreds of examples), because updating the pretrained weights with so little data risks overfitting.
Full fine-tuning: All pretrained weights are updated during training on the target task, using a small learning rate (typically 10 to 100 times smaller than the pretraining learning rate) to avoid overwriting useful representations. This is appropriate when the target dataset is large enough (tens of thousands of examples or more) and the target task is somewhat different from the pretraining task.
Layer-wise fine-tuning: An intermediate approach that freezes the early layers (which contain general features) and fine-tunes the later layers (which contain task-specific features). The number of frozen layers is a hyperparameter to tune.
Parameter-efficient fine-tuning (PEFT): As models grew to billions of parameters, full fine-tuning became impractical even for well-resourced teams, because storing and updating gradients for 70 billion parameters is expensive. PEFT methods adapt models while updating only a small fraction of parameters.
LoRA (Low-Rank Adaptation) is the most widely adopted PEFT method. The key observation: the weight updates during fine-tuning tend to be low-rank, meaning most of the change in the weight matrix can be approximated by the product of two small matrices. LoRA freezes the original weight matrix W and adds a trainable low-rank decomposition: W' = W + A × B, where A has shape (d × r) and B has shape (r × k), with rank r much smaller than d or k. Training only A and B (perhaps 0.1% of total parameters) often comes close to the performance of full fine-tuning on many tasks.
Instruction tuning is a form of fine-tuning specific to language models: training on examples of (instruction, output) pairs teaches the model to follow natural language directions.
RLHF (Reinforcement Learning from Human Feedback) adds a further stage where human preferences between outputs guide the model toward more helpful, accurate, and harmless responses. It is the technique behind ChatGPT's alignment.
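To make the LoRA decomposition W' = W + A × B concrete, the sketch below counts trainable parameters and runs a tiny forward pass in plain Python. The matrix sizes (d = k = 4096, r = 8, and the small 6 × 4 example) are illustrative assumptions, not the dimensions of any particular model, and `matmul`, `lora_param_counts`, and `lora_forward` are hypothetical helper names. One factor is initialized to zero, as in the original LoRA paper, so the adapted layer starts out identical to the frozen one.

```python
import random

random.seed(0)

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_counts(d, k, r):
    """Trainable parameters: updating all of W (d x k) vs. only A (d x r) and B (r x k)."""
    return d * k, d * r + r * k

# Illustrative sizes, assumed for this example.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
fraction = lora / full
print(full, lora, fraction)  # 16777216 65536 0.00390625

# Tiny forward pass using the shapes from the text: y = x W + (x A) B.
d, k, r = 6, 4, 2
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]   # trainable, d x r
B = [[0.0] * k for _ in range(r)]                                  # trainable, r x k, zero init
x = [[random.gauss(0, 1) for _ in range(d)]]                       # one input row, 1 x d

def lora_forward(x, W, A, B):
    """Apply the frozen weights plus the low-rank update without forming W + A B."""
    base = matmul(x, W)                # 1 x k
    delta = matmul(matmul(x, A), B)    # (1 x r) times (r x k) gives 1 x k
    return [[u + v for u, v in zip(br, dr)] for br, dr in zip(base, delta)]

y_start = lora_forward(x, W, A, B)     # equals x W because B is all zeros

# Pretend training has moved B away from zero, then merge the adapter into W.
B = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(r)]
AB = matmul(A, B)
W_merged = [[w + u for w, u in zip(wr, ur)] for wr, ur in zip(W, AB)]
y_adapter = lora_forward(x, W, A, B)
y_merged = matmul(x, W_merged)
max_diff = max(abs(p - q) for p, q in zip(y_adapter[0], y_merged[0]))
```

For this single 4096 × 4096 matrix, the adapter has about 0.4% as many trainable parameters as the matrix itself. The fraction for a whole model depends on the rank r and on which weight matrices receive adapters. The final check shows that the adapter path and the merged matrix W + A × B produce the same output up to floating-point error, which is why an adapter can be either kept separate or folded into the base weights.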
Practical Decision-Making
Choosing a fine-tuning strategy requires answering three questions.
How much target data do you have? With fewer than 1,000 labeled examples, freeze most of the backbone. With tens of thousands, consider full or layer-wise fine-tuning. With millions, you may benefit from training from scratch if your domain is highly specialized.
How different is the target domain from the pretraining domain? A model pretrained on English text will transfer well to English summarization (similar domain) but poorly to Swahili poetry (different language) or electrocardiogram classification (entirely different modality). Domain shift is the primary enemy of transfer learning.
What are your compute and storage constraints? Each fully fine-tuned version of a 70-billion-parameter model is another complete copy of all 70 billion parameters to store, which is impractical if you need dozens of task-specific variants. LoRA adapters are orders of magnitude smaller and can be swapped at inference time without changing the base model.
One critical caution: fine-tuning does not remove the model's limitations. A language model that hallucinates facts after pretraining will continue to hallucinate after fine-tuning unless the fine-tuning data specifically targets this. Fine-tuning shapes behavior, but it does not reliably add knowledge that is absent from pretraining.
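The three questions above can be written down as a rough rule of thumb, together with the storage arithmetic behind the adapter argument. This is a sketch under stated assumptions, not a definitive policy: `choose_strategy` is a hypothetical helper, the thresholds are the ones quoted in the text, the choice for the 1,000 to 10,000 range (which the text does not cover) is an assumption, and the adapter configuration (320 adapted matrices of size 8192 × 8192 at rank 16, stored at 2 bytes per parameter) is invented purely for illustration.

```python
def choose_strategy(n_labeled, domain_is_similar, needs_many_variants=False):
    """Map the three questions to a starting point. Thresholds are rules of thumb."""
    if n_labeled < 1_000:
        return "feature extraction (frozen backbone)"
    if needs_many_variants:
        # Storage-constrained: keep one base model and swap small adapters.
        return "LoRA (parameter-efficient fine-tuning)"
    if n_labeled >= 1_000_000 and not domain_is_similar:
        return "consider training from scratch"
    if n_labeled >= 10_000 and not domain_is_similar:
        return "full fine-tuning"
    # Assumption: the text gives no guidance between 1,000 and 10,000 examples,
    # so this sketch falls back to the intermediate option.
    return "layer-wise fine-tuning"

BYTES_PER_PARAM = 2  # assumes 16-bit weights

def full_copy_bytes(n_params):
    """Storage for one fully fine-tuned copy of the model."""
    return n_params * BYTES_PER_PARAM

def lora_adapter_bytes(n_matrices, d, k, r):
    """Storage for one adapter: an A (d x r) and a B (r x k) per adapted matrix."""
    return n_matrices * (d * r + r * k) * BYTES_PER_PARAM

full_bytes = full_copy_bytes(70_000_000_000)                             # 140_000_000_000
adapter_bytes = lora_adapter_bytes(n_matrices=320, d=8192, k=8192, r=16)  # 167_772_160
ratio = full_bytes / adapter_bytes

print(choose_strategy(50_000, domain_is_similar=False))
print(f"full copy: {full_bytes / 1e9:.0f} GB, adapter: {adapter_bytes / 1e6:.0f} MB")
```

Under these assumed numbers a full copy takes 140 GB while one adapter takes about 168 MB, a gap of several hundred times. The exact ratio moves with the rank and the number of adapted matrices, but it shows why dozens of task-specific variants are manageable as adapters and unwieldy as full copies.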
A model that confidently states false information after pretraining will often do the same after fine-tuning. Fine-tuning adjusts style, format, and task-specific behavior; it does not reliably instill factual accuracy. Retrieval-augmented generation (grounding outputs in retrieved documents) is a more reliable remedy for factual hallucination.
A research team has 500 labeled medical images and a CNN pretrained on 1.2 million natural photographs. Which fine-tuning strategy is most appropriate?
LoRA trains matrices A (d×r) and B (r×k) where r is much smaller than d and k. Why does this dramatically reduce the number of trainable parameters?
Design a Transfer Learning Pipeline
- Step 1. Choose one of these target tasks: (a) classifying satellite images of land use, (b) detecting spam in SMS messages, (c) identifying plant diseases from leaf photographs.
- Step 2. Identify a publicly available pretrained model that could serve as your backbone. Justify the choice: what did it pretrain on, and why does that transfer?
- Step 3. Estimate how much labeled target data you would realistically be able to collect. Based on this, decide between feature extraction, LoRA, or full fine-tuning. Justify the choice.
- Step 4. Identify the biggest domain shift risk: in what way does your target domain differ from the pretraining domain?
- Step 5. Describe one specific failure mode you would test for before deploying, and how you would detect it.