Frontier & Future AI

Scaling Laws

One of the most consequential empirical discoveries in AI research is also one of the simplest to state: when you make a model bigger, train it on more data, or spend more compute, it gets better, and the improvement follows a predictable mathematical relationship. This is a scaling law. Scaling laws transform AI development from a craft guided by intuition into something closer to an engineering discipline. If you can predict, before spending millions of dollars on a training run, approximately how well the model will perform, you can make rational resource allocation decisions. You can decide whether to invest in more data, a larger model, or a longer training run. You can compare the cost-effectiveness of different architectural choices. Scaling laws gave AI labs a map where previously they had only compass bearings.

The Kaplan et al. Findings (2020)

In 2020, a team at OpenAI led by Jared Kaplan published a landmark paper establishing scaling laws for neural language models. Their key finding: the test loss of a language model follows a power-law relationship with each of three variables: model size N (number of parameters), dataset size D (number of tokens), and compute budget C (floating-point operations used for training). A power law means that if you plot loss against scale on a log-log graph, you get a straight line. Concretely, each order-of-magnitude increase in model size multiplies loss by a roughly constant factor, regardless of the starting scale. Increase the scale tenfold again and loss falls by the same proportion again.

The Kaplan team found that model size was the dominant factor: given a fixed compute budget, you should spend most of it on making the model larger while training it on relatively less data. This finding directly shaped GPT-3: a very large model of 175 billion parameters trained on a relatively modest 300 billion tokens. Equally important was what did not matter: architecture choices within the transformer family, such as the ratio of layers to width or the number of attention heads, had surprisingly little effect on the scaling law. Scale, not architecture tweaks, was the primary lever.
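
The straight-line-on-a-log-log-plot idea can be sketched in a few lines. The example below is only an illustration: the data points are synthetic, and the coefficient 12.0 and exponent 0.07 are made-up values chosen so the fit is easy to check, not figures taken from the Kaplan paper. It fits a line to log(loss) versus log(parameters) with NumPy and recovers the exponent.

```python
import numpy as np

# Synthetic (made-up) measurements: parameter counts and test losses.
# Generated from loss = 12.0 * N**-0.07 purely for illustration;
# these constants are assumptions, not values from any paper.
params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = 12.0 * params ** -0.07

# A power law  loss = a * N**(-alpha)  is a straight line in log-log space:
#   log10(loss) = log10(a) - alpha * log10(N)
slope, intercept = np.polyfit(np.log10(params), np.log10(loss), deg=1)
alpha = -slope
a = 10 ** intercept


def predict_loss(n_params: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return a * n_params ** (-alpha)


# Every 10x increase in N multiplies loss by the same factor, 10**(-alpha).
step_factor = 10 ** (-alpha)

print(f"fitted exponent alpha = {alpha:.3f}")
print(f"loss multiplier per 10x parameters = {step_factor:.3f}")
print(f"predicted loss at 1e11 parameters = {predict_loss(1e11):.3f}")
```

With real measurements the points would scatter around the line, and the fit would be trusted only within (or slightly beyond) the range of scales actually observed.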

A Power Law in Practice

If loss scales as a power law with model size, then every 10x increase in parameters multiplies loss by the same factor, regardless of starting scale. The proportional improvement never runs out, but each step costs roughly ten times more than the last, so the benefit per dollar of compute keeps shrinking. This predictability is why AI labs have continued investing in ever-larger training runs for half a decade.

In 2022, DeepMind researchers published a paper, colloquially called the Chinchilla paper, that refined the Kaplan findings with a critical correction: for a fixed compute budget, maximizing model size while undertraining on data is not optimal. The Chinchilla team showed that model size and training tokens should scale together in roughly equal proportion. For every doubling of model parameters, you should approximately double the training tokens. The key result: a 70-billion parameter model trained on 1.4 trillion tokens outperformed a 280-billion parameter model trained on only 300 billion tokens, despite having one quarter of the parameters. The larger model was undertrained on data. In the Chinchilla example the ratio works out to about 20 training tokens per parameter (1.4 trillion divided by 70 billion).

This finding reshaped the field. Most large models from around 2020 to 2022, including GPT-3, were overparameterized relative to their training data. The same compute would have achieved lower loss with a smaller model trained on much more data.

Deployment adds a further consideration: inference cost, the cost of running the model after training, also scales with model size. Post-Chinchilla, the industry therefore converged on a different trade-off: train models that are smaller than the compute-optimal size, but on far more data than Chinchilla recommends, producing models that are cheaper to run at deployment while still highly capable. Llama 3 and Mistral models exemplify this philosophy.
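
The compute-optimal split can be sketched numerically. The example below rests on two assumptions that should be read as rules of thumb rather than exact laws: training compute is approximated as C ≈ 6 · N · D FLOPs (a common approximation that does not appear in this lesson), and the token budget is about 20 tokens per parameter, the ratio implied by the 70-billion / 1.4-trillion example above. The function names are hypothetical.

```python
import math

# Assumption: training cost is roughly 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_TOKEN = 6
# Ratio implied by the Chinchilla example: 1.4e12 tokens / 70e9 parameters.
TOKENS_PER_PARAM = 20


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C ~ 6 * N * D."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) with D = 20 * N.

    Substituting D = 20 * N into C = 6 * N * D gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


# The two models compared in the text used broadly similar compute.
small_model_flops = training_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens
large_model_flops = training_flops(280e9, 300e9)   # 280B params, 300B tokens

n, d = compute_optimal(small_model_flops)
print(f"budget {small_model_flops:.2e} FLOPs -> {n:.2e} params, {d:.2e} tokens")

# Quadrupling compute doubles both the parameter count and the token count.
n4, d4 = compute_optimal(4 * small_model_flops)
print(f"4x budget -> {n4 / n:.1f}x params, {d4 / d:.1f}x tokens")
```

Because both quantities grow with the square root of compute under these assumptions, a lab that wants to double model size needs roughly four times the budget, not twice.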

Match each scaling law concept to its correct description.

Terms

Power-law scaling
Kaplan et al. (2020) key recommendation
Chinchilla finding (2022)
Compute-optimal training
Inference-optimal training

Definitions

Each order-of-magnitude increase in compute buys a consistent, predictable improvement in model loss
The model size and token count combination that achieves lowest loss for a given FLOP budget
For fixed compute, maximize model size over training tokens because model size is the dominant factor
Training a smaller model on more data than compute-optimal prescribes, reducing deployment cost while preserving capability
Model size and training tokens should scale together; many prior large models were undertrained on data

What Scaling Laws Do Not Tell You

Scaling laws are powerful, but they have important limits that practitioners must understand.

Scaling laws predict average loss on held-out text, a smooth statistical measure. They do not predict the emergence of specific abilities. A model might smoothly improve its next-token prediction loss as it scales, yet suddenly acquire the ability to perform multi-step arithmetic or write coherent code at a specific scale threshold. These jumps, called emergent capabilities, are not predicted by the smooth loss curve and are the subject of Lesson 5.

Scaling laws are empirical, not theoretical. They are best-fit curves to observed data, not derivations from first principles. There is no proof that power-law scaling will continue across all scales or all domains. Historical data covers roughly four to five orders of magnitude of compute. Whether scaling continues to produce improvements over the next four orders of magnitude is an open empirical question.

Scaling laws assume the data distribution stays consistent. The laws may break down when the architecture changes fundamentally or when publicly available high-quality text has been effectively exhausted. There is an active debate about whether synthetic data generated by models themselves can continue to fuel scaling.

Finally, scaling laws say nothing about alignment or safety. A larger model is not necessarily safer, more honest, or more reliably aligned with human values. Performance on downstream tasks can improve dramatically with scale while alignment-related properties such as calibration, honesty, and robustness improve only modestly or inconsistently.
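
The point above about scaling laws being best-fit curves rather than guarantees can be illustrated with a toy extrapolation. Everything in this sketch is hypothetical: the "true" curve is invented and includes a loss floor of 1.5 that no amount of compute removes, an assumption made only to show how a fit that looks reasonable over the observed range can mislead far outside it.

```python
import numpy as np

# Hypothetical ground truth with a floor the loss cannot go below.
# These constants are invented for illustration only.
FLOOR, COEF, EXPONENT = 1.5, 2.0, 0.3


def true_loss(compute):
    """Invented 'real' behaviour: a power law plus an irreducible floor."""
    return FLOOR + COEF * compute ** -EXPONENT


# Observations span four orders of magnitude of (relative) compute.
observed_compute = np.logspace(0, 4, num=5)
observed_loss = true_loss(observed_compute)

# Fit a pure power law: a straight line in log-log space.
slope, intercept = np.polyfit(
    np.log10(observed_compute), np.log10(observed_loss), deg=1
)


def extrapolate(compute):
    """Prediction from the fitted pure power law."""
    return 10 ** (intercept + slope * np.log10(compute))


# Extrapolate four more orders of magnitude beyond the data.
far = 1e8
predicted, actual = extrapolate(far), true_loss(far)
print(f"predicted loss at {far:.0e}: {predicted:.2f}")
print(f"actual loss at {far:.0e}:    {actual:.2f}")
```

In this toy setting the fitted line keeps falling while the invented true curve flattens, so the extrapolation is too optimistic. Whether real models behave this way at the next few orders of magnitude is exactly the open question described above.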

Scale Is Not the Whole Story

Scaling laws describe average loss on text prediction, a smooth statistical measure. They do not predict specific capabilities, alignment properties, or societal impact. A model with lower cross-entropy loss is not automatically better at reasoning, safer, or more honest. Using loss as a proxy for capability requires careful validation for each specific capability you care about.

An AI lab has a fixed compute budget of 10 million dollars for a training run. According to the Chinchilla paper, what is the approximately optimal strategy?

A company uses scaling law predictions to estimate that 10x more compute will reduce test loss from 2.5 to 2.1 nats. After running the experiment, they find test loss is exactly 2.1 as predicted, but the model has also developed a new ability to write syntactically correct code in a language it almost never saw in training. Which statement best describes this result?

Scaling Law Reasoning

  1. Work through these structured reasoning problems.
  2. Problem 1: A lab trains a 1-billion parameter model on 10 billion tokens and achieves a test loss of 3.2. Then a 10-billion parameter model on 10 billion tokens achieves 2.8. Then 100 billion parameters on 10 billion tokens achieves 2.5. Compute the improvement in loss, both as an absolute difference and as a ratio, for each order-of-magnitude increase in parameters. Is the proportional improvement roughly consistent, suggesting a power law? What would you predict for 1 trillion parameters on the same 10 billion tokens?
  3. Problem 2: A company has a 5-million-dollar compute budget. Using the Chinchilla equal-scaling rule, if they choose to train a 7-billion parameter model, approximately how many training tokens should they budget for? If they double to 14 billion parameters, what happens to the required training token count?
  4. Problem 3: A researcher argues that scaling laws prove AI will keep improving indefinitely. List three specific conditions or assumptions that would have to hold for this argument to be valid. For each condition, describe one scenario where it might fail.
  5. Discuss with your class: do scaling laws make AI improvement inevitable, or contingent on conditions that may not hold?