CONCEPT

Scaling Laws

The empirical relationships that predict how a language model's loss decreases with training compute, parameters, and data — the most reliable quantitative instrument the AI field has, and the reason investors have been willing to fund ten-figure training runs.
Scaling laws are the empirical power-law relationships between a transformer language model's training loss and the three inputs practitioners control most directly: compute, parameters, and training tokens. Discovered by Hestness et al. (2017) and formalized for language models by Kaplan et al. (2020) and Hoffmann et al. (2022, the "Chinchilla" paper), the relationships hold across five orders of magnitude of compute and have been the field's most reliable forecasting instrument for the past five years. They predict that doubling compute reduces loss by a known fraction, that the optimal parameter–data ratio scales predictably, and — most consequentially — that continued investment in scale will continue to produce capability gains until something structural breaks.

In The You On AI Encyclopedia

The Kaplan paper's headline finding was that language-model loss decreases as a power law in compute, parameters, and dataset size — and that the exponents of these power laws are stable across model sizes and architectures. The result established that scale is a first-order input to capability, not an engineering detail. It also established that the field could, for the first time, make forecasts grounded in measured regularities rather than expert opinion.
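The power-law claim is easy to operationalize: in log-log coordinates a power law is a straight line, so the exponent can be recovered by ordinary least squares and then extrapolated. A minimal sketch on synthetic data — the constants below are invented for illustration, not Kaplan et al.'s fitted values:

```python
import numpy as np

# Illustrative only: synthetic loss-vs-compute data following
# L(C) = a * C**(-b). The constants a and b are made up for this
# sketch, not taken from any published fit.
rng = np.random.default_rng(0)
compute = np.logspace(18, 24, 30)      # hypothetical training FLOPs
a_true, b_true = 1e3, 0.05
loss = a_true * compute**(-b_true) * np.exp(rng.normal(0, 0.01, compute.size))

# A power law is linear in log-log space: log L = log a - b * log C,
# so least squares on the logs recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate one order of magnitude beyond the fitted range -- the
# forecasting move that scaling-law papers make at larger scale.
loss_forecast = a_hat * (10 * compute[-1]) ** (-b_hat)
print(f"fitted exponent: {b_hat:.3f}")
```

The stability of the fitted exponent across scales is what turns this from curve-fitting into a forecast.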

Chinchilla (Hoffmann et al., 2022) revised Kaplan's prescriptions: compute-optimal training requires roughly 20 tokens per parameter, not the parameter-heavy regime the Kaplan paper suggested. The revision was consequential — it explained why older models like GPT-3 (parameter-rich but data-starved) underperformed newer models trained on more tokens. Chinchilla-style recipes have dominated frontier pretraining since, though the ratio has been pushed to 100+ tokens per parameter by Llama 3 and related work: training past the compute-optimal point yields a smaller model that is cheaper to serve, a trade-off formalized by inference-aware scaling laws (Sardana et al., 2023).
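Under the standard approximation that training a dense transformer costs about 6·N·D FLOPs (N parameters, D tokens), the Chinchilla rule D ≈ 20·N pins down both quantities for a given compute budget. A sketch, assuming that cost model (the function name is ours, not from the paper):

```python
import math

def chinchilla_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a compute budget between parameters and tokens.

    Uses the standard approximation C ~ 6*N*D FLOPs for transformer
    training plus the Chinchilla rule D ~ r*N. Substituting gives
    C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r)),  D = r * N.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~5.8e23 FLOPs -> ~70B params, ~1.4T tokens.
n, d = chinchilla_allocation(5.76e23)
print(f"params = {n/1e9:.0f}B, tokens = {d/1e12:.2f}T")
```

Raising `tokens_per_param` toward the 100+ regime shrinks the returned model for the same budget, which is exactly the inference-cost trade the newer recipes make.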

Large Language Models

The scaling laws are the reason for the present AI investment cycle. A company deciding whether to spend two billion dollars on a training run is making a bet grounded in the laws' forecast that the resulting model will be meaningfully more capable than its predecessor. This is a radical empirical claim — no other technology sector has ever had quantitative capability forecasts this reliable. The risk is that the laws have only been validated in one regime (natural-language text, transformer-family architectures, abundant web data) and that the regime is ending. Emergent capabilities complicate the picture further by introducing step-function gains the smooth loss curves do not predict.

The forecasting utility of the laws is uneven. They predict loss extremely well; they predict downstream capabilities (benchmark scores, reasoning, code generation) less well; they predict economic impact worse still. The current frontier of the field is determining whether the next order of magnitude in compute will continue to produce the capability gains the laws project — and whether, when the laws break, the break will be gentle (sub-linear returns to compute) or sharp (qualitative new phenomena). The answer will shape the next decade of AI trajectory.
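The loss-versus-capability gap has a simple toy explanation: if a benchmark score were a steep, threshold-like function of loss, a perfectly smooth loss curve would still produce an abrupt capability jump. The constants and the sigmoid link below are invented for illustration; no fitted relationship is implied.

```python
import numpy as np

# A smooth power-law loss curve (constants invented for illustration).
compute = np.logspace(18, 26, 200)
loss = 2.0 + 500.0 * compute ** (-0.12)

def toy_benchmark(loss, threshold=3.0, steepness=25.0):
    # Hypothetical link function: accuracy rises steeply once loss
    # crosses a threshold. Not a fitted model.
    return 1.0 / (1.0 + np.exp(steepness * (loss - threshold)))

acc = toy_benchmark(loss)

# Loss improves by a similar fraction every decade of compute, while
# almost all of the accuracy gain lands in a narrow compute band --
# the step the smooth loss curve does not advertise.
assert np.all(np.diff(loss) < 0)   # loss decreases smoothly throughout
```

Extrapolating `loss` here is easy; extrapolating `acc` requires also knowing the link function, which is precisely what the field lacks.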

Origin

The scaling-law research program began with Hestness et al.'s Deep Learning Scaling is Predictable, Empirically (2017). Kaplan et al.'s Scaling Laws for Neural Language Models (2020) generalized the approach to transformer LMs. Hoffmann et al.'s Training Compute-Optimal Large Language Models (2022) corrected the Kaplan recipe. Subsequent work by Anthropic, DeepMind, and independent researchers has extended the laws to multi-modal settings, reasoning benchmarks, and post-training stages.

Key Ideas

Power laws are the regularity. Loss decreases as a power of compute, parameters, and data — five-order-of-magnitude stability makes this the most robust empirical law in the field.

Next-Token Prediction

Optimal allocation is the operational insight. For a given compute budget, the Chinchilla ratio (tokens ≈ 20× parameters, now higher) guides frontier training.

Capability is not directly predicted. Loss scales cleanly; downstream capabilities scale less cleanly; economic impact barely at all.

The end of the regime is the frontier uncertainty. Whether continued scaling produces continued gains, and how the transition unfolds if it does not, is the highest-leverage open question.

Further Reading

  1. Hestness, Joel et al. Deep Learning Scaling is Predictable, Empirically (2017).
  2. Kaplan, Jared et al. Scaling Laws for Neural Language Models (2020).
  3. Hoffmann, Jordan et al. Training Compute-Optimal Large Language Models (Chinchilla, 2022).
  4. Henighan, Tom et al. Scaling Laws for Autoregressive Generative Modeling (2020).
  5. Sardana, Nikhil et al. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (2023).

Three Positions on Scaling Laws

From Chapter 15 — how the Boulder, the Believer, and the Beaver each read this concept
Boulder · Refusal
Han's diagnosis
The Boulder sees in Scaling Laws evidence of the pathology — that refusal, not adaptation, is the correct posture. The garden, the analog life, the smartphone that is not bought.
Believer · Flow
Riding the current
The Believer sees Scaling Laws as the river's direction — lean in. Trust that the technium, as Kevin Kelly argues, wants what life wants. Resistance is fear, not wisdom.
Beaver · Stewardship
Building dams
The Beaver sees Scaling Laws as an opportunity for construction. Neither refuse nor surrender — build the institutional, attentional, and craft governors that shape the river around the things worth preserving.

Read Chapter 15 in the book →
