πŸ“„ DeepMind Β· 2022 Β· arXiv:2203.15556

Chinchilla

The paper that proved the entire AI industry was training models wrong β€” and named the fix after a rodent.

For a given compute budget, you get a better model by training a smaller model on more data than by training a huge model on too little.

// what everyone was doing wrong
❌ Before Chinchilla

The field followed the "Kaplan scaling laws" (2020): when compute grows, spend most of it on a bigger model rather than on more data. GPT-3 had 175B parameters trained on ~300B tokens. Everyone raced to build larger models β€” 500B, 1T parameters.

βœ“ Chinchilla's finding

Model size and training tokens should scale equally. For every doubling of parameters, double your training data too. GPT-3 was massively undertrained β€” it needed ~3.7 trillion tokens, not 300B.
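
A quick sanity check on that number, as a back-of-the-envelope sketch (the ~3.7T figure comes from the paper's fitted curves; the plain 20Γ— heuristic lands close by):

```python
# Back-of-the-envelope: the ~20-tokens-per-parameter heuristic applied to GPT-3.
gpt3_params = 175e9        # 175B parameters
tokens_per_param = 20      # Chinchilla rule of thumb

optimal_tokens = gpt3_params * tokens_per_param
print(f"{optimal_tokens:.1e} tokens")   # 3.5e+12, vs the ~3e11 GPT-3 actually saw
```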

DeepMind proved this by training 400+ models across a huge range of sizes and data budgets, then fitting the loss curves. Their 70B-parameter Chinchilla model β€” trained on 1.4 trillion tokens β€” beat the 280B-parameter Gopher on nearly every benchmark, for the same training compute.
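
For intuition, the fitted loss surface looks roughly like the sketch below (the constants are the paper's published Approach 3 fit, quoted approximately; the two calls compare the same-compute splits described above):

```python
# Parametric loss surface fitted by Hoffmann et al. (Approach 3):
#   L(N, D) = E + A / N^alpha + B / D^beta
# Published fit, approximately: E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28.
def predicted_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Roughly the same training compute, two different splits:
print(predicted_loss(280e9, 300e9))    # Gopher-style split      -> ~1.99
print(predicted_loss(70e9, 1.4e12))    # Chinchilla-style split  -> ~1.94 (lower loss)
```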

// the rule of thumb
20Γ—
tokens per param
Optimal token count β‰ˆ 20Γ— the number of parameters
70B
Chinchilla size
Beat 280B Gopher using ΒΌ the parameters and almost 5Γ— the data
1.4T
Training tokens
Roughly 4–5Γ— the data of comparable models at the time (Gopher and GPT-3 both saw ~300B tokens)
// the analogy

Think of it like studying for an exam. You have 10 hours. You could spend all 10 hours reading one thick textbook once (big model, little data) β€” or spend 5 hours on a thinner book, reading it twice, carefully, with practice problems (smaller model, more data). Chinchilla showed the second strategy wins.

The compute budget is fixed. The question is how to split it between model capacity and learning examples.
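
Here is that trade-off made concrete, assuming the standard ~6 FLOPs per parameter per training token estimate (a sketch; the budget value is roughly Chinchilla's, from the reference list below):

```python
# With the budget C fixed, choosing the model size N pins down the data D = C / (6 * N).
C = 5.8e23   # roughly Chinchilla's training budget in FLOPs (Gopher's was similar)

for n_params in (280e9, 70e9, 10e9):
    n_tokens = C / (6 * n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> {n_tokens / 1e12:5.2f}T tokens "
          f"({n_tokens / n_params:4.0f} tokens/param)")

# Output (approx.):
#  280B params ->  0.35T tokens (   1 tokens/param)   <- Gopher-style split
#   70B params ->  1.38T tokens (  20 tokens/param)   <- Chinchilla's split
#   10B params ->  9.67T tokens ( 967 tokens/param)   <- heavily over-trained
```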

// compute-optimal model estimator

Given a compute budget, Chinchilla's formula tells you the optimal number of parameters and training tokens. Drag the sliders to explore.

Compute budget slider: 10²⁰ FLOPs (small) to 10²⁢ FLOPs (frontier)
// famous models for reference
GPT-3 β€” ~3.1Γ—10Β²Β³ FLOPs
Chinchilla β€” ~5.8Γ—10Β²Β³ FLOPs
LLaMA-2 70B β€” ~10²⁴ FLOPs
GPT-4 est. β€” ~10²⁡ FLOPs
OPTIMAL ALLOCATION (live readout): optimal parameters, training tokens, token/param ratio (β‰ˆ20Γ—), and compute budget for the selected split.
// the formula
C = compute budget (FLOPs), with C β‰ˆ 6 Γ— N Γ— D
N = optimal parameters β‰ˆ √(C / 120) β‰ˆ 0.091 Γ— C^0.5
D = optimal tokens β‰ˆ C / (6N) β‰ˆ 20 Γ— N
// follows from C β‰ˆ 6ND with D β‰ˆ 20N; exponents match Hoffmann et al. 2022, Approach 1 (N and D both scale as C^0.5)
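
A minimal sketch of such an estimator, derived only from the two relations above (C β‰ˆ 6ND and D β‰ˆ 20N) rather than the paper's exact fitted coefficients:

```python
import math

# Compute-optimal split under C ~= 6*N*D and D ~= 20*N:
#   6 * N * (20 * N) = C  =>  N = sqrt(C / 120),  D = C / (6 * N)
def compute_optimal(c_flops: float) -> tuple[float, float]:
    n_params = math.sqrt(c_flops / 120)
    n_tokens = c_flops / (6 * n_params)
    return n_params, n_tokens

n, d = compute_optimal(5.8e23)   # roughly Chinchilla's budget
print(f"params ~ {n:.1e}, tokens ~ {d:.1e}, ratio ~ {d / n:.0f}x")
# -> params ~ 7.0e+10, tokens ~ 1.4e+12, ratio ~ 20x
```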
// how it changed the field
2020
Kaplan Scaling Laws (OpenAI)
Showed loss falls predictably with compute, and concluded that extra compute should go mostly into model size rather than data. Industry races toward 500B–1T parameter models.
March 2022
⭐ Chinchilla published (DeepMind)
70B model trained on 1.4T tokens beats 280B Gopher. The 20Γ— rule published. Kaplan's conclusion overturned: data matters as much as model size.
Feb 2023
LLaMA (Meta)
Meta's open models build directly on Chinchilla's lesson and push it further: LLaMA-7B, trained on 1T tokens (~140 tokens per parameter), punches far above its weight class.
2023–2024
Industry-wide pivot
Mistral 7B, Gemma, Phi-series β€” all follow "train smaller models longer." The frontier shifts: efficiency over raw size. Inference cost becomes the new battleground.
2024–2025
Post-Chinchilla era
New work and industry practice push past Chinchilla: models like Llama 3 train far beyond the compute-optimal point (15T+ tokens for an 8B model), because training compute is paid once while a smaller model saves on inference for every token it ever serves.
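
A rough way to see why, assuming the standard ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token; the serving volume is a made-up illustrative number:

```python
# Lifetime compute ~= training (paid once) + inference (paid on every served token).
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

served = 10e12   # hypothetical: 10T tokens served over the model's deployed lifetime

# A compute-optimal 70B vs. a smaller model over-trained far past the 20x point:
print(f"70B, 1.4T train tokens: {lifetime_flops(70e9, 1.4e12, served):.1e}")   # ~2.0e+24
print(f" 8B,  15T train tokens: {lifetime_flops(8e9, 15e12, served):.1e}")     # ~8.8e+23
```

The smaller model is not automatically as capable, but its extra training tokens are paid once, while the per-token serving savings compound over the model's lifetime.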
// why it still matters

Chinchilla didn't just find a better recipe β€” it revealed that the field had been operating on wrong assumptions for years. The lesson goes beyond the specific 20Γ— number: you can't optimize what you don't measure correctly. Every major lab now trains hundreds of small "scaling experiments" before committing to a large run.

The name? DeepMind named each experiment after an animal. The 70B model happened to be a chinchilla. 🐭