πŸ“„ DeepMind Β· 2022 Β· arXiv:2203.15556

Chinchilla

The paper that proved the entire AI industry was training models wrong β€” and named the fix after a rodent.

For a given compute budget, you get a better model by training a smaller model on more data than by training a huge model on too little.

// what everyone was doing wrong
❌ Before Chinchilla

The field followed the "Kaplan scaling laws" (2020): when compute grows, spend most of it on a bigger model rather than on more data. GPT-3 had 175B parameters trained on ~300B tokens. Everyone raced to build larger models β€” 500B, 1T parameters.

βœ“ Chinchilla's finding

Model size and training tokens should scale equally. For every doubling of parameters, double your training data too. GPT-3 was massively undertrained β€” it needed ~3.7 trillion tokens, not 300B.
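
A quick sanity check on that number, as a back-of-the-envelope sketch (the ~3.7T figure comes from the paper's fitted curves; the plain 20Γ— heuristic lands close by):

```python
# Back-of-the-envelope: the ~20-tokens-per-parameter heuristic applied to GPT-3.
gpt3_params = 175e9        # 175B parameters
tokens_per_param = 20      # Chinchilla rule of thumb

optimal_tokens = gpt3_params * tokens_per_param
print(f"{optimal_tokens:.1e} tokens")   # 3.5e+12, vs the ~3e11 GPT-3 actually saw
```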

DeepMind proved this by training 400+ models across a huge range of sizes and data budgets, then fitting the loss curves. Their 70B-parameter Chinchilla model β€” trained on 1.4 trillion tokens β€” beat the 280B-parameter Gopher on nearly every benchmark, for the same training compute.
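
For intuition, the fitted loss surface looks roughly like the sketch below (the constants are the paper's published Approach 3 fit, quoted approximately; the two calls compare the same-compute splits described above):

```python
# Parametric loss surface fitted by Hoffmann et al. (Approach 3):
#   L(N, D) = E + A / N^alpha + B / D^beta
# Published fit, approximately: E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28.
def predicted_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Roughly the same training compute, two different splits:
print(predicted_loss(280e9, 300e9))    # Gopher-style split      -> ~1.99
print(predicted_loss(70e9, 1.4e12))    # Chinchilla-style split  -> ~1.94 (lower loss)
```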

// the rule of thumb
20Γ—
tokens per param
Optimal token count β‰ˆ 20Γ— the number of parameters
70B
Chinchilla size
Beat 280B Gopher using ΒΌ the parameters and almost 5Γ— the data
1.4T
Training tokens
Roughly 4–5Γ— the data of comparable models at the time (Gopher and GPT-3 both saw ~300B tokens)
// the analogy

Think of it like studying for an exam. You have 10 hours. You could spend all 10 hours reading one thick textbook once (big model, little data) β€” or spend 5 hours on a thinner book, reading it twice, carefully, with practice problems (smaller model, more data). Chinchilla showed the second strategy wins.

The compute budget is fixed. The question is how to split it between model capacity and learning examples.
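
Here is that trade-off made concrete, assuming the standard ~6 FLOPs per parameter per training token estimate (a sketch; the budget value is roughly Chinchilla's, from the reference list below):

```python
# With the budget C fixed, choosing the model size N pins down the data D = C / (6 * N).
C = 5.8e23   # roughly Chinchilla's training budget in FLOPs (Gopher's was similar)

for n_params in (280e9, 70e9, 10e9):
    n_tokens = C / (6 * n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> {n_tokens / 1e12:5.2f}T tokens "
          f"({n_tokens / n_params:4.0f} tokens/param)")

# Output (approx.):
#  280B params ->  0.35T tokens (   1 tokens/param)   <- Gopher-style split
#   70B params ->  1.38T tokens (  20 tokens/param)   <- Chinchilla's split
#   10B params ->  9.67T tokens ( 967 tokens/param)   <- heavily over-trained
```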

// compute-optimal model estimator

Given a compute budget, Chinchilla's formula tells you the optimal number of parameters and training tokens. Drag the sliders to explore.

Compute budget slider: 10²⁰ FLOPs (small) to 10²⁢ FLOPs (frontier)
// famous models for reference
GPT-3 β€” ~3.1Γ—10Β²Β³ FLOPs
Chinchilla β€” ~5.8Γ—10Β²Β³ FLOPs
LLaMA-2 70B β€” ~10²⁴ FLOPs
GPT-4 est. β€” ~10²⁡ FLOPs
OPTIMAL ALLOCATION (live readout): optimal parameters, training tokens, token/param ratio (β‰ˆ20Γ—), and compute budget for the selected split.
// the formula
C = compute budget (FLOPs), with C β‰ˆ 6 Γ— N Γ— D
N = optimal parameters β‰ˆ √(C / 120) β‰ˆ 0.091 Γ— C^0.5
D = optimal tokens β‰ˆ C / (6N) β‰ˆ 20 Γ— N
// follows from C β‰ˆ 6ND with D β‰ˆ 20N; exponents match Hoffmann et al. 2022, Approach 1 (N and D both scale as C^0.5)
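
A minimal sketch of such an estimator, derived only from the two relations above (C β‰ˆ 6ND and D β‰ˆ 20N) rather than the paper's exact fitted coefficients:

```python
import math

# Compute-optimal split under C ~= 6*N*D and D ~= 20*N:
#   6 * N * (20 * N) = C  =>  N = sqrt(C / 120),  D = C / (6 * N)
def compute_optimal(c_flops: float) -> tuple[float, float]:
    n_params = math.sqrt(c_flops / 120)
    n_tokens = c_flops / (6 * n_params)
    return n_params, n_tokens

n, d = compute_optimal(5.8e23)   # roughly Chinchilla's budget
print(f"params ~ {n:.1e}, tokens ~ {d:.1e}, ratio ~ {d / n:.0f}x")
# -> params ~ 7.0e+10, tokens ~ 1.4e+12, ratio ~ 20x
```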
// how it changed the field
2020
Kaplan Scaling Laws (OpenAI)
Showed loss falls predictably with compute, and concluded that extra compute should go mostly into model size rather than data. Industry races toward 500B–1T parameter models.
March 2022
⭐ Chinchilla published (DeepMind)
70B model trained on 1.4T tokens beats 280B Gopher. The 20Γ— rule published. Kaplan's conclusion overturned: data matters as much as model size.
Feb 2023
LLaMA (Meta)
Meta's open models build directly on Chinchilla's lesson and push it further: LLaMA-7B, trained on 1T tokens (~140 tokens per parameter), punches far above its weight class.
2023–2024
Industry-wide pivot
Mistral 7B, Gemma, Phi-series β€” all follow "train smaller models longer." The frontier shifts: efficiency over raw size. Inference cost becomes the new battleground.
2024–2025
Post-Chinchilla era
New work and industry practice push past Chinchilla: models like Llama 3 train far beyond the compute-optimal point (15T+ tokens for an 8B model), because training compute is paid once while a smaller model saves on inference for every token it ever serves.
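
A rough way to see why, assuming the standard ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token; the serving volume is a made-up illustrative number:

```python
# Lifetime compute ~= training (paid once) + inference (paid on every served token).
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

served = 10e12   # hypothetical: 10T tokens served over the model's deployed lifetime

# A compute-optimal 70B vs. a smaller model over-trained far past the 20x point:
print(f"70B, 1.4T train tokens: {lifetime_flops(70e9, 1.4e12, served):.1e}")   # ~2.0e+24
print(f" 8B,  15T train tokens: {lifetime_flops(8e9, 15e12, served):.1e}")     # ~8.8e+23
```

The smaller model is not automatically as capable, but its extra training tokens are paid once, while the per-token serving savings compound over the model's lifetime.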
// why it still matters

Chinchilla didn't just find a better recipe β€” it revealed that the field had been operating on wrong assumptions for years. The lesson goes beyond the specific 20Γ— number: you can't optimize what you don't measure correctly. Every major lab now trains hundreds of small "scaling experiments" before committing to a large run.

The name? DeepMind named each experiment after an animal. The 70B model happened to be a chinchilla. 🐭