The paper that proved the entire AI industry was training models wrong, and named the fix after a rodent.
For a given compute budget, you'll get a better model by training a smaller model on more data than by training a huge model on too little.
The field followed the "Kaplan scaling laws" (2020): when compute grows, put most of it into a bigger model rather than more data. GPT-3 had 175B parameters trained on ~300B tokens. Everyone raced to build larger models: 500B, 1T parameters.
Model size and training tokens should scale equally. For every doubling of parameters, double your training data too. GPT-3 was massively undertrained: it needed ~3.7 trillion tokens, not 300B.
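Where does that 3.7T come from? A rough sketch, using the commonly quoted ~20-tokens-per-parameter rule of thumb distilled from the paper (the paper's own fitted curves put the optimum for a 175B model slightly higher, at ~3.7T):

```python
# Back-of-the-envelope Chinchilla rule of thumb: ~20 training tokens per parameter.
# (Approximate ratio; the paper's fitted estimate for 175B params is ~3.7T tokens.)
gpt3_params = 175e9
tokens_per_param = 20
optimal_tokens = gpt3_params * tokens_per_param

print(f"optimal tokens: {optimal_tokens:.1e}")            # ~3.5e12, i.e. ~3.5 trillion
print(f"undertrained by ~{optimal_tokens / 300e9:.0f}x")   # vs. the ~300B actually used
```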
DeepMind proved this by training 400+ models across a huge range of sizes and data budgets, then fitting the loss curves. Their 70B-parameter Chinchilla model, trained on 1.4 trillion tokens, beat the 280B-parameter Gopher on nearly every benchmark, using the same compute to train.
Think of it like studying for an exam. You have 10 hours. You could spend all 10 hours reading one thick textbook once (big model, little data), or spend 5 hours on a thinner book, reading it twice, carefully, with practice problems (smaller model, more data). Chinchilla showed the second strategy wins.
The compute budget is fixed. The question is how to split it between model capacity and training data.
Given a compute budget, Chinchilla's formula tells you the optimal number of parameters and training tokens.
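A minimal sketch of that split, assuming the standard approximation that training compute is C ≈ 6·N·D FLOPs and the ~20-tokens-per-parameter ratio (so D ≈ 20·N). The helper below is hypothetical and uses the rule of thumb, not the paper's exact fitted scaling law:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into parameters (N) and training tokens (D).

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)) and D = tokens_per_param * N.
    Rule-of-thumb sketch, not the paper's fitted formula.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's own budget: ~6 * 70e9 params * 1.4e12 tokens ~= 5.9e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")  # ~70B params, ~1.4T tokens
```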
Chinchilla didn't just find a better recipe: it revealed that the field had been operating on wrong assumptions for years. The lesson goes beyond the specific ~20-tokens-per-parameter ratio: you can't optimize what you don't measure correctly. Every major lab now trains hundreds of small "scaling experiments" before committing to a large run.
The name? DeepMind named each experiment after an animal. The 70B model happened to be a chinchilla.