Why "1-bit" LLMs
are actually 1.58-bit
For the past couple of years I kept hearing people talk about "1-bit models." Sounds simple — a bit is a bit, right? Zero or one. But if you dig one layer deeper, the math tells a different story.
In standard deep learning, weights are floating-point numbers — typically 32-bit (FP32) or 16-bit (FP16). A single weight can represent a continuous value like −0.3871 or 1.1420. Quantization is the process of shrinking those values down to use fewer bits, reducing memory and compute at inference time.
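As a concrete (if simplified) sketch, here is one common flavor of post-training quantization: symmetric per-tensor int8. The helper names here are mine, not from any particular library.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 weights onto int8 codes with a single shared scale."""
    scale = max(np.abs(w).max(), 1e-12) / 127.0  # largest |weight| maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([-0.3871, 1.1420, 0.0093], dtype=np.float32)
q, scale = quantize_int8(w)
print(q)                     # [-43 127   1]
print(dequantize(q, scale))  # close to, but not exactly, the originals
```

BitNet pushes this idea to the extreme: instead of 256 int8 levels, each weight gets just three.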
BitNet b1.58 (Microsoft, 2024) constrains every weight to exactly one of three values: −1, 0, or +1. The name comes from the information-theoretic cost of storing one of three equally likely choices: log₂(3) ≈ 1.58 bits per weight.
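The arithmetic is easy to sanity-check:

```python
import math

# Bits needed to encode one of three equally likely values {-1, 0, +1}.
print(math.log2(3))  # ≈ 1.585, hence "1.58-bit"
```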
Why does the zero matter? Binary weight schemes typically use {−1, +1}, which has no way to say "this connection does nothing"; ternary encodes that natively. When a weight is zero, the entire multiply-accumulate can be skipped, and since the remaining weights are ±1, matrix multiplication reduces to additions and subtractions, with no floating-point multiplies at all. On hardware that exploits this, that translates to energy and latency wins, not just memory savings.
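To make that concrete, here is a sketch of a matrix-vector product over ternary weights (mine, not BitNet's actual kernel, which packs weights and vectorizes heavily). The inner loop only adds, subtracts, or skips:

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute W @ x for W in {-1, 0, +1} using no multiplications."""
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                y[i] += x[j]   # +1: pass the input through
            elif W[i, j] == -1:
                y[i] -= x[j]   # -1: negate it
            # 0: skip the multiply-accumulate entirely
    return y

W = np.array([[1, 0, -1],
              [0, 1,  1]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))  # [-2.5  1. ]  == W @ x
```

A real kernel would never loop in Python, but the arithmetic story is the same: sign flips and additions, and every zero weight is work that simply never happens.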
Think of each weight as a switch in the network: +1 passes the signal through, −1 inverts it, and 0 cuts it off entirely. You get directionality, negation, and gating without a single floating-point multiply.