
Why "1-bit" LLMs
are actually 1.58-bit


For the past couple of years I've been hearing people talk about "1-bit models." Sounds simple: a bit is a bit, right? Zero or one. But dig one layer deeper and the math tells a different story.

In standard deep learning, weights are floating-point numbers — typically 32-bit (FP32) or 16-bit (FP16). A single weight can represent a continuous value like −0.3871 or 1.1420. Quantization is the process of shrinking those values down to use fewer bits, reducing memory and compute at inference time.
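To make that concrete, here is a minimal sketch of ternary quantization in the style of the absmean scheme the BitNet b1.58 paper describes: scale the weight matrix by its mean absolute value, then round and clip every entry to {−1, 0, +1}. The function name, the epsilon, and the NumPy usage are my own choices for illustration, not the reference implementation.

import numpy as np

def ternary_quantize(W: np.ndarray, eps: float = 1e-8):
    """Quantize a float weight matrix to {-1, 0, +1} plus one scale.

    Absmean-style: divide by the mean absolute value (gamma), then
    round to the nearest integer and clip into the ternary range.
    The original magnitudes are approximately scale * W_q.
    """
    scale = np.abs(W).mean() + eps               # gamma: mean |W|
    W_q = np.clip(np.round(W / scale), -1, 1)    # round, then clip
    return W_q.astype(np.int8), scale

W = np.array([[-0.3871, 1.1420, 0.02],
              [ 0.6000, -0.9000, 0.10]])
W_q, scale = ternary_quantize(W)
print(W_q)    # [[-1  1  0]
              #  [ 1 -1  0]]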

True binary (1 bit): 2 possible values, 0 or 1. Zero can be represented only by convention; no true zero exists.

BitNet / 1.58-bit: 3 possible values, −1, 0, or +1. True zero matters: it lets the model completely ignore a connection.

BitNet b1.58 (Microsoft, 2024) constrains every weight to exactly one of three values: −1, 0, or +1. The name comes from the information-theoretic cost of storing one of three choices.

// How many bits do you need to represent 3 distinct states?
bits_needed = log₂(3) = 1.58496...

// True binary: log₂(2) = 1.00 bit exactly
// Ternary:     log₂(3) ≈ 1.58 bits per weight

// Example: a 7B-parameter model
FP16 storage     = 7,000,000,000 × 16 bits   = 112 Gbit = 14 GB
1.58-bit storage ≈ 7,000,000,000 × 1.58 bits ≈ 11 Gbit  ≈ 1.4 GB
At a glance:
- FP32: 32 bits per weight
- FP16: 16 bits per weight
- 1.58-bit: exactly log₂(3) ≈ 1.58 bits per weight
- Weight states: 3 (−1 · 0 · +1)
- vs FP16: ~10× smaller footprint
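A natural follow-up: you can't address 1.58 bits of memory. In practice you get close by packing. Since 3⁵ = 243 fits in a single byte, five ternary weights can share 8 bits, which works out to 1.6 bits per weight. Here is a minimal base-3 packing sketch; the function names are mine, and real inference kernels use more hardware-friendly layouts.

import numpy as np

def pack5(trits: np.ndarray) -> int:
    """Pack five ternary weights from {-1, 0, +1} into one byte.

    Shift each trit to {0, 1, 2} and read the group as a base-3
    number: 3**5 = 243 values, which fits in 8 bits (1.6 bits/weight).
    """
    assert trits.shape == (5,)
    byte = 0
    for t in trits:
        byte = byte * 3 + (int(t) + 1)   # base-3 digit in {0, 1, 2}
    return byte

def unpack5(byte: int) -> np.ndarray:
    """Recover the five ternary weights from a packed byte."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)       # digit back to {-1, 0, +1}
        byte //= 3
    return np.array(trits[::-1], dtype=np.int8)

w = np.array([-1, 0, 1, 1, -1], dtype=np.int8)
assert (unpack5(pack5(w)) == w).all()    # round-trips exactly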

True binary (0/1) can only simulate a zero by convention; ternary encodes it natively. When a weight is zero, the whole multiply-accumulate for that connection can be skipped, and when it is ±1 the multiply collapses into an add or a subtract. This is why 1.58-bit models can run their matrix multiplications without any floating-point multiplies. On hardware that exploits it, that translates into energy and latency wins, not just memory savings.
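A toy matrix-vector product makes the multiply-free claim concrete: with ternary weights, the inner loop only ever adds, subtracts, or skips. This sketches the arithmetic idea only; a real kernel would operate on packed weights with vectorized hardware instructions.

import numpy as np

def ternary_matvec(W_q: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W_q @ x where W_q holds only {-1, 0, +1}: no multiplies."""
    y = np.zeros(W_q.shape[0], dtype=x.dtype)
    for i in range(W_q.shape[0]):
        acc = 0.0
        for j in range(W_q.shape[1]):
            w = W_q[i, j]
            if w == 1:        # +1: pass the activation through
                acc += x[j]
            elif w == -1:     # -1: invert it
                acc -= x[j]
            # 0: gate the connection off entirely; skip the work
        y[i] = acc
    return y

W_q = np.array([[1, 0, -1], [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0])
assert np.allclose(ternary_matvec(W_q, x), W_q @ x)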

Think of each weight as a switch on a connection: +1 passes the signal through, −1 inverts it, and 0 kills it entirely. You get directionality, negation, and gating, all without a single floating-point multiply.

The punchline: when people say "1-bit model," they usually mean a model whose weights each take one of {−1, 0, +1}. That's ternary, not binary. The honest name is 1.58-bit, because log₂(3) ≈ 1.585 is exactly how many bits it takes to encode one of three equally likely states. The shorthand "1-bit" stuck for marketing reasons, but the math behind it is more interesting than the name suggests.