The Problem: Counting Is Expensive
Every analytics system eventually hits the same wall. Your product manager asks: "How many unique visitors did we have last month?" Sounds simple. But at scale—billions of events, thousands of queries per second—the naive solution turns into a nightmare.
The textbook approach is a hash set: store every unique item you've seen, and take the size at query time. It's exact, but the cost is brutal. One billion 64-bit IDs consume roughly 8 GB of RAM—per counter. Redis's PFADD/PFCOUNT serves the same query from 12 KB, with less than 1% error. That's a memory reduction of more than 600,000×.
The family of algorithms that solve this is called cardinality estimation, and HyperLogLog (HLL) is the reigning champion—used by Redis, BigQuery, Spark, Presto, Druid, and virtually every modern data warehouse.
How It Works: The Beautiful Trick
HyperLogLog is built on a deceptively simple observation about random binary strings.
The coin-flip intuition: Imagine playing round after round of a game where you flip a fair coin until you get heads. If the longest run of tails you've ever seen is k, that tells you roughly how many rounds you've played, because a run of k tails turns up about once every 2ᵏ rounds. If your longest streak is 3, you've probably played ~8 rounds. If it's 10, you've probably played ~1024. The maximum number of leading zeros across all the hashes you've seen acts exactly like this.
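You can watch this relationship emerge by hashing distinct items and tracking the record zero-streak. A small Python experiment (the choice of blake2b as the hash is incidental, any uniform hash behaves the same):

```python
import hashlib

def hash64(item):
    """Uniform 64-bit hash of an item's string form."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def max_leading_zeros(n):
    """Longest leading-zero streak across hashes of n distinct items."""
    return max(64 - hash64(i).bit_length() for i in range(n))

# The record streak grows roughly like log2(n): hashing ~2^16 items
# should produce a longest streak somewhere near 16 zeros.
for n in (2 ** 8, 2 ** 12, 2 ** 16):
    print(n, max_leading_zeros(n))
```

The record for 2¹⁶ items will hover near 16, give or take a few, which is exactly the "fingerprint of cardinality" the algorithm exploits.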
Here's the core algorithm in four steps:
Apply a uniform hash function to each item. This turns "user_99" into a random-looking 64-bit string like 00001101…
Use the first b bits to route the hash into one of m = 2ᵇ registers. With b=14, you get 16,384 buckets.
In the remaining bits, count the position of the first 1 (equivalently, the number of leading zeros plus one). Store the maximum such value seen in each register.
Combine all registers using a harmonic mean formula to produce the cardinality estimate, with bias correction applied.
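The four steps translate almost line-for-line into code. Here is a minimal Python sketch (an illustration, not Redis's actual implementation; blake2b as the hash and the exact correction thresholds are assumptions):

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, b=14):
        self.b = b
        self.m = 1 << b                          # m = 2^b registers
        self.registers = [0] * self.m
        # bias-correction constant for large m (from the HLL paper)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash64(self, item):
        # Step 1: uniform 64-bit hash
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        h = self._hash64(item)
        idx = h >> (64 - self.b)                 # step 2: first b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)    # step 3: remaining 64-b bits
        rho = (64 - self.b) - rest.bit_length() + 1   # position of the first 1
        if rho > self.registers[idx]:
            self.registers[idx] = rho            # keep the running maximum

    def estimate(self):
        # Step 4: harmonic-mean combination with bias correction
        e = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:          # small-range (linear counting) correction
            e = self.m * math.log(self.m / zeros)
        return e
```

Streaming 100,000 distinct IDs through this sketch should land within a percent or two of the true count, and the register array never grows.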
A Crisp Example
Let's trace through four values with a toy 4-bucket HLL. We use b=2, which means the first 2 bits of every hash select the bucket, and the remaining bits are used to count leading zeros.
That gives us m = 2ᵇ = 2² = 4 registers. Think of m as the number of independent "lanes" we run in parallel. Each lane keeps its own personal record of the longest leading-zero streak it has ever seen. More lanes = more averaging = lower error, but more memory.
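The routing can be traced in a few lines of Python. The four 8-bit "hash" values below are made up for illustration (real hashes would be 64-bit), chosen so that one bucket ends up holding a noticeably larger streak than the others:

```python
b = 2                       # first 2 bits pick the bucket
w = 8 - b                   # remaining 6 bits feed the streak count
registers = [0] * (1 << b)  # m = 4 registers

# Four made-up 8-bit hash values (illustrative only, not real hash output)
toy_hashes = [0b00_110101, 0b01_000110, 0b10_101001, 0b11_010010]

for h in toy_hashes:
    bucket = h >> w                      # top 2 bits: buckets 0, 1, 2, 3
    rest = h & ((1 << w) - 1)            # low 6 bits
    rho = w - rest.bit_length() + 1      # position of the first 1 bit
    registers[bucket] = max(registers[bucket], rho)

print(registers)            # → [1, 4, 1, 2]

alpha = 0.532               # the article's bias constant for m = 4
est = alpha * 16 / sum(2.0 ** -r for r in registers)
print(round(est, 2))        # → 6.49
```

The estimate overshoots the true count of 4, which is expected: with only four registers the variance is huge, and this is exactly why the paper adds small-range corrections and real deployments use thousands of buckets.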
The raw estimate is E = α · m² · (Σᵢ 2^(−max[i]))⁻¹. Breaking down the formula term by term:
α is a small bias-correction constant, mathematically derived to cancel a systematic overcount introduced by the harmonic mean. For m = 4 it equals roughly 0.532; for large m it converges to ~0.7213. You look it up once and bake it in.
m² scales the estimate up because each register only saw 1/m of the stream (items are split across buckets). Squaring m compensates for both the splitting and the normalisation inside the harmonic mean.
(Σᵢ 2^(−max[i]))⁻¹ is the core statistical engine. For each register i, we compute 2^max[i], the implied cardinality if that register were the only one. Summing the reciprocals of these m values and inverting is a harmonic mean in disguise (the missing factor of m is absorbed into the m² out front), so the combined term is dominated by the typical registers rather than the extreme ones.
Notice how bucket 1's outlier value of 16 barely moves the needle. If we had used an arithmetic mean instead, bucket 1 would dominate and the estimate would balloon to ~10. The harmonic mean's resistance to outliers is precisely why it was chosen—a single lucky hash with 20 leading zeros won't ruin your count.
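The difference is easy to check numerically. Suppose one of four registers caught a freak streak of 20 (register values assumed for illustration):

```python
# Implied per-register cardinalities 2^max[i], with one freak register
implied = [2 ** r for r in [1, 20, 1, 2]]
m = len(implied)

arithmetic = sum(implied) / m
harmonic = m / sum(1 / v for v in implied)

print(arithmetic)   # ≈ 262146: the outlier dominates completely
print(harmonic)     # ≈ 3.2: the outlier barely registers
```

The harmonic mean stays anchored near the three well-behaved registers, which is the whole point of choosing it.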
On top of the raw formula, Redis applies additional small-range and large-range corrections, hitting a standard error of ±0.81% from only 12 KB of memory.
b=14 → 16,384 buckets. Each bucket needs 6 bits to store the max leading-zero count (values 0–63). That's 16,384 × 6 bits = 98,304 bits ≈ 12 KB. Exact. Fixed. Forever—regardless of whether you've seen 100 or 100 billion distinct elements.
HLL is also mergeable: to combine two HLL sketches (say, Monday's and Tuesday's unique visitors), you take the per-bucket maximum. The merged sketch correctly estimates the union. This makes distributed computation trivial—run HLL on each shard, merge the registers, done.
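The merge property can be sketched directly on register arrays (a self-contained illustration; blake2b hashing and b=14 mirror the configuration described above but are assumptions, not Redis internals):

```python
import hashlib

B = 14                      # 2^14 = 16,384 registers, as in the 12 KB setup
M = 1 << B

def hash64(item):
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(items):
    """Build the register array for a stream of items."""
    regs = [0] * M
    for item in items:
        h = hash64(item)
        idx = h >> (64 - B)
        rest = h & ((1 << (64 - B)) - 1)
        rho = (64 - B) - rest.bit_length() + 1
        if rho > regs[idx]:
            regs[idx] = rho
    return regs

def merge(a, b):
    """Union of two sketches: per-register maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(regs):
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in regs)

monday = sketch(f"user_{i}" for i in range(0, 60000))
tuesday = sketch(f"user_{i}" for i in range(40000, 100000))  # 20,000 repeat visitors
combined = merge(monday, tuesday)
# combined now estimates |Monday ∪ Tuesday| = 100,000 unique visitors
```

Because each register is a running maximum, the merge is lossless: the merged registers are bit-identical to a sketch built over the union directly, so shards can be combined in any order.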
Interactive Demo
Here's a live HyperLogLog running in your browser. Stream in random elements, watch the registers update, and see how close the estimate stays to the true count. About 15% of insertions re-insert existing elements to simulate real-world duplicates.
Conclusion
HyperLogLog is one of the most elegant algorithms in computer science. It trades a small, bounded amount of accuracy for an astronomical reduction in memory—turning an O(n) problem into O(1). The core insight—that the maximum number of leading zeros in a hash function behaves like a fingerprint of cardinality—is the kind of idea that makes you want to go outside and stare at the sky for a while.
If you're building systems that need to answer "how many distinct things?", HLL belongs in your toolkit. It's production-hardened, mergeable across shards, incrementally updatable, and provably correct in its error bounds. Redis exposes it out of the box with PFADD and PFCOUNT. Google BigQuery uses it behind APPROX_COUNT_DISTINCT(). Apache Flink ships it as a native window aggregation. The algorithm is 20 years old and it's still the answer.
The next time someone asks "how many unique users?"—you know exactly which twelve kilobytes to reach for.