HyperLogLog
Estimate the number of distinct items in a huge stream with tiny registers and the statistics of rare hash patterns.
Probabilistic systems is nested under Systems & storage , so techniques in the broader family usually still apply here.
Interactive visualization
Use the controls to connect the idea to concrete operations before diving into the full write-up.
hash suffix
001011
leading zeros = 0
hash suffix
000101
leading zeros = 2
hash suffix
000001
leading zeros = 5
hash suffix
000111
leading zeros = 3
Approximate distinct counting from hash patterns
If a register sees a hash with many leading zeros, that is evidence the stream is large enough that such a rare pattern has probably appeared. HyperLogLog turns those rare-pattern observations into an estimated distinct count.
Family
Systems & storage → Probabilistic systems
Approximate membership, cardinality, and frequency structures for high-volume systems.
Builds on
1 topic
Read these first if you want the surrounding context.
Unlocks
0 next topics
Use these follow-ups when you want to apply the current idea in a richer setting.
Learning paths
1
This topic appears in curated progressions so you can keep moving without guessing the next step.
Choose this over that
Pick the sketch that matches the question
HyperLogLog is great at distinct counting, but it is not a replacement for frequency or membership structures.
Bloom Filter & Cuckoo Filter
Choose this when: The question is "might this key be present?" before touching slower storage.
Choose something else when: The real metric is distinct-user or unique-key count.
Count-Min Sketch
Choose this when: You need approximate per-item frequency rather than a global unique count.
Choose something else when: The system only needs distinct cardinality.
HyperLogLog
Choose this when: The metric is cardinality and the exact set would be too large to store.
Choose something else when: Membership or frequency still matters.
Problem
Exact distinct counting means storing every unique key seen so far. That is often far too expensive for telemetry, analytics, and large distributed systems.
The question becomes:
can I estimate cardinality without storing the full set?
HyperLogLog is the standard answer.
Intuition
Hash every item so the bit patterns look random. Rare patterns become evidence about how many unique items must have been seen.
If a hash suffix begins with many leading zeros, that event is unlikely. Seeing such a rare event suggests the stream is large.
HyperLogLog spreads this logic across many registers:
- use the first
bhash bits to choose one ofm = 2^bregisters - use the remaining bits to measure
rho, the index of the first1bit - store the maximum
rhoever observed in that register
Large cardinality produces larger register maxima.
Core operations
Update
update(x):
h <- hash64(x)
j <- first b bits of h // register index, 0 <= j < m
w <- remaining bits of h // suffix
r <- rho(w) // number of leading zeros in w, plus 1
M[j] <- max(M[j], r)
rho(w) is the core statistic. If w = 000101..., then rho(w) = 4 because the first 1 appears after three leading zeros.
Estimate
HyperLogLog combines the register values with a harmonic-mean style formula:
estimate():
Z <- sum over j of 2^(-M[j])
E <- alpha_m * m^2 / Z
if E is small and many registers are still zero:
return m * ln(m / V) // linear counting correction
return bias-corrected E
Where:
Vis the number of zero registersalpha_mis a constant depending onm- real implementations apply bias correction tables as well
You do not need to memorize the constant to understand the structure. The key idea is that larger register maxima push the estimate upward.
Merge
Two HyperLogLogs with the same register layout merge by taking the register-wise maximum.
merge(A, B):
for j in 0 .. m-1:
A[j] <- max(A[j], B[j])
That makes HyperLogLog extremely attractive for distributed analytics.
Worked example
Suppose b = 4, so there are m = 16 registers.
A hashed item might look like:
1011 | 000100101...
- register index
j = 1011-> register 11 - suffix starts with three zeros, then a
1 - so
rho = 4 - update
M[11] = max(M[11], 4)
After many inserts, each register stores evidence about how rare the strongest pattern in that bucket has been. HyperLogLog aggregates those signals instead of storing the keys themselves.
Accuracy and parameter choice
HyperLogLog’s relative error is roughly:
So more registers mean better accuracy.
Examples:
m = 1024-> about 3.25% relative errorm = 16384-> about 0.8% relative error
The memory cost grows with m, not with the number of distinct keys.
Complexity
| Operation | Time | Space |
|---|---|---|
| Update | O(1) | O(m) registers |
| Estimate | O(m) | O(m) |
| Merge | O(m) | O(m) |
When to choose HyperLogLog
| Structure | Answers | Typical memory story | Best for |
|---|---|---|---|
| Hash set | Exact cardinality | One entry per distinct key | Exact distinct counting |
| Bloom filter | Membership | Bit array | ”Probably present?” questions |
| Count-Min Sketch | Frequency | Counter matrix | Heavy-hitter and admission logic |
| HyperLogLog | Distinct count | Fixed register array | Unique users, unique sessions, unique keys |
Key takeaways
- HyperLogLog estimates cardinality, not membership and not per-item frequency.
- The structure works because long runs of leading zeros in random hashes are rare and therefore informative.
- The update rule is tiny: choose a register, compute
rho, keep the maximum. - Merge is just register-wise
max, which is why HyperLogLog scales so well across machines. - The accuracy knob is the number of registers
m: more registers cost more memory but reduce relative error.
Practice problems
| Problem | Difficulty | Key idea |
|---|---|---|
| HyperLogLog overview | Hard | Register maxima from rare hash patterns |
| HyperLogLog in Practice | Hard | Bias correction and practical engineering details |
| Redis HyperLogLog docs | Hard | Real-world cardinality estimation API |
Relation to other topics
- Bloom Filter & Cuckoo Filter approximate membership.
- Count-Min Sketch approximates per-item frequency.
- HyperLogLog fills the third major sketch niche: approximate distinct counting.
Build on these first
These topics supply the mental model or underlying structure that this page assumes.
Related directions
These topics live nearby conceptually, even if they are not direct prerequisites.
Bloom Filter & Cuckoo Filter
Use tiny approximate membership structures to reject absent keys cheaply before touching slower storage or larger indexes.
Count-Min Sketch
Estimate item frequencies in a stream with a tiny counter matrix, a few hash functions, and a guaranteed one-sided error.
More from Probabilistic systems
Stay in the same family when you want parallel variations of the same mental model.
Bloom Filter & Cuckoo Filter
Use tiny approximate membership structures to reject absent keys cheaply before touching slower storage or larger indexes.
Count-Min Sketch
Estimate item frequencies in a stream with a tiny counter matrix, a few hash functions, and a guaranteed one-sided error.
Paths that include this topic
Follow one of these sequences if you want a guided next step instead of open-ended browsing.
Probabilistic systems
Learn how approximate membership, frequency, and cardinality structures trade tiny errors for massive scale.
From the blog
Pair the atlas with recent writing from the rest of the site when you want a broader engineering perspective alongside the topic graph.