Systems & storage

Probabilistic systems

Data structure

Advanced

HyperLogLog

Estimate the number of distinct items in a huge stream with tiny registers and the statistics of rare hash patterns.

Probabilistic systems is nested under Systems & storage , so techniques in the broader family usually still apply here.

cardinality

streaming

probabilistic

rare

Interactive visualization

Use the controls to connect the idea to concrete operations before diving into the full write-up.

hash suffix

001011

leading zeros = 0

hash suffix

000101

leading zeros = 2

hash suffix

000001

leading zeros = 5

hash suffix

000111

leading zeros = 3

Registers keep the maximum leading-zero count they have seen. Those maxima are then combined into a cardinality estimate.

Approximate distinct counting from hash patterns

If a register sees a hash with many leading zeros, that is evidence the stream is large enough that such a rare pattern has probably appeared. HyperLogLog turns those rare-pattern observations into an estimated distinct count.

Family

Systems & storage → Probabilistic systems

Approximate membership, cardinality, and frequency structures for high-volume systems.

Builds on

1 topic

Read these first if you want the surrounding context.

Unlocks

0 next topics

Use these follow-ups when you want to apply the current idea in a richer setting.

Learning paths

This topic appears in curated progressions so you can keep moving without guessing the next step.

Choose this over that

Pick the sketch that matches the question

HyperLogLog is great at distinct counting, but it is not a replacement for frequency or membership structures.

Probabilistic systems Open topic

Bloom Filter & Cuckoo Filter

Choose this when: The question is "might this key be present?" before touching slower storage.

Choose something else when: The real metric is distinct-user or unique-key count.

Probabilistic systems Open topic

Count-Min Sketch

Choose this when: You need approximate per-item frequency rather than a global unique count.

Choose something else when: The system only needs distinct cardinality.

Current topic

HyperLogLog

Choose this when: The metric is cardinality and the exact set would be too large to store.

Choose something else when: Membership or frequency still matters.

Problem

Exact distinct counting means storing every unique key seen so far. That is often far too expensive for telemetry, analytics, and large distributed systems.

The question becomes:

can I estimate cardinality without storing the full set?

HyperLogLog is the standard answer.

Intuition

Hash every item so the bit patterns look random. Rare patterns become evidence about how many unique items must have been seen.

If a hash suffix begins with many leading zeros, that event is unlikely. Seeing such a rare event suggests the stream is large.

HyperLogLog spreads this logic across many registers:

use the first b hash bits to choose one of m = 2^b registers
use the remaining bits to measure rho, the index of the first 1 bit
store the maximum rho ever observed in that register

Large cardinality produces larger register maxima.

Core operations

Update

update(x):
    h <- hash64(x)
    j <- first b bits of h          // register index, 0 <= j < m
    w <- remaining bits of h        // suffix
    r <- rho(w)                     // number of leading zeros in w, plus 1
    M[j] <- max(M[j], r)

rho(w) is the core statistic. If w = 000101..., then rho(w) = 4 because the first 1 appears after three leading zeros.

Estimate

HyperLogLog combines the register values with a harmonic-mean style formula:

estimate():
    Z <- sum over j of 2^(-M[j])
    E <- alpha_m * m^2 / Z

    if E is small and many registers are still zero:
        return m * ln(m / V)        // linear counting correction

    return bias-corrected E

Where:

V is the number of zero registers
alpha_m is a constant depending on m
real implementations apply bias correction tables as well

You do not need to memorize the constant to understand the structure. The key idea is that larger register maxima push the estimate upward.

Merge

Two HyperLogLogs with the same register layout merge by taking the register-wise maximum.

merge(A, B):
    for j in 0 .. m-1:
        A[j] <- max(A[j], B[j])

That makes HyperLogLog extremely attractive for distributed analytics.

Worked example

Suppose b = 4, so there are m = 16 registers.

A hashed item might look like:

1011 | 000100101...

register index j = 1011 -> register 11
suffix starts with three zeros, then a 1
so rho = 4
update M[11] = max(M[11], 4)

After many inserts, each register stores evidence about how rare the strongest pattern in that bucket has been. HyperLogLog aggregates those signals instead of storing the keys themselves.

Accuracy and parameter choice

HyperLogLog’s relative error is roughly:

\frac{1.04}{\sqrt{m}}

So more registers mean better accuracy.

Examples:

m = 1024 -> about 3.25% relative error
m = 16384 -> about 0.8% relative error

The memory cost grows with m, not with the number of distinct keys.

Complexity

Operation	Time	Space
Update	`O(1)`	`O(m)` registers
Estimate	`O(m)`	`O(m)`
Merge	`O(m)`	`O(m)`

When to choose HyperLogLog

Structure	Answers	Typical memory story	Best for
Hash set	Exact cardinality	One entry per distinct key	Exact distinct counting
Bloom filter	Membership	Bit array	”Probably present?” questions
Count-Min Sketch	Frequency	Counter matrix	Heavy-hitter and admission logic
HyperLogLog	Distinct count	Fixed register array	Unique users, unique sessions, unique keys

Key takeaways

HyperLogLog estimates cardinality, not membership and not per-item frequency.
The structure works because long runs of leading zeros in random hashes are rare and therefore informative.
The update rule is tiny: choose a register, compute rho, keep the maximum.
Merge is just register-wise max, which is why HyperLogLog scales so well across machines.
The accuracy knob is the number of registers m: more registers cost more memory but reduce relative error.

Practice problems

Problem	Difficulty	Key idea
HyperLogLog overview	Hard	Register maxima from rare hash patterns
HyperLogLog in Practice	Hard	Bias correction and practical engineering details
Redis HyperLogLog docs	Hard	Real-world cardinality estimation API