Skip to main content
Back to the DSA atlas
Systems & storageProbabilistic systems Data structure Advanced

HyperLogLog

Estimate the number of distinct items in a huge stream with tiny registers and the statistics of rare hash patterns.

Probabilistic systems is nested under Systems & storage , so techniques in the broader family usually still apply here.

cardinalitystreamingprobabilisticrare

Interactive visualization

Use the controls to connect the idea to concrete operations before diving into the full write-up.

hash suffix

001011

leading zeros = 0

hash suffix

000101

leading zeros = 2

hash suffix

000001

leading zeros = 5

hash suffix

000111

leading zeros = 3

Registers keep the maximum leading-zero count they have seen. Those maxima are then combined into a cardinality estimate.

Approximate distinct counting from hash patterns

If a register sees a hash with many leading zeros, that is evidence the stream is large enough that such a rare pattern has probably appeared. HyperLogLog turns those rare-pattern observations into an estimated distinct count.

Family

Systems & storage → Probabilistic systems

Approximate membership, cardinality, and frequency structures for high-volume systems.

Builds on

1 topic

Read these first if you want the surrounding context.

Unlocks

0 next topics

Use these follow-ups when you want to apply the current idea in a richer setting.

Learning paths

1

This topic appears in curated progressions so you can keep moving without guessing the next step.

Choose this over that

Pick the sketch that matches the question

HyperLogLog is great at distinct counting, but it is not a replacement for frequency or membership structures.

Probabilistic systems Open topic

Bloom Filter & Cuckoo Filter

Choose this when: The question is "might this key be present?" before touching slower storage.

Choose something else when: The real metric is distinct-user or unique-key count.

Probabilistic systems Open topic

Count-Min Sketch

Choose this when: You need approximate per-item frequency rather than a global unique count.

Choose something else when: The system only needs distinct cardinality.

Current topic

HyperLogLog

Choose this when: The metric is cardinality and the exact set would be too large to store.

Choose something else when: Membership or frequency still matters.

Problem

Exact distinct counting means storing every unique key seen so far. That is often far too expensive for telemetry, analytics, and large distributed systems.

The question becomes:

can I estimate cardinality without storing the full set?

HyperLogLog is the standard answer.

Intuition

Hash every item so the bit patterns look random. Rare patterns become evidence about how many unique items must have been seen.

If a hash suffix begins with many leading zeros, that event is unlikely. Seeing such a rare event suggests the stream is large.

HyperLogLog spreads this logic across many registers:

  1. use the first b hash bits to choose one of m = 2^b registers
  2. use the remaining bits to measure rho, the index of the first 1 bit
  3. store the maximum rho ever observed in that register

Large cardinality produces larger register maxima.

Core operations

Update

update(x):
    h <- hash64(x)
    j <- first b bits of h          // register index, 0 <= j < m
    w <- remaining bits of h        // suffix
    r <- rho(w)                     // number of leading zeros in w, plus 1
    M[j] <- max(M[j], r)

rho(w) is the core statistic. If w = 000101..., then rho(w) = 4 because the first 1 appears after three leading zeros.

Estimate

HyperLogLog combines the register values with a harmonic-mean style formula:

estimate():
    Z <- sum over j of 2^(-M[j])
    E <- alpha_m * m^2 / Z

    if E is small and many registers are still zero:
        return m * ln(m / V)        // linear counting correction

    return bias-corrected E

Where:

  • V is the number of zero registers
  • alpha_m is a constant depending on m
  • real implementations apply bias correction tables as well

You do not need to memorize the constant to understand the structure. The key idea is that larger register maxima push the estimate upward.

Merge

Two HyperLogLogs with the same register layout merge by taking the register-wise maximum.

merge(A, B):
    for j in 0 .. m-1:
        A[j] <- max(A[j], B[j])

That makes HyperLogLog extremely attractive for distributed analytics.

Worked example

Suppose b = 4, so there are m = 16 registers.

A hashed item might look like:

1011 | 000100101...

  • register index j = 1011 -> register 11
  • suffix starts with three zeros, then a 1
  • so rho = 4
  • update M[11] = max(M[11], 4)

After many inserts, each register stores evidence about how rare the strongest pattern in that bucket has been. HyperLogLog aggregates those signals instead of storing the keys themselves.

Accuracy and parameter choice

HyperLogLog’s relative error is roughly:

1.04m\frac{1.04}{\sqrt{m}}

So more registers mean better accuracy.

Examples:

  • m = 1024 -> about 3.25% relative error
  • m = 16384 -> about 0.8% relative error

The memory cost grows with m, not with the number of distinct keys.

Complexity

OperationTimeSpace
UpdateO(1)O(m) registers
EstimateO(m)O(m)
MergeO(m)O(m)

When to choose HyperLogLog

StructureAnswersTypical memory storyBest for
Hash setExact cardinalityOne entry per distinct keyExact distinct counting
Bloom filterMembershipBit array”Probably present?” questions
Count-Min SketchFrequencyCounter matrixHeavy-hitter and admission logic
HyperLogLogDistinct countFixed register arrayUnique users, unique sessions, unique keys

Key takeaways

  • HyperLogLog estimates cardinality, not membership and not per-item frequency.
  • The structure works because long runs of leading zeros in random hashes are rare and therefore informative.
  • The update rule is tiny: choose a register, compute rho, keep the maximum.
  • Merge is just register-wise max, which is why HyperLogLog scales so well across machines.
  • The accuracy knob is the number of registers m: more registers cost more memory but reduce relative error.

Practice problems

ProblemDifficultyKey idea
HyperLogLog overviewHardRegister maxima from rare hash patterns
HyperLogLog in PracticeHardBias correction and practical engineering details
Redis HyperLogLog docsHardReal-world cardinality estimation API

Relation to other topics

  • Bloom Filter & Cuckoo Filter approximate membership.
  • Count-Min Sketch approximates per-item frequency.
  • HyperLogLog fills the third major sketch niche: approximate distinct counting.

Build on these first

These topics supply the mental model or underlying structure that this page assumes.

Related directions

These topics live nearby conceptually, even if they are not direct prerequisites.

More from Probabilistic systems

Stay in the same family when you want parallel variations of the same mental model.

Paths that include this topic

Follow one of these sequences if you want a guided next step instead of open-ended browsing.

Probabilistic systems

Learn how approximate membership, frequency, and cardinality structures trade tiny errors for massive scale.

From the blog

Pair the atlas with recent writing from the rest of the site when you want a broader engineering perspective alongside the topic graph.