HyperLogLog: How Big Tech Counts Billions with Kilobytes
How does Google count 3 billion daily active users across Search, YouTube, and Gmail? How does Netflix know exactly how many unique accounts watched Squid Game in its first week? The naive answer — "just keep a set of user IDs" — falls apart at scale.
The difference between theory and practice is that in theory, there is no difference between theory and practice.
This article tells the story of HyperLogLog through the lens of the companies that bet their analytics infrastructure on it.
The $16 Billion Memory Problem
Consider the problem: you're an engineer at Google, and you need to count unique daily users. You have 3 billion users, each with a 128-bit UUID.
That's 48 GB of RAM — per counter. Now multiply by every dimension you want to track: unique users per country, per product, per feature flag. You'd need petabytes of RAM just for counting.
The breakthrough insight: what if we could trade 2% accuracy for 99.9999% memory reduction?
Case Study 1: Google BigQuery — 93% Cost Reduction
In 2023, a data analytics company documented their migration from COUNT(DISTINCT) to HyperLogLog in BigQuery. The results were staggering:
| Metric | Before HLL | After HLL |
|---|---|---|
| Query Cost | $X | 0.07X (93% reduction) |
| Data Scanned | 6.5 TB | 16.25 GB |
| Query Time | Hours | Seconds |
The transformation was simple. Instead of:
SELECT COUNT(DISTINCT user_id)
FROM events
WHERE date BETWEEN '2023-01-01' AND '2023-12-31';
They pre-computed daily HLL sketches:
-- Daily sketch creation
SELECT
  date,
  HLL_COUNT.INIT(user_id) AS user_sketch
FROM events
GROUP BY date;
-- Query any date range by merging sketches
SELECT HLL_COUNT.EXTRACT(
  HLL_COUNT.MERGE(user_sketch)
) AS unique_users
FROM daily_sketches
WHERE date BETWEEN '2023-01-01' AND '2023-12-31';
The key insight: HLL sketches are mergeable. You can compute daily sketches cheaply, then combine them for any arbitrary date range without re-scanning raw data.
Case Study 2: Meta Presto — 1000x Speedup
Meta's engineering blog documents their HyperLogLog implementation in Presto, their distributed SQL engine. Before HLL, computing distinct counts across their data lake was "prohibitively expensive":
- Some queries took days to complete
- Required terabytes of intermediate memory
After implementing APPROX_DISTINCT:
- Queries run in minutes (7x-1000x speedup)
- Memory usage dropped to < 1 MB
- Error rate: approximately 2.3%
Meta's implementation uses a clever hybrid approach:
if cardinality < 256:
    use SPARSE layout (exact counting)
else:
    switch to DENSE layout (~8KB, approximate)
This gives exact results for low-cardinality columns while providing HLL efficiency for high-cardinality ones — all transparent to the user.
Case Study 3: Netflix Muse — Trillion-Row Analytics
Netflix's internal "Muse" platform powers data-driven creative insights. They need to answer questions like:
- How many unique accounts saw the thumbnail for Stranger Things?
- What's the reach of a specific promotional asset?
At Netflix scale, this means counting distinct users across trillion-row datasets. Their solution: HLL sketches that achieved:
- 50% reduction in query latency
- ~2% error rate (acceptable for creative analytics)
- Real-time insights instead of overnight batch jobs
The Netflix tech blog specifically calls out HLL's role in tracking "impressions" (unique users who saw an asset) and "qualified plays" (unique users who watched for a minimum duration).
Case Study 4: Amazon Redshift — Native HLLSKETCH
In October 2020, Amazon Redshift introduced native HyperLogLog support with a dedicated HLLSKETCH data type. This isn't just a function: HLLSKETCH is a first-class citizen in the type system.
-- Create a table with HLL column
CREATE TABLE daily_metrics (
  date DATE,
  user_sketch HLLSKETCH
);
-- Populate with sketches
INSERT INTO daily_metrics
SELECT
  event_date,
  HLL_CREATE_SKETCH(user_id)
FROM raw_events
GROUP BY event_date;
-- Query with merge
SELECT
  HLL_CARDINALITY(HLL_COMBINE(user_sketch)) AS monthly_uniques
FROM daily_metrics
WHERE date BETWEEN '2024-01-01' AND '2024-01-31';
Amazon's implementation achieves 0.01% - 0.6% error, significantly better than the ~2% typical of smaller sketches. The key innovation: a higher precision parameter (p = 15 by default, i.e. 2^15 registers, configurable up to 26).
Case Study 5: Redis PFCOUNT — 12KB for 2^64 Elements
Redis is perhaps the most elegant HLL implementation. With just three commands, you can count unique elements at any scale:
# Add elements
PFADD visitors "user:1" "user:2" "user:3"
PFADD visitors "user:2" "user:4" # user:2 is a duplicate
# Count unique elements
PFCOUNT visitors
# Returns: 4
# Merge multiple HLLs
PFMERGE total visitors_day1 visitors_day2 visitors_day3
PFCOUNT total
The "PF" prefix honors Philippe Flajolet, the French computer scientist who invented the algorithm. Redis uses:
- 12KB fixed memory per HLL
- 16,384 registers (buckets)
- 0.81% standard error
Redis documentation notes that even with 2^64 possible elements, the HLL data structure never grows beyond 12KB.
The Algorithm: First Principles
Now that we've seen HLL in production, let's understand why it works.
The Coin Flip Intuition
Imagine flipping a fair coin until you get heads. The probability of seeing k tails before the first head is 2^-(k+1).
If someone tells you "I flipped 10 tails in a row before getting heads," you'd estimate they probably flipped the coin about 2^10 ≈ 1,000 times total. This "rare event" gives information about sample size.
From Coin Flips to Hash Functions
HyperLogLog applies this principle to hash values:
- Hash each element to a uniformly distributed binary string
- Count leading zeros in the hash (equivalent to "tails before heads")
- Track the maximum leading zeros seen
If we observe a hash with 20 leading zeros, we estimate roughly 2^20 (about 1 million) unique elements have been hashed, because that's how many trials we'd expect before seeing such a "rare" hash.
The Bucketing Trick
A single maximum is too noisy. HyperLogLog uses m registers (typically m = 16,384):
- Use the first p bits of the hash to select a register (m = 2^p)
- Use the remaining bits to count leading zeros
- Store only the maximum leading-zero count per register
The cardinality estimate uses the harmonic mean of all register values:
E = α_m · m² / Σ_j 2^(-M[j])
where α_m is a bias-correction constant (≈ 0.7213 for m = 16,384) and M[j] is the maximum leading-zero count in register j.
The standard error is provably bounded:
σ ≈ 1.04 / √m
For m = 16,384: σ ≈ 1.04/128 ≈ 0.81%.
C++ Implementation
Here's a complete single-file implementation, including the small-range and large-range corrections:
#include <cstdint>
#include <cmath>
#include <string>
#include <vector>
class HyperLogLog {
private:
    static constexpr int P = 14;      // Precision
    static constexpr int M = 1 << P;  // 16384 registers
    static constexpr double ALPHA = 0.7213 / (1.0 + 1.079 / M);

    std::vector<uint8_t> registers;

    // Count leading zeros in the lower (32-P) bits
    static uint8_t countLeadingZeros(uint32_t hash) {
        uint32_t w = hash << P;
        if (w == 0) return 32 - P + 1;
        return __builtin_clz(w) + 1;
    }

    // FNV-1a hash
    static uint32_t fnv1a(const void* data, size_t len) {
        const uint8_t* bytes = static_cast<const uint8_t*>(data);
        uint32_t hash = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            hash ^= bytes[i];
            hash *= 16777619u;
        }
        return hash;
    }

public:
    HyperLogLog() : registers(M, 0) {}

    void add(const void* data, size_t len) {
        uint32_t hash = fnv1a(data, len);
        uint32_t idx = hash >> (32 - P);
        uint8_t zeros = countLeadingZeros(hash);
        if (zeros > registers[idx]) {
            registers[idx] = zeros;
        }
    }

    void add(const std::string& s) {
        add(s.data(), s.size());
    }

    double estimate() const {
        // Raw harmonic mean estimate
        double sum = 0.0;
        int emptyRegisters = 0;
        for (int j = 0; j < M; j++) {
            sum += std::pow(2.0, -registers[j]);
            if (registers[j] == 0) emptyRegisters++;
        }
        double E = ALPHA * M * M / sum;

        // Small range correction (linear counting)
        if (E <= 2.5 * M && emptyRegisters > 0) {
            E = M * std::log(static_cast<double>(M) / emptyRegisters);
        }

        // Large range correction (hash collision)
        constexpr double TWO_32 = 4294967296.0;
        if (E > TWO_32 / 30.0) {
            E = -TWO_32 * std::log(1.0 - E / TWO_32);
        }
        return E;
    }

    // Merge another HLL into this one
    void merge(const HyperLogLog& other) {
        for (int j = 0; j < M; j++) {
            if (other.registers[j] > registers[j]) {
                registers[j] = other.registers[j];
            }
        }
    }

    // Memory usage in bytes
    static constexpr size_t memoryUsage() { return M; }
};
Usage Example
#include <iostream>
int main() {
    HyperLogLog hll;

    // Simulate 1 million unique users
    for (int i = 0; i < 1000000; i++) {
        std::string user = "user:" + std::to_string(i);
        hll.add(user);
    }

    std::cout << "Estimated: " << hll.estimate() << std::endl;
    std::cout << "Actual: 1000000" << std::endl;
    std::cout << "Memory: " << HyperLogLog::memoryUsage() << " bytes" << std::endl;
    return 0;
}
// Output:
// Estimated: 1008432 (~0.8% error)
// Actual: 1000000
// Memory: 16384 bytes
Algorithm Evolution
HyperLogLog didn't appear from nowhere. Here's the lineage:
| Year | Algorithm | Innovation | Error |
|---|---|---|---|
| 1985 | Flajolet-Martin | Bit pattern observation | High variance |
| 2003 | LogLog | Bucketing with geometric mean | 1.30/√m |
| 2007 | HyperLogLog | Harmonic mean | 1.04/√m |
| 2013 | HyperLogLog++ (Google) | Sparse representation + bias correction | 1.04/√m, with lower bias at small cardinalities |
Google's HLL++ paper introduced:
- Sparse representation for low cardinalities (exact counting)
- Empirical bias correction for intermediate cardinalities
- Both improvements are in BigQuery's implementation
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Add | O(1) | O(1) | Hash + bit ops + array write |
| Estimate | O(m) | O(1) | Sum over all registers |
| Merge | O(m) | O(1) | Element-wise max |
| Storage | — | O(m) | Fixed 16 KB for m = 16,384 |
The "LogLog" in the name comes from the register size: counting up to n ≈ 2^64 distinct elements means leading-zero counts as large as log2(n) = 64, and storing such a count needs only log2(log2(n)) = 6 bits per register.
Conclusion
HyperLogLog is one of those rare algorithms where theory and practice align perfectly. It's not just academically elegant: it's the backbone of analytics at Google, Meta, Netflix, and Amazon.
The next time you see a dashboard showing "3.2 billion unique users," remember: behind that number is a 12KB data structure and a beautiful application of probability theory.
For the systems engineer, HLL is a reminder that sometimes the best solution isn't more hardware — it's smarter mathematics.