December 25, 2025 · 12 min read

Bloom Filters: The Story of Knowing What You Don't Know

System Design · Algorithms · C++ · Big Data

In 1970, a computer scientist named Burton Howard Bloom published a five-page paper in the Communications of the ACM titled "Space/Time Trade-offs in Hash Coding with Allowable Errors." It described a deceptively simple data structure that would, decades later, become foundational to internet-scale systems.

The question is not whether a Bloom filter will give you wrong answers. It will. The question is whether those wrong answers matter.

This is the story of how a probabilistic data structure designed to save tape storage in the 1970s became essential to Google, Amazon, Facebook, Ethereum, and nearly every database you use today.


The Origin: A 1970s Tape Storage Problem

Burton Bloom was working on a problem that seems quaint today: hyphenation dictionaries. Specifically, how do you store 500,000 English words for a hyphenation algorithm without using too much expensive magnetic tape?

The conventional solution was hash coding — converting each word to a hash and storing it in a table. But hash tables have a problem: collisions. When two words hash to the same location, you need either:

  • More space (larger table)
  • More time (collision resolution)

Bloom asked a different question: What if we accepted some errors?

His insight was radical for its time. In many applications, a small probability of "false positives" (saying a word exists when it doesn't) is acceptable, as long as you never have "false negatives" (saying a word doesn't exist when it does).

The Original Algorithm

Bloom's solution used a bit array and multiple hash functions:

  1. Create a bit array of m bits, all initially 0
  2. Use k independent hash functions, each mapping elements to positions [0, m-1]
  3. To add element x: set bits at positions h_1(x), h_2(x), ..., h_k(x) to 1
  4. To query element x: check if all bits at those positions are 1

If all bits are 1, the element is "probably" in the set. If any bit is 0, the element is "definitely not" in the set.

The paper provided the mathematical analysis:

p = \left(1 - e^{-kn/m}\right)^k

Where p is the false positive probability, n is the number of elements, m is the bit array size, and k is the number of hash functions.

This single equation defined the space/time trade-off that would power systems for the next 50+ years.
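The trade-off is easy to explore numerically. Here is a small sketch that evaluates Bloom's formula directly (the function name is my own):

```cpp
#include <cmath>

// Evaluate Bloom's formula: p = (1 - e^(-kn/m))^k
// n = elements inserted, m = bits in the array, k = hash functions.
double falsePositiveRate(double n, double m, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

For example, 1,000,000 elements in 9,600,000 bits (9.6 bits per element) with k = 7 gives p ≈ 0.01, i.e. about a 1% false positive rate.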


The Database Revolution: LevelDB, RocksDB, and Cassandra

For nearly three decades, Bloom filters remained a curiosity. This changed with the internet and the rise of Big Data.

The LSM-Tree Read Amplification Problem

In 1996, the Log-Structured Merge-Tree (LSM-Tree) was introduced. It powered write-heavy databases but suffered from read amplification. Data is spread across multiple sorted files (SSTables). To find a key, you might need to check every file:

Read Query: "Does user:12345 exist?"

Check Memtable       → Not found
Check SSTable Level 0 → Not found  
Check SSTable Level 1 → Not found
...
Check SSTable Level N → Not found

Total: N disk reads for a negative result

For a key that doesn't exist, you've done maximum work for zero benefit.

Google's LevelDB & Facebook's RocksDB

Google engineers Jeff Dean and Sanjay Ghemawat added Bloom filters to LevelDB's SSTables. Before reading a file, LevelDB checks its Bloom filter:

  • If the filter says "definitely not here" → skip the file
  • If the filter says "maybe here" → read the file

With a 1% false positive rate, 99% of negative queries avoid the disk read entirely.
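The read path can be sketched as follows. This is an illustrative toy, not LevelDB's code: the filter is a single-hash bit array standing in for a real Bloom filter, and an in-memory map stands in for the on-disk SSTable file.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a per-SSTable Bloom filter (one hash function, for brevity).
struct TinyFilter {
    std::vector<bool> bits = std::vector<bool>(1 << 16, false);
    size_t idx(const std::string& k) const { return std::hash<std::string>{}(k) % bits.size(); }
    void add(const std::string& k) { bits[idx(k)] = true; }
    bool possiblyContains(const std::string& k) const { return bits[idx(k)]; }
};

struct SSTable {
    TinyFilter filter;
    std::map<std::string, std::string> data;  // stands in for the on-disk file
    void put(const std::string& k, const std::string& v) { filter.add(k); data[k] = v; }
};

size_t diskReads = 0;  // count how often we actually touch a "file"

std::optional<std::string> lsmGet(const std::vector<SSTable>& levels, const std::string& key) {
    for (const auto& sst : levels) {
        if (!sst.filter.possiblyContains(key)) continue;  // "definitely not here": skip the file
        ++diskReads;                                      // "maybe here": pay for the read
        auto it = sst.data.find(key);
        if (it != sst.data.end()) return it->second;
    }
    return std::nullopt;  // not found at any level
}
```

For a key that exists, at least one level pays the read; for most absent keys, every level is skipped and `diskReads` never moves.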

Facebook later forked LevelDB to create RocksDB, enhancing the implementation with "Full Filters" (one filter per SSTable) for faster lookups, powering systems at Netflix, Uber, and LinkedIn.

Apache Cassandra

Facebook also built Cassandra for Inbox Search (billions of messages). It relies on the same optimization:

// Cassandra's read path (simplified)
public boolean isPresent(FilterKey key) {
    for (long index : indexes(key)) {          // one bit position per hash function
        if (!bitset.get(index)) return false;  // Definitely not present
    }
    return true; // Possibly present, check disk
}

Cassandra documentation explicitly states: "Bloom filters save I/O reads by allowing the database to skip reading data files that do not contain the requested partition."


Google Bigtable and Cloud Scale

Google's Bigtable paper made Bloom filters famous in the systems community:

We use Bloom filters to reduce the number of disk accesses for read operations. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.

Engineering Optimizations at Scale

Google Cloud recently published notes on their modern Bloom filter optimizations:

  1. Hybrid Bloom Filters: Solved low utilization by building filters that can answer queries that specify only a column family, rather than an exact column. Result: 4x increase in utilization.
  2. Bloom Filter Index Cache: A custom in-memory index to access filters without binary search overhead. Result: 20-50% improvement in single-row read throughput.

Beyond Databases: The Bloom Filter Everywhere

Akamai CDN: The "One-Hit-Wonder" Gate

Akamai discovered that 75% of cached objects are accessed exactly once. Caching these wastes disk space.

The Solution: A Bloom filter gate.

  • First request: Check filter. Not found? Add to filter, but don't cache. Serve from origin.
  • Second request: Check filter. Found? Cache it.

This "second-chance" caching reduces disk writes by 75% and extends SSD lifespan.
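The admission logic fits in a few lines. In this sketch a std::unordered_set stands in for the Bloom filter so the logic stays visible; a real deployment would use a Bloom filter precisely so the "seen once" set costs a few bits per URL (the type and method names here are illustrative):

```cpp
#include <string>
#include <unordered_set>

struct SecondChanceGate {
    std::unordered_set<std::string> seenOnce;  // stand-in for the Bloom filter

    // Returns true if the object should be admitted to the disk cache.
    bool shouldCache(const std::string& url) {
        if (seenOnce.count(url)) return true;  // second request: admit to cache
        seenOnce.insert(url);                  // first request: remember it, don't cache
        return false;
    }
};
```

One-hit wonders never pass the gate, so they never trigger a disk write.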

Google Chrome: Safe Browsing

Chrome protects users from malicious URLs without sending every visited site to Google (privacy).

  • Chrome downloads a compact Bloom filter of known hash prefixes.
  • When you visit a site, it checks the local filter.
  • Not in filter? Safe. (99.9% of cases).
  • In filter? Only then does it contact Google for a full check.
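The flow can be sketched like this. Everything here is a simplified stand-in, not Chrome's actual API: the prefix set replaces the downloaded filter, and the server check is faked locally.

```cpp
#include <functional>
#include <string>
#include <unordered_set>

struct SafeBrowsingClient {
    std::unordered_set<size_t> localPrefixes;  // stand-in for the downloaded filter
    size_t serverChecks = 0;                   // how often we had to phone home

    bool isMalicious(const std::string& url) {
        size_t prefix = std::hash<std::string>{}(url) % (1 << 20);
        if (!localPrefixes.count(prefix)) return false;  // definitely safe: no network call
        ++serverChecks;                                  // possible hit: confirm with server
        return fullServerCheck(url);
    }

    // Pretend confirmation service; always flags the URL in this sketch.
    bool fullServerCheck(const std::string&) const { return true; }
};
```

The privacy win is that the vast majority of URLs never leave the machine: only filter hits trigger a round-trip.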

Ethereum: Log Filtering

Every Ethereum block header contains a 2048-bit Bloom filter indexing addresses and event topics.

  • When querying for "USDC Transfer events" across millions of blocks, nodes check the header Bloom filter first.
  • If the filter says "no USDC events here," the entire block processing is skipped.

PostgreSQL: bloom Index

PostgreSQL 9.6 introduced bloom indexes for wide tables. Instead of 6 separate B-tree indexes for name, email, phone, etc., you create one bloom index. It supports any combination of equality queries (WHERE name='Alice' or WHERE zip='10001') with a fraction of the space.


First Principles: The Mathematics

Let's derive why this works.

False Positive Probability

The probability that a specific bit is NOT set by one hash function is 1 - 1/m. After inserting n elements, each setting k bits, the probability that a given bit is still 0 is (1 - 1/m)^{kn} \approx e^{-kn/m}.

For a query to be a false positive, all k probed bits must be 1 by pure chance. The probability p is:

p \approx \left(1 - e^{-kn/m}\right)^k

Optimal Parameters

To minimize false positives for a given size m and count n, the optimal number of hash functions k is:

k_{\text{opt}} = \frac{m}{n} \ln 2 \approx 0.693\,\frac{m}{n}

Space required for a target false positive rate p:

m \approx -1.44\, n \log_2 p

Rule of Thumb:

  • 1% false positive rate requires 9.6 bits per element.
  • 0.1% false positive rate requires 14.4 bits per element.
  • A standard hash set requires 64+ bits per element. Bloom filters offer an ~85% memory reduction.
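These rules of thumb follow directly from the formulas above; here is a sizing sketch (the helper names are my own):

```cpp
#include <cmath>
#include <cstddef>

// Bits per element for a target false positive rate p: m/n = -log2(p) / ln 2
double bitsPerElement(double p) { return -std::log2(p) / std::log(2.0); }

// Optimal hash count for a given bits-per-element ratio: k = (m/n) ln 2
std::size_t optimalK(double bitsPerElem) {
    return static_cast<std::size_t>(std::round(bitsPerElem * std::log(2.0)));
}
```

bitsPerElement(0.01) ≈ 9.6 with k = 7, and bitsPerElement(0.001) ≈ 14.4 with k = 10, matching the rule of thumb.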

Complete C++ Implementation

Here is a production-grade implementation using MurmurHash3-style mixing and enhanced double hashing:

#include <cstdint>
#include <cmath>
#include <vector>
#include <string>
#include <array>
#include <stdexcept>
#include <cstring>
#include <algorithm>

class BloomFilter {
private:
    std::vector<uint64_t> bits_;  // Bit array stored as 64-bit words
    size_t numBits_;              // Total number of bits
    size_t numHashes_;            // Number of hash functions
    size_t numElements_;          // Elements inserted (for statistics)
    
    // MurmurHash3 64-bit finalizer
    static uint64_t murmurFinalize(uint64_t h) {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }
    
    // Generate k hash values using enhanced double hashing
    void getHashes(const void* data, size_t len, uint64_t* hashes) const {
        // Compute two base hashes using MurmurHash3-style mixing
        uint64_t h1 = 0x9e3779b97f4a7c15ULL;  // Golden ratio
        uint64_t h2 = 0xbf58476d1ce4e5b9ULL;  // Random prime
        
        const uint8_t* bytes = static_cast<const uint8_t*>(data);
        for (size_t i = 0; i < len; i++) {
            h1 ^= bytes[i];
            h1 *= 0x87c37b91114253d5ULL;
            h1 = (h1 << 31) | (h1 >> 33);
            
            h2 ^= bytes[i];
            h2 *= 0x4cf5ad432745937fULL;
            h2 = (h2 << 29) | (h2 >> 35);
        }
        
        h1 = murmurFinalize(h1);
        h2 = murmurFinalize(h2);
        
        // Generate k hashes
        for (size_t i = 0; i < numHashes_; i++) {
            hashes[i] = (h1 + i * h2 + i * i) % numBits_;
        }
    }
    
    void setBit(size_t pos) {
        bits_[pos / 64] |= (1ULL << (pos % 64));
    }
    
    bool getBit(size_t pos) const {
        return (bits_[pos / 64] & (1ULL << (pos % 64))) != 0;
    }

public:
    // Create filter for expectedElements with targetFpRate false positive rate
    BloomFilter(size_t expectedElements, double targetFpRate = 0.01) 
        : numElements_(0) {
        
        double ln2sq = 0.4804530139182014;  // (ln 2)^2
        double m = -static_cast<double>(expectedElements) * std::log(targetFpRate) / ln2sq;
        numBits_ = static_cast<size_t>(std::ceil(m));
        numBits_ = std::max(numBits_, size_t(64));
        
        double k = (static_cast<double>(numBits_) / expectedElements) * 0.6931471805599453;
        numHashes_ = static_cast<size_t>(std::round(k));
        numHashes_ = std::max(numHashes_, size_t(1));
        numHashes_ = std::min(numHashes_, size_t(32));
        
        bits_.resize((numBits_ + 63) / 64, 0);
    }
    
    void add(const std::string& s) {
        std::array<uint64_t, 32> hashes;
        getHashes(s.data(), s.size(), hashes.data());
        for (size_t i = 0; i < numHashes_; i++) setBit(hashes[i]);
        numElements_++;
    }
    
    bool possiblyContains(const std::string& s) const {
        std::array<uint64_t, 32> hashes;
        getHashes(s.data(), s.size(), hashes.data());
        for (size_t i = 0; i < numHashes_; i++) {
            if (!getBit(hashes[i])) return false;
        }
        return true;
    }
    
    size_t sizeInBytes() const { return bits_.size() * 8; }
};

Evolution: Counting, Cuckoo, and Ribbon Filters

The standard Bloom filter was just the beginning.

  1. Counting Bloom Filter (1998): Replaced bits with counters to allow deletion.
  2. Cuckoo Filter (2014): Uses Cuckoo hashing for better space efficiency and deletion support. Now available in Redis Stack.
  3. Ribbon Filter (2021): Developed at Facebook, roughly 30% smaller than a Bloom filter at the same accuracy, and used in RocksDB.

Conclusion

Burton Bloom's 1970 paper asked a simple question: "What if we accept some errors?" More than half a century later, that question powers systems serving billions of users.

For the systems engineer, the Bloom filter is a reminder: sometimes the most valuable thing a system can do is nothing at all.
