Bloom Filters: The Story of Knowing What You Don't Know
In 1970, a computer scientist named Burton Howard Bloom published a five-page paper in the Communications of the ACM titled "Space/Time Trade-offs in Hash Coding with Allowable Errors." It described a deceptively simple data structure that would, decades later, become foundational to internet-scale systems.
The question is not whether a Bloom filter will give you wrong answers. It will. The question is whether those wrong answers matter.
This is the story of how a probabilistic data structure designed to save tape storage in the 1970s became essential to Google, Amazon, Facebook, Ethereum, and nearly every database you use today.
The Origin: A 1970s Tape Storage Problem
Burton Bloom was working on a problem that seems quaint today: hyphenation dictionaries. Specifically, how do you store 500,000 English words for a hyphenation algorithm without using too much expensive magnetic tape?
The conventional solution was hash coding — converting each word to a hash and storing it in a table. But hash tables have a problem: collisions. When two words hash to the same location, you need either:
- More space (larger table)
- More time (collision resolution)
Bloom asked a different question: What if we accepted some errors?
His insight was radical for its time. In many applications, a small probability of "false positives" (saying a word exists when it doesn't) is acceptable, as long as you never have "false negatives" (saying a word doesn't exist when it does).
The Original Algorithm
Bloom's solution used a bit array and multiple hash functions:
- Create a bit array of m bits, all initially 0
- Use k independent hash functions h1, ..., hk, each mapping elements to positions in the array
- To add element x: set the bits at positions h1(x), ..., hk(x) to 1
- To query element x: check whether all k bits at those positions are 1
If all bits are 1, the element is "probably" in the set. If any bit is 0, the element is "definitely not" in the set.
The paper provided the mathematical analysis:

p ≈ (1 - e^(-kn/m))^k

where p is the false positive probability, n is the number of elements, m is the bit array size, and k is the number of hash functions.
This single equation defined the space/time trade-off that would power systems for the next 50+ years.
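The trade-off can be evaluated directly; a minimal sketch (the function name is mine, not from the paper), written in terms of bits per element r = m/n:

```cpp
#include <cmath>

// p ≈ (1 - e^(-kn/m))^k, expressed via bits per element r = m/n.
double bloomFpRate(double bitsPerElement, double numHashes) {
    return std::pow(1.0 - std::exp(-numHashes / bitsPerElement), numHashes);
}
```

At 9.6 bits per element with k = 7 hash functions, this evaluates to roughly 0.01, the classic 1% operating point.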
The Database Revolution: LevelDB, RocksDB, and Cassandra
For nearly three decades, Bloom filters remained a curiosity. This changed with the internet and the rise of Big Data.
The LSM-Tree Read Amplification Problem
In 1996, the Log-Structured Merge-Tree (LSM-Tree) was introduced. It powered write-heavy databases but suffered from read amplification. Data is spread across multiple sorted files (SSTables). To find a key, you might need to check every file:
Read Query: "Does user:12345 exist?"
Check Memtable → Not found
Check SSTable Level 0 → Not found
Check SSTable Level 1 → Not found
...
Check SSTable Level N → Not found
Total: N disk reads for a negative result
For a key that doesn't exist, you've done maximum work for zero benefit.
Google's LevelDB & Facebook's RocksDB
Google engineers Jeff Dean and Sanjay Ghemawat added Bloom filters to LevelDB's SSTables. Before reading a file, LevelDB checks its Bloom filter:
- If the filter says "definitely not here" → skip the file
- If the filter says "maybe here" → read the file
With a 1% false positive rate, 99% of negative queries avoid the disk read entirely.
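The skip logic can be sketched with a toy model. Here a std::unordered_set stands in for the per-file Bloom filter (so, unlike the real thing, it has no false positives and no fixed memory bound), and all names are illustrative, not LevelDB's actual API:

```cpp
#include <map>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Toy SSTable: a sorted key/value map plus a membership filter.
struct SSTable {
    std::unordered_set<std::string> filter; // stand-in for the per-file Bloom filter
    std::map<std::string, std::string> data;
    int diskReads = 0; // counts how often we actually "touch disk"

    std::optional<std::string> get(const std::string& key) {
        if (filter.count(key) == 0) return std::nullopt; // "definitely not here": skip
        ++diskReads;                                     // "maybe here": pay for the read
        auto it = data.find(key);
        if (it == data.end()) return std::nullopt;       // a false positive wastes the read
        return it->second;
    }
};

// Check levels newest-first, as an LSM read path does.
std::optional<std::string> lookup(std::vector<SSTable>& levels, const std::string& key) {
    for (auto& table : levels) {
        if (auto value = table.get(key)) return value;
    }
    return std::nullopt;
}
```

For a key present in no filter, the loop completes without a single diskReads increment; that is the entire win.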
Facebook later forked LevelDB to create RocksDB, enhancing the implementation with "Full Filters" (one filter per SSTable) for faster lookups, powering systems at Netflix, Uber, and LinkedIn.
Apache Cassandra
Facebook also built Cassandra for Inbox Search (billions of messages). It applies the same optimization:

// Cassandra's read path (simplified pseudocode)
public boolean isPresent(FilterKey key) {
    for (long bitIndex : hashIndexes(key)) {
        if (!bitset.get(bitIndex)) return false; // Definitely not present
    }
    return true; // Possibly present, check disk
}
Cassandra documentation explicitly states: "Bloom filters save I/O reads by allowing the database to skip reading data files that do not contain the requested partition."
Google Bigtable and Cloud Scale
Google's Bigtable paper made Bloom filters famous in the systems community:
We use Bloom filters to reduce the number of disk accesses for read operations. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.
Engineering Optimizations at Scale
Google Cloud recently published notes on their modern Bloom filter optimizations:
- Hybrid Bloom Filters: Solved low filter utilization by building filters that work even when a query specifies only a column family rather than a specific column. Result: 4x increase in utilization.
- Bloom Filter Index Cache: A custom in-memory index to access filters without binary search overhead. Result: 20-50% improvement in single-row read throughput.
Beyond Databases: The Bloom Filter Everywhere
Akamai CDN: The "One-Hit-Wonder" Gate
Akamai discovered that roughly 75% of the objects requested from its CDN are accessed exactly once ("one-hit wonders"). Caching these wastes disk space.
The Solution: A Bloom filter gate.
- First request: Check filter. Not found? Add to filter, but don't cache. Serve from origin.
- Second request: Check filter. Found? Cache it.
This "second-chance" caching reduces disk writes by 75% and extends SSD lifespan.
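The gate fits in a few lines; a sketch (class and method names are mine; a std::unordered_set stands in for the Bloom filter, so it lacks the real filter's fixed memory footprint and false positives):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Second-chance admission: an object enters the cache only on its second request.
class CacheGate {
    std::unordered_set<std::string> seenOnce_;           // stand-in for the Bloom filter
    std::unordered_map<std::string, std::string> cache_;
public:
    // Returns true when served from cache, false when served from origin.
    bool request(const std::string& url, const std::string& body) {
        if (cache_.count(url)) return true;  // cache hit
        if (seenOnce_.count(url)) {
            cache_[url] = body;              // second request: admit to cache
        } else {
            seenOnce_.insert(url);           // first request: remember, don't cache
        }
        return false;                        // either way, this request hit the origin
    }
    std::size_t cachedObjects() const { return cache_.size(); }
};
```

One-hit wonders never get past seenOnce_, so they never generate a cache write.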
Google Chrome: Safe Browsing
Chrome protects users from malicious URLs without sending every visited site to Google (privacy).
- Chrome downloads a compact Bloom filter of known hash prefixes.
- When you visit a site, it checks the local filter.
- Not in filter? Safe. (99.9% of cases).
- In filter? Only then does it contact Google for a full check.
Ethereum: Log Filtering
Every Ethereum block header contains a 2048-bit Bloom filter indexing addresses and event topics.
- When querying for "USDC Transfer events" across millions of blocks, nodes check the header Bloom filter first.
- If the filter says "no USDC events here," the entire block processing is skipped.
PostgreSQL: bloom Index
PostgreSQL 9.6 introduced bloom indexes for wide tables. Instead of 6 separate B-tree indexes for name, email, phone, etc., you create one bloom index. It supports any combination of equality queries (WHERE name='Alice' or WHERE zip='10001') with a fraction of the space.
First Principles: The Mathematics
Let's derive why this works.
False Positive Probability
The probability that a specific bit is NOT set by one hash function is 1 - 1/m. After inserting n elements with k hash functions each, the probability that a given bit is still 0 is approximately (1 - 1/m)^(kn) ≈ e^(-kn/m).
For a query on an element not in the set to be a false positive, all k probed bits must be 1 by pure chance. The probability is:

p ≈ (1 - e^(-kn/m))^k
Optimal Parameters
To minimize false positives for a given bit array size m and element count n, the optimal number of hash functions is:

k = (m/n) * ln 2

Space required for a target false positive rate p:

m = -n * ln(p) / (ln 2)^2
Rule of Thumb:
- 1% false positive rate requires 9.6 bits per element.
- 0.1% false positive rate requires 14.4 bits per element.
- A standard hash set requires 64+ bits per element. Bloom filters offer an ~85% memory reduction.
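Both rule-of-thumb numbers fall straight out of the space formula m/n = -ln(p) / (ln 2)^2; a quick check (the function name is mine):

```cpp
#include <cmath>

// Bits per element needed for a target false positive rate p.
double bloomBitsPerElement(double targetFpRate) {
    double ln2 = std::log(2.0);
    return -std::log(targetFpRate) / (ln2 * ln2);
}
```

bloomBitsPerElement(0.01) is about 9.59 and bloomBitsPerElement(0.001) about 14.38, which round to the 9.6 and 14.4 quoted above.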
Complete C++ Implementation
Here is a complete, self-contained implementation using MurmurHash3-style mixing and enhanced double hashing:
#include <cstdint>
#include <cmath>
#include <vector>
#include <string>
#include <array>
#include <stdexcept>
#include <cstring>
#include <algorithm>
class BloomFilter {
private:
    std::vector<uint64_t> bits_;   // Bit array stored as 64-bit words
    size_t numBits_;               // Total number of bits
    size_t numHashes_;             // Number of hash functions
    size_t numElements_;           // Elements inserted (for statistics)

    // MurmurHash3 64-bit finalizer
    static uint64_t murmurFinalize(uint64_t h) {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }

    // Generate k hash values using enhanced double hashing
    void getHashes(const void* data, size_t len, uint64_t* hashes) const {
        // Compute two base hashes using MurmurHash3-style mixing
        uint64_t h1 = 0x9e3779b97f4a7c15ULL; // Golden ratio
        uint64_t h2 = 0xbf58476d1ce4e5b9ULL; // Random prime
        const uint8_t* bytes = static_cast<const uint8_t*>(data);
        for (size_t i = 0; i < len; i++) {
            h1 ^= bytes[i];
            h1 *= 0x87c37b91114253d5ULL;
            h1 = (h1 << 31) | (h1 >> 33); // rotate left by 31
            h2 ^= bytes[i];
            h2 *= 0x4cf5ad432745937fULL;
            h2 = (h2 << 29) | (h2 >> 35); // rotate left by 29
        }
        h1 = murmurFinalize(h1);
        h2 = murmurFinalize(h2);
        // Enhanced double hashing: g_i = h1 + i*h2 + i^2
        for (size_t i = 0; i < numHashes_; i++) {
            hashes[i] = (h1 + i * h2 + i * i) % numBits_;
        }
    }

    void setBit(size_t pos) {
        bits_[pos / 64] |= (1ULL << (pos % 64));
    }

    bool getBit(size_t pos) const {
        return (bits_[pos / 64] & (1ULL << (pos % 64))) != 0;
    }

public:
    // Create filter for expectedElements with targetFpRate false positive rate
    BloomFilter(size_t expectedElements, double targetFpRate = 0.01)
        : numElements_(0) {
        expectedElements = std::max(expectedElements, size_t(1)); // guard against n = 0
        double ln2sq = 0.4804530139182014; // (ln 2)^2
        // m = -n * ln(p) / (ln 2)^2
        double m = -static_cast<double>(expectedElements) * std::log(targetFpRate) / ln2sq;
        numBits_ = static_cast<size_t>(std::ceil(m));
        numBits_ = std::max(numBits_, size_t(64));
        // k = (m/n) * ln 2
        double k = (static_cast<double>(numBits_) / expectedElements) * 0.6931471805599453;
        numHashes_ = static_cast<size_t>(std::round(k));
        numHashes_ = std::max(numHashes_, size_t(1));
        numHashes_ = std::min(numHashes_, size_t(32)); // matches the fixed hash buffer below
        bits_.resize((numBits_ + 63) / 64, 0);
    }

    void add(const std::string& s) {
        std::array<uint64_t, 32> hashes;
        getHashes(s.data(), s.size(), hashes.data());
        for (size_t i = 0; i < numHashes_; i++) setBit(hashes[i]);
        numElements_++;
    }

    bool possiblyContains(const std::string& s) const {
        std::array<uint64_t, 32> hashes;
        getHashes(s.data(), s.size(), hashes.data());
        for (size_t i = 0; i < numHashes_; i++) {
            if (!getBit(hashes[i])) return false;
        }
        return true;
    }

    size_t sizeInBytes() const { return bits_.size() * 8; }
};
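A good way to sanity-check any Bloom filter is to measure its false positive rate empirically. The condensed, self-contained variant below uses std::hash-based double hashing instead of the MurmurHash-style mixing above (the second-hash derivation and all constants are illustrative), inserts n keys, then probes n keys that were never inserted:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter using std::hash with a derived second hash.
struct MiniBloom {
    std::vector<bool> bits;
    std::size_t k;
    MiniBloom(std::size_t m, std::size_t k) : bits(m, false), k(k) {}

    std::size_t pos(const std::string& s, std::size_t i) const {
        std::uint64_t h1 = std::hash<std::string>{}(s);
        std::uint64_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1; // illustrative second hash
        return (h1 + i * h2) % bits.size();
    }
    void add(const std::string& s) {
        for (std::size_t i = 0; i < k; i++) bits[pos(s, i)] = true;
    }
    bool possiblyContains(const std::string& s) const {
        for (std::size_t i = 0; i < k; i++)
            if (!bits[pos(s, i)]) return false;
        return true;
    }
};

// Insert n keys, then probe n keys that were never inserted.
double measuredFpRate(std::size_t n, std::size_t bitsPerElement, std::size_t k) {
    MiniBloom filter(n * bitsPerElement, k);
    for (std::size_t i = 0; i < n; i++) filter.add("present-" + std::to_string(i));
    std::size_t falsePositives = 0;
    for (std::size_t i = 0; i < n; i++)
        if (filter.possiblyContains("absent-" + std::to_string(i))) falsePositives++;
    return static_cast<double>(falsePositives) / n;
}
```

At 10 bits per element with k = 7, the measured rate should land near the theoretical ~1%, though the exact value depends on hash quality.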
Evolution: Counting, Cuckoo, and Ribbon Filters
The standard Bloom filter was just the beginning.
- Counting Bloom Filter (1998): Replaced bits with counters to allow deletion.
- Cuckoo Filter (2014): Uses Cuckoo hashing for better space efficiency and deletion support. Now available in Redis Stack.
- Ribbon Filter (2021): Meta's internal optimization, 30% smaller than Bloom filters for the same accuracy, used in RocksDB.
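The counting variant's deletion support is easy to sketch: every bit becomes a small counter, incremented on insert and decremented on delete. A minimal illustration (8-bit saturating counters, std::hash-based probing; all names and constants are mine):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Counting Bloom filter sketch: counters instead of bits make deletion possible.
struct CountingBloom {
    std::vector<std::uint8_t> counters;
    std::size_t k;
    CountingBloom(std::size_t m, std::size_t k) : counters(m, 0), k(k) {}

    std::size_t pos(const std::string& s, std::size_t i) const {
        std::uint64_t h1 = std::hash<std::string>{}(s);
        std::uint64_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1; // illustrative second hash
        return (h1 + i * h2) % counters.size();
    }
    void add(const std::string& s) {
        for (std::size_t i = 0; i < k; i++)
            if (counters[pos(s, i)] < 255) counters[pos(s, i)]++; // saturate, never wrap
    }
    void remove(const std::string& s) { // caller must know s was actually added
        for (std::size_t i = 0; i < k; i++)
            if (counters[pos(s, i)] > 0) counters[pos(s, i)]--;
    }
    bool possiblyContains(const std::string& s) const {
        for (std::size_t i = 0; i < k; i++)
            if (counters[pos(s, i)] == 0) return false;
        return true;
    }
};
```

The cost is 4-8x more space than a plain Bloom filter, which is why deletion-free workloads stick with bits.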
Conclusion
Burton Bloom's 1970 paper asked a simple question: "What if we accept some errors?" Fifty-four years later, that question powers systems serving billions of users.
For the systems engineer, the Bloom filter is a reminder: sometimes the most valuable thing a system can do is nothing at all.