TAO: How Facebook Built a Graph Database for 3 Billion People
In 2009, Facebook engineers faced a crisis. Their social graph — the connections between 300 million users — was stored in MySQL. Every "friend request accepted" triggered cascading queries across sharded databases. Read latency was spiking, and cache invalidation was a nightmare. The infrastructure that had scaled Facebook from a dorm room to hundreds of millions of users was now its biggest bottleneck.
The solution would become one of the most impressive feats in distributed systems engineering: TAO — The Associations and Objects store.
TAO handles 1 billion+ queries per second with sub-millisecond latency, serving a graph with 400+ billion edges across billions of objects.
This is the story of how Facebook built a database for the social graph of humanity.
The Problem: Why MySQL Couldn't Scale
The Memcache Band-Aid
By 2009, Facebook's architecture looked like this:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Web Tier │────▶│ Memcached │────▶│ MySQL │
│ (PHP/Hack) │ │ (Cache) │ │ (Sharded) │
└─────────────┘ └─────────────┘ └─────────────┘
The pattern was "look-aside caching":
- Check Memcached for data
- If miss: query MySQL, populate cache
- On write: update MySQL, invalidate cache
This worked at small scale. It collapsed at Facebook scale.
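The look-aside pattern above fits in a few lines of code. Here is a minimal Python sketch, with in-memory dicts standing in for Memcached and MySQL (class and method names are illustrative, not Facebook's actual code):

```python
class DictDB:
    """Toy stand-in for a MySQL shard."""
    def __init__(self):
        self.rows = {}
    def get(self, key):
        return self.rows.get(key)
    def set(self, key, value):
        self.rows[key] = value

class LookAsideCache:
    """Look-aside (cache-aside) caching: the application, not the
    cache, decides when to read through to the database."""
    def __init__(self, db):
        self.db = db       # durable storage
        self.cache = {}    # stand-in for Memcached

    def read(self, key):
        if key in self.cache:           # 1. check cache
            return self.cache[key]
        value = self.db.get(key)        # 2. miss: query the database
        self.cache[key] = value         #    and populate the cache
        return value

    def write(self, key, value):
        self.db.set(key, value)         # 3. update the database first...
        self.cache.pop(key, None)       #    ...then invalidate the cache
```

Note that `write` invalidates rather than updates the cache entry — the classic choice that, at scale, opens the door to the thundering-herd problem described next.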
The Three Killers
1. Thundering Herds
When a cache key expired or was invalidated, hundreds of concurrent requests would simultaneously hit MySQL for the same data:
Cache Miss on popular user's friend list
├─▶ Request 1 → MySQL query
├─▶ Request 2 → MySQL query
├─▶ Request 3 → MySQL query
└─▶ ... 500 more requests → MySQL overwhelmed
2. Graph Query Complexity
Social graph queries are inherently multi-hop:
-- "Get friends of friends who liked this post"
SELECT DISTINCT u.*
FROM users u
JOIN friendships f1 ON f1.friend_id = u.id
JOIN friendships f2 ON f2.user_id = f1.user_id
JOIN post_likes pl ON pl.user_id = u.id
WHERE f2.friend_id = @current_user AND pl.post_id = @post_id
This single query could touch millions of rows across dozens of shards.
3. Cross-Shard Consistency Hell
When Alice befriends Bob:
- Write friendship record to Alice's shard
- Write reverse friendship to Bob's shard
- Invalidate cache on Alice's server
- Invalidate cache on Bob's server
- Invalidate "friends of Alice" cached lists
- Invalidate "friends of Bob" cached lists
One operation triggered 6+ distributed writes and invalidations. Race conditions were inevitable.
The Insight: Model Everything as Objects and Associations
The key insight was that almost everything in Facebook could be represented as a graph:
- Objects: Users, Posts, Photos, Pages, Comments
- Associations: Friendships, Likes, Tags, Follows
Instead of SQL tables, TAO uses a simple graph model:
Object:
┌─────────────────────────────────────────────────┐
│ id (64-bit) │ type │ data (blob) │ ver │
│ 4531234567 │ USER │ {name:...} │ 42 │
└─────────────────────────────────────────────────┘
Association:
┌──────────────────────────────────────────────────────────┐
│ id1 │ atype │ id2 │ timestamp │ data │
│ 123 │ FRIEND │ 456 │ 1609459200 │ {since: 2021} │
└──────────────────────────────────────────────────────────┘
The API Simplicity
TAO exposes just six core operations:
| Operation | Description | Example |
|---|---|---|
| `obj_get(id)` | Get object by ID | Get user profile |
| `obj_set(id, data)` | Create/update object | Update profile |
| `assoc_add(id1, atype, id2, data)` | Create association | Add friend |
| `assoc_del(id1, atype, id2)` | Delete association | Unfriend |
| `assoc_get(id1, atype)` | Get all associations | Get friend list |
| `assoc_count(id1, atype)` | Count associations | Friend count |
This minimal API handles 99% of Facebook's data access patterns.
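To make the model concrete, here is a toy in-memory sketch of the six operations (purely illustrative — TAO itself is a distributed C++ system, and the versioning and timestamp details here are simplified assumptions):

```python
import time
from collections import defaultdict

class TinyTAO:
    """Toy in-memory model of TAO's six-operation API. Objects carry
    (type, data, version); associations are kept per (id1, atype),
    newest first, matching TAO's time-ordered association lists."""

    def __init__(self):
        self.objects = {}                 # id -> {"type", "data", "ver"}
        self.assocs = defaultdict(list)   # (id1, atype) -> [(ts, id2, data)]

    def obj_get(self, oid):
        return self.objects.get(oid)

    def obj_set(self, oid, otype, data):
        ver = self.objects.get(oid, {}).get("ver", 0) + 1
        self.objects[oid] = {"type": otype, "data": data, "ver": ver}

    def assoc_add(self, id1, atype, id2, data=None, ts=None):
        lst = self.assocs[(id1, atype)]
        lst.append((ts if ts is not None else time.time(), id2, data))
        lst.sort(key=lambda e: -e[0])     # keep newest first

    def assoc_del(self, id1, atype, id2):
        key = (id1, atype)
        self.assocs[key] = [e for e in self.assocs[key] if e[1] != id2]

    def assoc_get(self, id1, atype):
        return [id2 for _, id2, _ in self.assocs[(id1, atype)]]

    def assoc_count(self, id1, atype):
        return len(self.assocs[(id1, atype)])
```

A friend list becomes `assoc_get(user_id, "FRIEND")`, a friend count becomes `assoc_count(user_id, "FRIEND")` — no joins, no shard fan-out in the API itself.
<test>
t = TinyTAO()
t.obj_set(123, "USER", {"name": "alice"})
t.obj_set(123, "USER", {"name": "alice b"})
assert t.obj_get(123)["ver"] == 2
t.assoc_add(123, "FRIEND", 456, ts=1)
t.assoc_add(123, "FRIEND", 789, ts=2)
assert t.assoc_get(123, "FRIEND") == [789, 456]
assert t.assoc_count(123, "FRIEND") == 2
t.assoc_del(123, "FRIEND", 456)
assert t.assoc_count(123, "FRIEND") == 1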
The Architecture: Three Tiers of Intelligence
TAO's architecture is deceptively simple but engineered for planetary scale:
┌─────────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ (GraphQL, Feed Service, Messenger, Search) │
└─────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────────────┐
│ TAO LEADER CACHE │
│ - Handles ALL writes │
│ - Owns cache invalidation │
│ - One leader per shard │
└─────────────────────────────────────────────────────────────────────┘
│ │
┌───────────────────────┐ ┌───────────────────────┐
│ TAO Follower Cache │ │ TAO Follower Cache │
│ (Read replicas) │ │ (Read replicas) │
│ - Handles reads │ │ - Handles reads │
│ - Eventually │ │ - Geographically │
│ consistent │ │ distributed │
└───────────────────────┘ └───────────────────────┘
│ │
┌─────────────────────────────────────────────────────────────────────┐
│ MYSQL STORAGE │
│ - Persistent storage │
│ - Sharded by object_id │
│ - 1000s of shards │
└─────────────────────────────────────────────────────────────────────┘
Why This Works
Leader Cache: ALL writes go through a single leader per shard. This eliminates cache consistency problems — the leader is the source of truth.
Follower Caches: Read replicas in every data center. They subscribe to the leader for invalidations.
MySQL: Still the durable storage, but TAO absorbs 99%+ of reads before they reach MySQL.
The Hard Problems (And How They Solved Them)
Problem 1: Thundering Herds
Solution: In-Flight Request Coalescing
When multiple requests ask for the same data simultaneously, TAO coalesces them:
Incoming: assoc_get(user:123, FRIEND)
├─▶ Request 1 ─┐
├─▶ Request 2 ─┼──▶ Single cache lookup / MySQL query
├─▶ Request 3 ─┘
└─▶ All three get the same response
Implementation insight: TAO maintains a map of in-flight requests. New requests for the same key attach as "waiters" rather than triggering duplicate fetches.
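The leader/waiter idea can be sketched in Python with a lock and an event per key — an assumed shape, since TAO's real implementation is internal to Facebook:

```python
import threading

class CoalescingFetcher:
    """In-flight request coalescing: concurrent requests for the same
    key share a single backend fetch. The first arrival becomes the
    leader; later arrivals attach as waiters."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn   # the expensive backend call
        self.lock = threading.Lock()
        self.in_flight = {}        # key -> {"event", "result"}

    def get(self, key):
        with self.lock:
            entry = self.in_flight.get(key)
            leader = entry is None
            if leader:
                entry = {"event": threading.Event(), "result": None}
                self.in_flight[key] = entry
        if leader:
            entry["result"] = self.fetch_fn(key)   # only the leader fetches
            with self.lock:
                del self.in_flight[key]            # new requests start fresh
            entry["event"].set()                   # wake all waiters
        else:
            entry["event"].wait()                  # waiters block on the leader
        return entry["result"]
```

With 500 concurrent requests for the same key, `fetch_fn` runs exactly once and all 500 callers get its result — which is the whole point.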
Problem 2: Hot Spots (Celebrity Problem)
When Taylor Swift posts, millions fetch assoc_get(taylor_swift, LIKES) simultaneously. Even with coalescing, this overwhelms a single cache server.
Solution: Hierarchical Caching with Consistent Hashing
Taylor Swift's data:
│
├─▶ Primary Cache Server (owner)
│
└─▶ N Secondary Cache Servers (read-only replicas)
├─▶ Server A (handles 20% of reads)
├─▶ Server B (handles 20% of reads)
├─▶ Server C (handles 20% of reads)
├─▶ Server D (handles 20% of reads)
└─▶ Server E (handles 20% of reads)
Hot keys are automatically replicated across multiple cache tiers. The routing layer uses weighted random selection to spread load.
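A routing layer along these lines might look like the following sketch — an illustrative policy, not TAO's actual routing code, and using uniform random selection over the replica set where the production system weights by capacity:

```python
import hashlib
import random

def pick_cache_server(key, servers, hot_keys, replicas=5):
    """Hot-key fan-out: a normal key hashes to a single owner on the
    ring; a key flagged hot is also served by the next `replicas`
    servers, chosen at random to spread read load."""
    # Build a deterministic ring ordering from server-name hashes.
    ring = sorted(servers, key=lambda s: hashlib.md5(s.encode()).hexdigest())
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    owner_idx = h % len(ring)
    if key not in hot_keys:
        return ring[owner_idx]           # cold key: single owner
    # Hot key: spread reads over the owner plus its ring successors.
    candidates = [ring[(owner_idx + i) % len(ring)]
                  for i in range(replicas + 1)]
    return random.choice(candidates)
```

Cold keys keep the usual one-owner placement (so invalidation stays cheap); only keys detected as hot pay the cost of replica fan-out.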
Problem 3: Association Lists (Pagination)
A user can have 5,000 friends — you can't load all 5,000 associations on every request.
Solution: Time-Ordered Association Lists
Associations are stored sorted by timestamp (descending). TAO provides:
assoc_range(id1, atype, offset, limit)
This returns associations in recency order with efficient pagination. The storage uses a B-tree index on (id1, atype, timestamp).
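A minimal sketch of such a time-ordered list, using a sorted in-memory structure in place of the B-tree (the negated-timestamp trick is an implementation convenience of this sketch, not a claim about TAO's storage format):

```python
import bisect

class AssocList:
    """Time-ordered association list with offset-based pagination,
    mirroring assoc_range(id1, atype, offset, limit). Entries are kept
    sorted by timestamp descending (newest first)."""

    def __init__(self):
        self.entries = []   # list of (-timestamp, id2), kept sorted

    def add(self, id2, timestamp):
        # Negating the timestamp makes ascending sort order == newest first.
        bisect.insort(self.entries, (-timestamp, id2))

    def range(self, offset, limit):
        page = self.entries[offset:offset + limit]
        return [(id2, -neg_ts) for neg_ts, id2 in page]
```

Because the list is already in recency order, "first page of friends" is just `range(0, 20)` — no sort at query time, which matters when the same list is read millions of times.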
Problem 4: Bidirectional Edges
When Alice friends Bob:
- Store `(Alice, FRIEND, Bob)` on Alice's shard
- Store `(Bob, FRIEND, Alice)` on Bob's shard
These must be atomically consistent.
Solution: Two-Phase Write with Background Repair
Phase 1: Write to Alice's shard (primary)
Phase 2: Async write to Bob's shard (inverse)
If Phase 2 fails:
- Background repair daemon detects missing inverse
- Re-applies the write
TAO accepts brief inconsistency (seconds) for availability. Background repair heals any divergence.
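The forward-write-then-async-inverse flow can be sketched as follows, with dicts standing in for shards and a flag simulating a failed inverse write (a toy model of the pattern, not Facebook's repair daemon):

```python
class BidirectionalStore:
    """Two-phase bidirectional edge write with background repair:
    the forward edge is written synchronously, the inverse edge
    asynchronously; failures are queued for a repair pass."""

    def __init__(self):
        self.shards = {}            # node -> set of (atype, other_node)
        self.pending_repairs = []   # inverse writes that failed

    def _shard(self, node):
        return self.shards.setdefault(node, set())

    def assoc_add(self, id1, atype, id2, inverse_fails=False):
        self._shard(id1).add((atype, id2))        # phase 1: forward edge
        if inverse_fails:
            # Phase 2 failed: record it so the repair daemon can retry.
            self.pending_repairs.append((id1, atype, id2))
        else:
            self._shard(id2).add((atype, id1))    # phase 2: inverse edge

    def repair(self):
        """Background daemon: re-apply any missing inverse edges."""
        while self.pending_repairs:
            id1, atype, id2 = self.pending_repairs.pop()
            self._shard(id2).add((atype, id1))
```

Between the failed phase 2 and the repair pass, Bob's shard briefly lacks the inverse edge — exactly the seconds-long inconsistency TAO accepts in exchange for availability.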
The Numbers That Matter
From Facebook's 2013 TAO paper and subsequent engineering talks:
| Metric | Value |
|---|---|
| Objects stored | Billions |
| Associations stored | 400+ billion edges |
| Read QPS | 1+ billion queries/second |
| Cache hit rate | 99.8% |
| Read latency (p50) | under 1ms |
| Read latency (p99) | under 5ms |
| Write latency | under 10ms |
| Data centers | 6+ global regions |
Why 99.8% Cache Hit Rate?
Social graph access patterns are highly skewed:
- 80% of reads are for data written in the last week
- Power-law distribution: 1% of objects get 50% of reads
- Friends lists change infrequently (high cache TTL)
TAO exploits this with tiered caching and aggressive prefetching.
What Engineers Can Learn
1. Optimize for Your Access Patterns
TAO's 6-operation API isn't a limitation — it's a feature. By constraining the API, Facebook could optimize the hell out of those specific patterns.
Lesson: Generic solutions (like raw SQL) are flexible but slow. Constrained solutions (like TAO's API) can be orders of magnitude faster.
2. Embrace Eventual Consistency (Where Safe)
TAO doesn't guarantee immediate consistency across data centers. If you're in Singapore and your friend accepts a request in New York, you might not see it for a few seconds.
Lesson: Users rarely notice sub-second inconsistency. Trading strict consistency for availability and latency is often the right call for social applications.
3. Put Smart Caches in Front of Dumb Storage
MySQL is "dumb" storage in TAO — it just persists data. The TAO cache layer handles all the hard problems: coalescing, hot spots, invalidation.
Lesson: Separate your caching logic from your storage logic. Let each layer do what it's good at.
4. Design for Failure Recovery, Not Failure Prevention
TAO's two-phase writes can fail mid-way. Instead of complex distributed transactions, TAO uses background repair.
Lesson: In distributed systems, failures are inevitable. Design systems that heal themselves rather than trying to prevent all failures.
The Legacy: TAO's Influence
TAO's ideas have spread throughout the industry:
- LinkedIn's Espresso: a distributed data store whose caching tier echoes ideas TAO popularized
- Twitter's Manhattan: another planet-scale store built on a similar caches-in-front-of-durable-storage split
- Open Source: projects like Dgraph and JanusGraph draw on TAO-style objects-and-associations modeling
Even if you never build at Facebook's scale, understanding TAO teaches you how to think about graph storage, caching hierarchies, and consistency trade-offs.
Conclusion
TAO is one of the most impressive pieces of infrastructure engineering in the history of computing. It transformed a collapsing MySQL-based system into a planet-scale graph database serving billions of users.
The key insight wasn't a single breakthrough — it was a series of practical engineering decisions:
- Constrain the API to optimize for common patterns
- Use hierarchical caching with smart invalidation
- Accept eventual consistency where users won't notice
- Build self-healing systems instead of failure-proof ones
For engineers, TAO is a masterclass in building systems that work at the edge of what's possible.
"Any sufficiently advanced caching is indistinguishable from a distributed database."
— Paraphrasing Arthur C. Clarke
Further Reading
- TAO: Facebook's Distributed Data Store for the Social Graph — Original USENIX ATC 2013 paper
- Scaling Memcache at Facebook — The predecessor architecture
- Facebook Engineering Blog: TAO — Engineering announcement