What is Sharding
Sharding is a technique that horizontally partitions data across multiple databases (shards). It handles massive amounts of data and traffic that a single database cannot process.
Difference from Replication: Replication keeps copies of the same data on multiple servers, while sharding distributes different data across different servers.
How Sharding Works
flowchart TB
subgraph Before["Before Sharding"]
Single["Single Database<br/>All user data<br/>(100 million records)"]
end
subgraph After["After Sharding"]
S0["Shard 0<br/>Users A-F<br/>25M"]
S1["Shard 1<br/>Users G-L<br/>25M"]
S2["Shard 2<br/>Users M-R<br/>25M"]
S3["Shard 3<br/>Users S-Z<br/>25M"]
end
Before --> After
Sharding Methods
Range-based Sharding
Determines shard by key range.
| user_id Range | Shard |
|---|---|
| 1-1,000,000 | Shard 0 |
| 1,000,001-2,000,000 | Shard 1 |
| 2,000,001-3,000,000 | Shard 2 |
Advantages: Efficient range queries Disadvantages: Prone to data skew
Hash-based Sharding
Determines shard by key hash value.
shard_id = hash(user_id) % num_shards
| Hash Calculation | Result |
|---|---|
| hash(“user_123”) % 4 = 2 | Shard 2 |
| hash(“user_456”) % 4 = 0 | Shard 0 |
Advantages: Even data distribution Disadvantages: Inefficient range queries
Directory-based Sharding
Uses a lookup table to determine shard.
| user_id | shard |
|---|---|
| user_123 | Shard 2 |
| user_456 | Shard 0 |
| user_789 | Shard 1 |
Advantages: Flexible mapping Disadvantages: Lookup overhead
Shard Key Selection
The shard key is the most critical factor determining sharding success.
Good Shard Key Criteria
| Criteria |
|---|
| ✓ High cardinality (diverse values) |
| ✓ Matches access patterns |
| ✓ Evenly distributes writes |
| ✓ Included in most queries |
Shard Key Examples
| Use Case | Good Shard Key | Bad Shard Key |
|---|---|---|
| Multi-tenant SaaS | tenant_id | created_at |
| E-commerce | user_id | product_category |
| SNS | user_id | post_date |
| Log Analysis | Time + identifier composite | Time only |
Hotspot Problem
| Example | Result |
|---|---|
| Bad: Sharding by created_at | Writes concentrate on latest data shard |
| Good: user_id + created_at composite key | Writes are distributed |
Cross-Shard Queries
Queries spanning multiple shards become complex.
Scatter-Gather
flowchart TB
Client["Client"] --> Router["Router"]
Router -->|"Scatter (Query all shards)"| S0["S0"]
Router --> S1["S1"]
Router --> S2["S2"]
Router --> S3["S3"]
S0 -->|"Gather (Aggregate results)"| Router
S1 --> Router
S2 --> Router
S3 --> Router
Router --> Client
JOIN Limitations
-- Within same shard: Normal JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 123; -- Contains shard key
-- Cross-shard: Join on application side
-- 1. Get data from users table
-- 2. Get data from orders table
-- 3. JOIN on application side
Rebalancing
Redistributing data between shards.
When It’s Needed
- Adding/removing shards
- Resolving data skew
- Performance optimization
Consistent Hashing
A technique that minimizes data movement when shard count changes.
| Hashing Type | Shards 4→5 Change | Data Movement |
|---|---|---|
| Traditional hashing | Shards 4→5 | About 80% of data moves |
| Consistent hashing | Shards 4→5 | Only about 20% of data moves |
Sharding Challenges
| Challenge | Solution |
|---|---|
| Transaction complexity | 2-phase commit, Saga pattern |
| JOIN limitations | Data denormalization, application-side joins |
| Schema change difficulty | Online DDL tools |
| Operational complexity | Use managed services |
| Backup and recovery | Per-shard strategy |
Managed Sharding
| Service | Features |
|---|---|
| Amazon Aurora | Auto-sharding support |
| CockroachDB | Distributed SQL, auto-rebalancing |
| Vitess | MySQL-compatible, developed by YouTube |
| MongoDB | Built-in sharding |
| TiDB | MySQL-compatible, NewSQL |
Sharding Decision Criteria
| When sharding is needed | Alternatives to consider first |
|---|---|
| Exceeding single DB capacity limits | Add read replicas |
| Write throughput at limit | Implement caching |
| Vertical scaling cost too high | Query optimization |
| Index review | |
| Migrate to larger instance |
Summary
Sharding is a powerful technique for horizontal database scaling, but it comes with complexity. Shard key selection is the key to success—thoroughly analyze access patterns before deciding to implement. When possible, consider managed services to reduce operational burden.
← Back to list