Database Sharding - Scale Out with Horizontal Partitioning

14 min read | 2024.12.31

What is Sharding

Sharding is a technique that horizontally partitions data across multiple databases (shards). It handles massive amounts of data and traffic that a single database cannot process.

Difference from Replication: Replication keeps copies of the same data on multiple servers, while sharding distributes different data across different servers.

How Sharding Works

flowchart TB
    subgraph Before["Before Sharding"]
        Single["Single Database<br/>All user data<br/>(100 million records)"]
    end

    subgraph After["After Sharding"]
        S0["Shard 0<br/>Users A-F<br/>25M"]
        S1["Shard 1<br/>Users G-L<br/>25M"]
        S2["Shard 2<br/>Users M-R<br/>25M"]
        S3["Shard 3<br/>Users S-Z<br/>25M"]
    end

    Before --> After

Sharding Methods

Range-based Sharding

Determines shard by key range.

user_id RangeShard
1-1,000,000Shard 0
1,000,001-2,000,000Shard 1
2,000,001-3,000,000Shard 2

Advantages: Efficient range queries Disadvantages: Prone to data skew

Hash-based Sharding

Determines shard by key hash value.

shard_id = hash(user_id) % num_shards
Hash CalculationResult
hash(“user_123”) % 4 = 2Shard 2
hash(“user_456”) % 4 = 0Shard 0

Advantages: Even data distribution Disadvantages: Inefficient range queries

Directory-based Sharding

Uses a lookup table to determine shard.

user_idshard
user_123Shard 2
user_456Shard 0
user_789Shard 1

Advantages: Flexible mapping Disadvantages: Lookup overhead

Shard Key Selection

The shard key is the most critical factor determining sharding success.

Good Shard Key Criteria

Criteria
✓ High cardinality (diverse values)
✓ Matches access patterns
✓ Evenly distributes writes
✓ Included in most queries

Shard Key Examples

Use CaseGood Shard KeyBad Shard Key
Multi-tenant SaaStenant_idcreated_at
E-commerceuser_idproduct_category
SNSuser_idpost_date
Log AnalysisTime + identifier compositeTime only

Hotspot Problem

ExampleResult
Bad: Sharding by created_atWrites concentrate on latest data shard
Good: user_id + created_at composite keyWrites are distributed

Cross-Shard Queries

Queries spanning multiple shards become complex.

Scatter-Gather

flowchart TB
    Client["Client"] --> Router["Router"]
    Router -->|"Scatter (Query all shards)"| S0["S0"]
    Router --> S1["S1"]
    Router --> S2["S2"]
    Router --> S3["S3"]
    S0 -->|"Gather (Aggregate results)"| Router
    S1 --> Router
    S2 --> Router
    S3 --> Router
    Router --> Client

JOIN Limitations

-- Within same shard: Normal JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 123;  -- Contains shard key

-- Cross-shard: Join on application side
-- 1. Get data from users table
-- 2. Get data from orders table
-- 3. JOIN on application side

Rebalancing

Redistributing data between shards.

When It’s Needed

  • Adding/removing shards
  • Resolving data skew
  • Performance optimization

Consistent Hashing

A technique that minimizes data movement when shard count changes.

Hashing TypeShards 4→5 ChangeData Movement
Traditional hashingShards 4→5About 80% of data moves
Consistent hashingShards 4→5Only about 20% of data moves

Sharding Challenges

ChallengeSolution
Transaction complexity2-phase commit, Saga pattern
JOIN limitationsData denormalization, application-side joins
Schema change difficultyOnline DDL tools
Operational complexityUse managed services
Backup and recoveryPer-shard strategy

Managed Sharding

ServiceFeatures
Amazon AuroraAuto-sharding support
CockroachDBDistributed SQL, auto-rebalancing
VitessMySQL-compatible, developed by YouTube
MongoDBBuilt-in sharding
TiDBMySQL-compatible, NewSQL

Sharding Decision Criteria

When sharding is neededAlternatives to consider first
Exceeding single DB capacity limitsAdd read replicas
Write throughput at limitImplement caching
Vertical scaling cost too highQuery optimization
Index review
Migrate to larger instance

Summary

Sharding is a powerful technique for horizontal database scaling, but it comes with complexity. Shard key selection is the key to success—thoroughly analyze access patterns before deciding to implement. When possible, consider managed services to reduce operational burden.

← Back to list