Database Sharding - Scale Out with Horizontal Partitioning | Concepts

What is Sharding

Sharding is a technique that horizontally partitions data across multiple databases (shards). It handles massive amounts of data and traffic that a single database cannot process.

Difference from Replication: Replication keeps copies of the same data on multiple servers, while sharding distributes different data across different servers.

How Sharding Works

flowchart TB
    subgraph Before["Before Sharding"]
        Single["Single Database<br/>All user data<br/>(100 million records)"]
    end

    subgraph After["After Sharding"]
        S0["Shard 0<br/>Users A-F<br/>25M"]
        S1["Shard 1<br/>Users G-L<br/>25M"]
        S2["Shard 2<br/>Users M-R<br/>25M"]
        S3["Shard 3<br/>Users S-Z<br/>25M"]
    end

    Before --> After

Sharding Methods

Range-based Sharding

Determines shard by key range.

user_id Range	Shard
1-1,000,000	Shard 0
1,000,001-2,000,000	Shard 1
2,000,001-3,000,000	Shard 2

Advantages: Efficient range queries Disadvantages: Prone to data skew

Hash-based Sharding

Determines shard by key hash value.

shard_id = hash(user_id) % num_shards

Hash Calculation	Result
hash(“user_123”) % 4 = 2	Shard 2
hash(“user_456”) % 4 = 0	Shard 0

Advantages: Even data distribution Disadvantages: Inefficient range queries

Directory-based Sharding

Uses a lookup table to determine shard.

user_id	shard
user_123	Shard 2
user_456	Shard 0
user_789	Shard 1

Advantages: Flexible mapping Disadvantages: Lookup overhead

Shard Key Selection

The shard key is the most critical factor determining sharding success.

Good Shard Key Criteria

Criteria
✓ High cardinality (diverse values)
✓ Matches access patterns
✓ Evenly distributes writes
✓ Included in most queries

Shard Key Examples

Use Case	Good Shard Key	Bad Shard Key
Multi-tenant SaaS	tenant_id	created_at
E-commerce	user_id	product_category
SNS	user_id	post_date
Log Analysis	Time + identifier composite	Time only

Hotspot Problem

Example	Result
Bad: Sharding by created_at	Writes concentrate on latest data shard
Good: user_id + created_at composite key	Writes are distributed

Cross-Shard Queries

Queries spanning multiple shards become complex.

Scatter-Gather

flowchart TB
    Client["Client"] --> Router["Router"]
    Router -->|"Scatter (Query all shards)"| S0["S0"]
    Router --> S1["S1"]
    Router --> S2["S2"]
    Router --> S3["S3"]
    S0 -->|"Gather (Aggregate results)"| Router
    S1 --> Router
    S2 --> Router
    S3 --> Router
    Router --> Client

JOIN Limitations

-- Within same shard: Normal JOIN
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 123;  -- Contains shard key

-- Cross-shard: Join on application side
-- 1. Get data from users table
-- 2. Get data from orders table
-- 3. JOIN on application side

Rebalancing

Redistributing data between shards.

When It’s Needed

Adding/removing shards
Resolving data skew
Performance optimization

Consistent Hashing

A technique that minimizes data movement when shard count changes.

Hashing Type	Shards 4→5 Change	Data Movement
Traditional hashing	Shards 4→5	About 80% of data moves
Consistent hashing	Shards 4→5	Only about 20% of data moves

Sharding Challenges

Challenge	Solution
Transaction complexity	2-phase commit, Saga pattern
JOIN limitations	Data denormalization, application-side joins
Schema change difficulty	Online DDL tools
Operational complexity	Use managed services
Backup and recovery	Per-shard strategy

Managed Sharding

Service	Features
Amazon Aurora	Auto-sharding support
CockroachDB	Distributed SQL, auto-rebalancing
Vitess	MySQL-compatible, developed by YouTube
MongoDB	Built-in sharding
TiDB	MySQL-compatible, NewSQL

Sharding Decision Criteria

When sharding is needed	Alternatives to consider first
Exceeding single DB capacity limits	Add read replicas
Write throughput at limit	Implement caching
Vertical scaling cost too high	Query optimization
	Index review
	Migrate to larger instance

Summary

Sharding is a powerful technique for horizontal database scaling, but it comes with complexity. Shard key selection is the key to success—thoroughly analyze access patterns before deciding to implement. When possible, consider managed services to reduce operational burden.

← Back to list