Database Sharding: Cost vs. Performance

Sharding can boost database performance, but it comes with trade-offs. Before diving in, consider this: sharding increases system complexity, operational overhead, and costs. It’s not a first option – opt for vertical scaling or other optimizations first. But when scaling horizontally becomes unavoidable, choosing the right sharding strategy matters.

Key Takeaways:

  • Hash-based Sharding: Balances write-heavy workloads but struggles with range queries. Costs can rise due to scatter-gather operations and resharding challenges.
  • Range-based Sharding: Great for range queries but prone to hotspots with sequential keys. Manual rebalancing is often required.
  • Directory-based Sharding: Provides flexibility for data placement but adds dependency on a lookup service, increasing complexity.

Quick Comparison:

Strategy Best For Challenges Costs
Hash-based High write volumes Poor range query performance Moderate, with added logic
Range-based Range-heavy queries Hotspots from sequential keys Lower, but uneven scaling
Directory-based Customizable workloads Dependency on lookup service High, due to added overhead

Sharding is a one-way decision – reverse migrations are risky and expensive. Use it only after exhausting simpler scaling methods.

Sharding Strategies Explained: Database Partitioning for Scalability

1. Hash-based Sharding

When it comes to sharding strategies, balancing cost and performance is key. Hash-based sharding offers predictable data placement but brings along operational challenges.

In this method, a hash function is applied to a shard key (like user_id), and records are assigned using modulo arithmetic:

shard_number = hash(shard_key) % number_of_shards

This approach ensures deterministic routing without needing a lookup table.

Performance Gains

One major benefit of hash-based sharding is its ability to evenly distribute write operations. By separating data placement from natural ordering (like timestamps or sequential IDs), it avoids overloading any single shard. Point lookups are quick since the router can calculate the target shard directly, keeping read latency low.

However, this comes at a cost for range queries. Hashing destroys data locality, meaning queries spanning a range of values must be broadcast to all shards and then combined. This scatter-gather process can add 15–25 ms to the p99 tail latency for each additional shard involved. The overall query performance depends on the slowest shard’s response time.

Infrastructure Costs

Using multiple smaller instances can save on raw compute costs – 20 smaller instances cost 47% less than a single high-end server. But when you factor in proxies, monitoring, and backups, total monthly costs end up being 31% higher. Horizontally scaled systems also tend to experience 3.5× more production incidents per month.

Resharding with a basic modulo-based approach requires remapping nearly all data, which can be highly disruptive. Consistent hashing offers a better solution, reducing data movement to roughly 1/N of the total dataset – a significant improvement at scale. For instance, Instagram’s 2012 system used 4,096 logical shards mapped to fewer physical servers. This setup allowed them to scale by simply moving logical shards between servers rather than re-hashing the entire dataset.

"As long as we picked a sufficiently random hash function, we would ensure a uniform distribution of data." – Sammy Steele, Databases Team, Figma

Operational Complexity

Each shard operates as its own instance, requiring separate backups, monitoring, patches, and schema updates. This setup increases the complexity of cross-shard coordination, extending feature deployment times by 3.2×. Notion’s 2023 resharding project, which expanded from 32 to 96 physical RDS instances using 480 logical shards with PostgreSQL logical replication, caused about a one-second user-visible impact per shard during the cutover.

Scalability

Hash-based sharding works well for write-heavy, evenly distributed workloads like session stores, event logs, or user-keyed data. Teams that have scaled such systems successfully recommend starting with a large number of logical shards (e.g., 480 or 4,096) from the beginning. This approach simplifies scaling later on, as it only requires redistributing logical buckets rather than re-hashing the entire dataset. Maintaining shard key discipline is equally important – every performance-critical query should include the shard key in the WHERE clause to avoid expensive scatter-gather operations.

Next, we’ll take a closer look at range-based sharding and how its cost and performance dynamics differ.

2. Range-based Sharding

Range-based sharding splits data into continuous value ranges, with each range assigned to a specific shard. Unlike hash-based sharding, this method improves data locality, making it particularly effective for read-heavy workloads. For instance, user IDs from 1 to 1,000,000 might be stored on one shard, while IDs from 1,000,001 to 2,000,000 are on another. This approach is especially useful for time-series data, geo-partitioned workloads, and multi-tenant SaaS platforms where queries naturally align with specific ranges or entities.

Performance Gains

One of the standout benefits of range-based sharding is its ability to enhance data locality. When a query targets a specific range – like retrieving all orders from a particular month – the system routes the request to a single shard. This eliminates the need for scatter-gather operations, significantly improving performance for read-heavy tasks.

However, there’s a catch: if the shard key is monotonically increasing, such as a timestamp or sequential ID, new write operations tend to pile up on the most recent shard. This creates a hotspot, where one shard might be running at 90% CPU usage while others remain underutilized at around 20%.

"Do not choose a timestamp as the distribution column." – Citus Documentation

Infrastructure Costs

Routing logic for range-based sharding is straightforward and deterministic, often relying on a simple configuration file or lookup table. However, uneven hardware utilization can become a hidden expense. Heavily used shards may require vertical scaling or manual rebalancing, while underutilized shards still consume resources.

Operational Complexity

Managing range-based sharding requires careful monitoring of individual shard metrics rather than relying solely on cluster-wide averages. When a range grows too large or becomes a hotspot, the shard must be split, and routing tables updated. This process often involves dual-writes to maintain consistency.

For example, Discord faced challenges with Cassandra when sharding message history by channel_id. As large servers began to overwhelm specific shards, they developed a custom Rust-based request coalescing layer to handle the load.

"The shard key is the most consequential decision in a sharding setup. A bad shard key creates problems you cannot fix without resharding." – Prathamesh Sonpatki, Evangelist, Last9

Scalability

In multi-tenant systems, using a compound key – such as combining tenant_id with local_id – can co-locate a tenant’s data while avoiding global key collisions. This strategy bypasses the issues of monotonic keys and ensures a more balanced write distribution while still maintaining data locality.

Additionally, pre-sharding into more logical partitions than currently needed can make future scaling less disruptive. Moving a single logical shard is far easier than re-bucketing an entire dataset. Up next, we’ll dive into the pros and cons of each sharding method.

3. Directory-based Sharding

Directory-based sharding takes a different approach compared to hash- and range-based methods. Instead of relying on formulas or continuous ranges, it uses a lookup table to determine where data resides. This table explicitly maps each key to a specific shard, acting like a routing system. Need to adjust where a tenant, user, or key is stored? Just update the corresponding entry in the directory – no formulas required.

Performance Gains

One of the biggest perks of directory-based sharding is precise rebalancing. If a shard becomes overloaded, you can easily move a tenant to a dedicated server by updating a single entry. Instagram famously used this method to isolate high-traffic accounts, showcasing its ability to fine-tune resource allocation.

Notion’s engineering team also leveraged directory-based sharding in a big way. In October 2021, they restructured their Postgres setup by mapping workspaces to 480 logical shards, using workspace_id as the shard key. They chose the number 480 to allow room for future scaling. By July 2023, they had expanded from 32 to 96 physical RDS instances, with the migration causing only about one second of visible impact per shard.

These examples highlight how directory-based sharding can boost performance while offering flexibility for scaling. But there’s more to consider – specifically, infrastructure needs and operational challenges.

Infrastructure Costs

When cached, directory lookups are incredibly fast, typically taking just 1–2 milliseconds. If the lookup relies on a configuration store instead, it’s slower, taking around 5–10 milliseconds. However, since the directory is a critical component, it must be highly available. Tools like Redis, etcd, or ZooKeeper are often used to replicate the directory across data centers and ensure robust failover systems. The stakes are high: if the directory fails, all routed requests fail too.

Another advantage is hardware flexibility. Small tenants can be placed on cost-effective servers, while larger, resource-heavy tenants can be allocated to premium hardware. This allows workloads to be matched with the most suitable infrastructure.

Operational Complexity

Managing cache invalidation is a key challenge. Whenever data moves between shards, every application server must update its local cache to ensure that writes go to the correct shard. If this process isn’t handled properly, it can lead to major issues:

"If [the directory] lies, writes go to the wrong shard." – Gabriel Anhaia, Senior Software Engineer

This highlights the directory’s dual nature: it’s both the system’s greatest strength and its most critical dependency.

Scalability

When it comes to scaling, directory-based sharding is hard to beat. Because placement is fully customizable, it offers unmatched flexibility for multi-tenant systems. Instagram’s engineering team summed it up well:

"Using this approach, we can start with just a few database servers, and eventually move to many more, simply by moving a set of logical shards from one database to another, without having to re-bucket any of our data."

Pros and Cons

Database Sharding Strategies: Cost vs. Performance Comparison

Database Sharding Strategies: Cost vs. Performance Comparison

Here’s a side-by-side comparison of sharding methods, focusing on their cost and performance trade-offs:

Strategy Performance Infrastructure Cost Operational Complexity Scalability
Hash-based Great for write distribution; poor for range queries due to scatter-gather; consistent hashing moves only ~1/N of data when adding a node Moderate – consistent hashing adds routing logic Moderate – resharding is challenging with modulo Very high; ideal for point lookups
Range-based Excellent for range scans; risk of hotspots on sequential keys Low – routing logic is straightforward Moderate – manual rebalancing is required High, but growth can become uneven
Directory-based Flexible; adds 1–2 ms lookup latency when cached High – requires a constantly available metadata service Highest – introduces a new failure domain to manage Maximum – allows moving tenants without re-bucketing

These metrics provide a clear snapshot of how each sharding method balances performance and operational demands.

One unavoidable aspect of sharding is its operational overhead. Tasks like backups, monitoring, and applying security patches must be repeated for every shard, which increases workload as the system grows. For example, during one migration, splitting a 2‑TB database into 12 shards raised monthly infrastructure costs from $14,000 to $19,000. However, this also delivered a 3.2× increase in throughput for just a 1.33× rise in cost.

Real-world scenarios add further layers of complexity. Under heavy query loads, the differences between sharding strategies become more pronounced. Hash-based sharding excels at avoiding write hotspots but struggles with range queries due to scatter-gather operations. Range-based sharding, on the other hand, handles range queries on a single shard but is prone to hotspot issues. Directory-based sharding avoids both problems through manual data placement, but it introduces a critical dependency on its lookup service. If that service fails, all routed requests are disrupted.

"Sharding only helps if you can reliably steer each request to the right place." – Martin Kleppmann, Author of Designing Data‑Intensive Applications

Scalability further highlights the differences. Range-based sharding works well until growth becomes concentrated at one end of the range. Directory-based sharding offers the smoothest scalability, but it demands the highest level of complexity and relies heavily on a highly available lookup layer.

This breakdown helps you weigh whether the operational and financial costs of sharding align with the performance benefits it offers.

Conclusion

Summarizing the analysis of sharding strategies, it’s clear that sharding should only be considered when absolutely necessary.

While sharding can significantly enhance database performance, it comes with high costs and is a one-way decision. Most well-optimized PostgreSQL or MySQL setups can handle substantial workloads without requiring sharding until critical thresholds are reached. The choice between hash-based, range-based, and directory-based sharding hinges on the specific demands of your workload and query patterns.

As outlined in the Semicolony System Design Handbook, sharding should be treated as a last resort after exhausting all vertical scaling options:

"Sharding is the last database lever, not the first. Read replicas, partitioning within a single host, denormalisation, caching, and just buying a bigger machine all come before sharding."

When sharding becomes unavoidable, picking the right strategy is essential. Hash-based sharding works best for high write volumes, range-based sharding suits sequential queries, and directory-based sharding offers the most control. Missteps here can lead to more operational challenges than performance improvements.

Ultimately, the trade-offs are clear: unless simpler optimizations have been fully utilized, the complexities of managing backups, schema migrations, and hot shards outweigh the benefits. Sharding should only be pursued when all other options have been exhausted.

FAQs

How do I know when sharding is truly necessary?

Sharding is not something to jump into lightly – it should only be considered when you’ve reached unavoidable physical limits. These might include exceeding your system’s write throughput, running out of storage capacity, or needing to isolate data geographically.

Before taking the sharding route, explore other scaling options like vertical scaling, read replicas, connection pooling, or table partitioning. These solutions can often address performance bottlenecks without introducing the added complexity of sharding.

Why the caution? Sharding is a one-way street. Once implemented, it can’t be undone easily, and it brings significant challenges. If you do decide to shard, selecting the right shard key is critical. Aim for one with high cardinality and an even data distribution to avoid performance headaches down the road.

What’s the biggest hidden cost of sharding?

The most overlooked downside of sharding is the growing operational complexity it brings. Routine tasks like schema migrations, keeping track of shard health, and handling distributed retries demand a lot of engineering time and effort.

Then there are scatter-gather queries – those queries that don’t use a shard key and end up hitting all shards. These queries increase latency, put extra strain on resources, and can spiral into higher maintenance costs, performance bottlenecks, and hard-to-trace inconsistencies throughout the system.

How do I pick a shard key that won’t backfire later?

When selecting a shard key, aim for a balance between high cardinality, even data distribution, and query alignment. This helps avoid hotspots and minimizes expensive scatter-gather operations. Stick to immutable values, such as UUIDs, instead of fields that may change over time, like usernames. For workloads with frequent range-based queries, range-based sharding is a better fit. Otherwise, hash-based sharding with consistent hashing ensures more stable and scalable performance as your workload grows.

Related Blog Posts