Database Sharding
Database sharding is a technique used to horizontally partition a database into smaller, more manageable pieces called shards. Each shard contains a subset of the data and is stored on a separate server. This approach can improve the performance and scalability of large databases.
Written by Perlego with AI assistance
9 Key excerpts on "Database Sharding"
Guide to Cloud Computing for Business and Technology Managers
From Distributed Computing to Cloudware Applications
- Vivek Kale (Author)
- 2014 (Publication Date)
- Chapman and Hall/CRC (Publisher)
21.3.3 Row Partitioning or Sharding

In cloud technology, sharding is used to refer to the technique of partitioning a table among multiple independent databases by row. However, partitioning of data by row in relational databases is not new and is referred to as horizontal partitioning in parallel database technology. The distinction between sharding and horizontal partitioning is that horizontal partitioning is done transparently to the application by the database, whereas sharding is explicit partitioning done by the application. However, the two techniques have started converging, since traditional database vendors have started offering support for more sophisticated partitioning strategies. Since sharding is similar to horizontal partitioning, we first discuss different horizontal partitioning techniques. It can be seen that a good sharding technique depends upon both the organization of the data and the type of queries expected. The different techniques of sharding are as follows:
1. Round-robin partitioning: The round-robin method distributes the rows in a round-robin fashion over different databases. In the example, we could partition the transaction table into multiple databases so that the first transaction is stored in the first database, the second in the second database, and so on. The advantage of round-robin partitioning is its simplicity. However, it also suffers from the disadvantage of losing associations (say) during a query, unless all databases are queried. Hash partitioning and range partitioning do not suffer from the disadvantage of losing record associations.
2. Hash partitioning method: In this method, the value of a selected attribute is hashed to find the database into which the tuple should be stored.
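The hash partitioning method described above can be sketched in a few lines. The helper below is an illustration, not code from the book; MD5 stands in for whatever hash function an actual system would use, because it is stable across processes (unlike Python's built-in `hash()` for strings):

```python
import hashlib

def hash_partition(attribute_value, num_databases):
    """Pick the database for a row by hashing a selected attribute.

    The same attribute value always maps to the same database, so a
    query on that attribute touches exactly one shard instead of all.
    """
    digest = hashlib.md5(str(attribute_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_databases

# Route a transaction row by its (hypothetical) customer id:
db_index = hash_partition("customer-42", 4)
```

This is what preserves record associations: all rows sharing the hashed attribute land together, whereas round-robin scatters them.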
Mastering MongoDB 4.x
Expert techniques to run high-volume and fault-tolerant database solutions using MongoDB 4.x, 2nd Edition
- Alex Giamas (Author)
- 2019 (Publication Date)
- Packt Publishing (Publisher)
Sharding
Sharding is the ability to horizontally scale out our database by partitioning our datasets across different servers, the shards. It has been a feature of MongoDB since version 1.6, released in August 2010. Foursquare and Bitly are two of MongoDB's most famous early customers, and have used the sharding feature from its inception all the way to its general release.

In this chapter, we will learn about the following topics:
- How to design a sharded cluster and how to make the single most important decision concerning its use: choosing the shard key
- Different sharding techniques and how to monitor and administer sharded clusters
- The mongos router and how it is used to route our queries across different shards
- How we can recover from errors in our shard
Why do we use sharding?
In database systems, and computing systems in general, we have two ways to improve performance. The first one is to simply replace our servers with more powerful ones, keeping the same network topology and systems architecture. This is called vertical scaling.

An advantage of vertical scaling is that it is simple from an operational standpoint, especially with cloud providers such as Amazon making it a matter of a few clicks to replace an m2.medium with an m2.extralarge server instance. Another advantage is that we don't need to make any code changes, and so there is little to no risk of something going catastrophically wrong.

The main disadvantage of vertical scaling is that there is a limit to it; we can only get servers as powerful as those our cloud provider can give us. A related disadvantage is that getting more powerful servers generally comes with an increase in cost that is not linear but exponential. So, even if our cloud provider offers more powerful instances, we will hit the cost-effectiveness barrier before we hit the limit of our department's credit card.
Parallel Computing Architectures and APIs
IoT Big Data Stream Processing
- Vivek Kale (Author)
- 2019 (Publication Date)
- Chapman and Hall/CRC (Publisher)
To increase the throughput of transactions from the database, it is possible to have multiple copies of the database. A common replication method is master–slave replication. The master and slave databases are replicas of each other. All writes go to the master, and the master keeps the slaves in sync. However, reads can be distributed to any database. Since this configuration distributes the reads among multiple databases, it is a good technology for read-intensive workloads. For write-intensive workloads, it is possible to have multiple masters, but ensuring consistency when multiple processes update different replicas simultaneously is a complex problem. Additionally, because every write must go to all masters and the masters must be kept in sync, the time to write increases rapidly and becomes a limiting overhead.

19.2.3 Row Partitioning or Sharding
In cloud technology, sharding is used to refer to the technique of partitioning a table among multiple independent databases by row. However, the partitioning of data by row in relational databases is not new and is referred to as horizontal partitioning in parallel database technology. The distinction between sharding and horizontal partitioning is that horizontal partitioning is done transparently to the application by the database, whereas sharding is explicit partitioning done by the application. However, the two techniques have started converging, since traditional database vendors have started offering support for more sophisticated partitioning strategies. Since sharding is similar to horizontal partitioning, we will first discuss different horizontal partitioning techniques. It can be seen that a good sharding technique depends on both the organization of the data and the type of queries expected. The different techniques of sharding are as follows:
1. Round-robin partitioning: The round-robin method distributes the rows in a round-robin fashion over different databases. As an example, we could partition the transaction table into multiple databases so that the first transaction is stored in the first database, the second in the second database, and so on. The advantage of round-robin partitioning is its simplicity. However, it also suffers from the disadvantage of losing associations (say) during a query, unless all databases are queried. Hash partitioning and range partitioning do not suffer from the disadvantage of losing record associations.
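Round-robin distribution as described above can be sketched directly; this is an illustrative helper, not code from the book:

```python
def round_robin_partition(rows, num_databases):
    """Distribute rows over databases in round-robin order:
    row 0 to database 0, row 1 to database 1, and so on, wrapping
    around once every database has received a row.
    """
    databases = [[] for _ in range(num_databases)]
    for i, row in enumerate(rows):
        databases[i % num_databases].append(row)
    return databases

# Five transactions spread over two databases: the load stays even,
# but a query on any attribute must consult every database, which is
# the "lost associations" drawback noted above.
partitioned = round_robin_partition(list(range(5)), 2)
```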
Big Data Computing
A Guide for Business and Technology Managers
- Vivek Kale (Author)
- 2016 (Publication Date)
- Chapman and Hall/CRC (Publisher)
However, the two techniques have started converging, since traditional database vendors have started offering support for more sophisticated partitioning strategies. Since sharding is similar to horizontal partitioning, we first discuss different horizontal partitioning techniques. It can be seen that a good sharding technique depends on both the organization of the data and the type of queries expected. The different techniques of sharding are as follows:
1. Round-robin partitioning: The round-robin method distributes the rows in a round-robin fashion over different databases. In the example, we could partition the transaction table into multiple databases so that the first transaction is stored in the first database, the second in the second database, and so on. The advantage of round-robin partitioning is its simplicity. However, it also suffers from the disadvantage of losing associations (say) during a query, unless all databases are queried. Hash partitioning and range partitioning do not suffer from the disadvantage of losing record associations.
2. Hash partitioning method: In this method, the value of a selected attribute is hashed to find the database into which the tuple should be stored.
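Range partitioning is named above alongside hash partitioning but never defined in these excerpts: it assigns each row to a database according to which interval its key falls into. A minimal sketch (illustrative, not from the book):

```python
import bisect

def range_partition(key, boundaries):
    """Assign a key to a database by range.

    boundaries = [100, 200] defines three databases:
    keys below 100 go to database 0, keys from 100 up to (but not
    including) 200 go to database 1, and keys of 200 or more go to
    database 2. bisect_right finds the interval in O(log n).
    """
    return bisect.bisect_right(boundaries, key)
```

Unlike hashing, range partitioning keeps nearby keys together, so range scans touch few databases, at the risk of hotspots when keys are skewed.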
Big Data Analytics
A Practical Guide for Managers
- Kim H. Pries, Robert Dunnigan (Authors)
- 2015 (Publication Date)
- Auerbach Publications (Publisher)
Given the BSON data structures used by MongoDB, vertical data partitioning will not work. However, since MongoDB is a bit like a file cabinet, we have horizontal partitioning. We accomplish this splitting (sharding) with a sharding key function, which most computer scientists would recognize as the use of a hashing function. Because we can access the information we want across the network, multiple processors/machines function as if they were one large machine (note that we are not speaking of virtual machines per se). MongoDB can initiate auto-sharding, which allows the database to manage its own portions and simplifies work for the user.

In our library database, we might break things up by genre (we did not use "genre" in our example), or by series/nonseries, or whatever makes sense. As with the structure of the database, we want to use some modicum of thought and design to avoid potential digital train wrecks! It should also be noted that when we don't tell MongoDB the field we want to use to set up the sharding, it will use its own "_id" key, which is arbitrary. This approach is not necessarily deadly to our goal; however, it may affect local caching.

While sharding, it is important to note that MongoDB supports easy cluster growth, is relatively transparent (we do not have to know that much about how things are working in order to function), and has some redundancy to handle a failure at a node. A good big data database should not collapse gracelessly when the inevitable event occurs. If we shard on a given server, we produce subsets of our full database; however, if we shard across multiple servers, we produce copies of the subsets for each server. MongoDB supports automatic balancing and migration as the needs of the database change; for example, the situation can arise where a given shard grows to a suboptimal size, and the database will generally take care of this situation for us.
- Trista Pan, Zhang Liang, Yacine Si Tayeb (Authors)
- 2022 (Publication Date)
- Packt Publishing (Publisher)
The features of the L2 function layer are supported by background-running components of the L1 function layer that were introduced in the L1 kernel layer section.

Sharding

When we talk about sharding, we refer to splitting data that is stored in one database in a way that allows it to be stored in multiple databases or tables, in order to improve performance and data availability. Sharding can usually be divided into database sharding and table sharding. Database sharding can effectively reduce visits to a single database and thereby reduce database pressure. Although table sharding cannot reduce database pressure, it can reduce the amount of data in a single table, avoiding the query performance decrease caused by an increase in index depth. At the same time, sharding can also convert distributed transactions into local transactions, avoiding the complexity of distributed transactions. The sharding module's goal is to provide you with a transparent sharding feature and allow you to use the sharded database cluster like a native database.

Read/write splitting

When we talk about read/write splitting, we refer to splitting the database into a primary database and a replica database. Thanks to this split, the primary database will handle transactions' insertions, deletions, and updates, while the replica database will handle queries. Recently, databases' throughput has increasingly been facing a transactions-per-second (TPS) bottleneck.
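The read/write splitting described above can be sketched as a tiny SQL router. The class and names below are hypothetical illustrations, not ShardingSphere's actual API; writes go to the primary, while reads rotate across replicas:

```python
import itertools

class ReadWriteSplitter:
    """Send inserts, deletes, and updates to the primary database and
    spread queries across the replicas in round-robin order."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, sql):
        # Crude classification by the statement's first keyword; a real
        # middleware parses the SQL properly.
        verb = sql.strip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary
        return next(self._replica_cycle)

router = ReadWriteSplitter("primary", ["replica-0", "replica-1"])
```

A transparent middleware does exactly this decision per statement, so the application keeps issuing SQL against what looks like one database.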
The Cloud at Your Service
The when, how, and why of enterprise cloud computing
- Arthur Mateos, Jonathan Rosenberg (Authors)
- 2010 (Publication Date)
- Manning (Publisher)
The shared-nothing database partitioning on which sharding is based has been around for a decade or more. There have been many implementations over this period, particularly high-profile, in-house-built solutions by internet leaders such as eBay, Amazon, Digg, Flickr, Skype, YouTube, Facebook, Friendster, and even Wikipedia.

Sharding
A decomposition of a database into multiple smaller units (called shards) that can handle requests individually. It's related to a scalability concept called shared-nothing that removes dependencies between portions of the application such that they can run completely independently and in parallel for much higher throughput.

There was a time when you scaled databases by buying bigger, faster, and more expensive machines. Whereas this arrangement is great for big iron and database vendor profit margins, it doesn't work so well for the providers of popular web-facing services that need to scale past what they can afford to spend on giant database servers (and no single database server is big enough for the likes of Google, Flickr, or eBay). This is where sharding comes in. It's a revolutionary new database architecture that, when implemented at Flickr, enabled that service to handle more than one billion transactions per day, responding to requests in less than a few seconds, and that scaled linearly at low cost. It does sound revolutionary, doesn't it? Let's look at it more closely.

Sharding in a Nutshell
In the simplest model for partitioning, as diagrammed in figure 5.2, you can store the data for User 1 on one server and the data for User 2 on another. It's a federated model. In a system such as Flickr, you can store groups of 500,000 users together in each partition (shard). In the simple two-shard design in figure 5.2
- S C Moon, H Ikeda (Authors)
- 1993 (Publication Date)
- World Scientific (Publisher)
Research on distributed database systems (DDBS) has increased because DDBSs can improve reliability and availability and fit more naturally in the decentralized structures of many organizations [Ceri 84, Oszu 91]. Data fragmentation and allocation, called data distribution, is the basis for constructing distributed database systems. Since it is possible to construct distributed database systems on PCs, research on data distribution for PC-based distributed database design should be emphasized. In this paper a distribution scheme for PC-based distributed database systems is proposed by using a mixed fragmentation technique. Section 2 deals with some factors that should be considered.

Data fragmentation (or partitioning) is the process that divides a logical object (relation) from the logical schema of the database into several physical objects (files) in a stored database [Nava 84]. There are basically two different ways of data fragmentation: vertical partitioning and horizontal partitioning. Vertical partitioning is the process of dividing attributes into groups. Previous work on vertical partitioning has used objective functions to perform partitioning [Ceri 88, Corn 87, Hamm 79, Nava 84]. Since in these approaches the binary partitioning technique must be applied recursively, and objective functions and complement algorithms such as the SHIFT algorithm [Nava 84] are needed, we developed a graph-theoretic algorithm that generates all meaningful vertical fragments in one iteration [Nava 89]. Horizontal partitioning is the process of dividing the tuples in a relation into groups of tuples. In most of the previous approaches, the problem is that there may be many horizontal partitions, since in the worst case a horizontal partition can be composed of only one tuple [Ceri 82, Cer 83b, Yu 85].

*This research was supported in part by KOSEF (Korea Science & Engineering Foundation) Grant No. 923-1100-088-1.
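The two fragmentation styles contrasted above can be shown on a toy relation. This sketch is illustrative (not from the paper); note that each vertical fragment keeps the key attribute so the original tuples can be reconstructed by join:

```python
def vertical_fragments(relation, key, attribute_groups):
    """Vertical partitioning: divide attributes into groups, keeping
    the key attribute in every fragment for reconstruction by join."""
    return [
        [{key: t[key], **{a: t[a] for a in group}} for t in relation]
        for group in attribute_groups
    ]

def horizontal_fragments(relation, predicate):
    """Horizontal partitioning: divide the tuples of a relation into
    groups of tuples, here by a single selection predicate."""
    return ([t for t in relation if predicate(t)],
            [t for t in relation if not predicate(t)])

# A hypothetical two-tuple employee relation:
emp = [{"id": 1, "name": "Kim", "dept": "DB"},
       {"id": 2, "name": "Lee", "dept": "OS"}]
```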
RDF Database Systems
Triples Storage and SPARQL Query Processing
- Olivier Curé, Guillaume Blin (Authors)
- 2014 (Publication Date)
- Morgan Kaufmann (Publisher)
The first way, known as vertical scaling or scaling up, upgrades a given machine by providing more central processing unit (CPU) power, more memory, or more disks. This solution has an intrinsic limit, because one can reach the limit of upgrading a given machine. This approach is also considered to be very expensive, and it establishes a strong connection with a machine provider. The second approach, known as horizontal scaling or scaling out, implies buying more (commodity) machines when scalability issues are emerging. This solution has almost no limit in the number of machines one can add, and is less expensive due to the low prices of commodity hardware.

In terms of data distribution, two main approaches are identified: functional scaling and sharding. With functional scaling, groups of data are created by functions and spread across database instances. In our running example, functions could be users, categories, and blogs. Each of them would be stored on different machines. The sharding approach, which can be used together with functional scaling, aims at splitting data within functions across multiple database instances. For example, suppose that our user function is too large to fit into one machine; we can create shards out of its data set and distribute them on several machines. The attribute on which we create the shards is named the sharding key. In our running example, this could be the last name of the users.

A second factor of this sharding-based distribution is the number of machines available. Let's consider that we have two machines. A naive approach could be to store all users with a last name starting with letters from A to M included on the first machine, and with letters from N to Z on the second machine. This approach may not be very efficient if users' last names are not evenly distributed over the two machines.
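The A-to-M / N-to-Z example above is straightforward to sketch; this illustrative helper (not from the book) also shows why skew matters:

```python
def shard_by_last_name(last_name):
    """Naive range sharding key: last names starting A-M go to
    machine 0, N-Z to machine 1, as in the two-machine example."""
    return 0 if last_name[0].upper() <= "M" else 1

# If most users' last names happen to start with A-M, machine 0
# becomes a hotspot. Hashing the sharding key instead spreads rows
# evenly, at the cost of losing cheap range queries on the name.
```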
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.








