Database Partitioning Simplified
You are an event organizer and let’s say all locations where you organize your events and parties are of the same size, 100 people at a time.
Inspired by the quality of service you deliver to your customers, you got an exciting offer to organize a party of 1000 members. It’s an opportunity to take your business to heights.
You will try to find a place for 1000 peoples to enjoy the party simultaneously. It’s arduous and expensive to find such a space.
Data Partitioning (Horizontal Scaling)
You know you can manage as many as required spaces of a capacity of 100 people. So you applied the basics of maths.
In a single hall, 100 people can enjoy a party.
We have to organize a party for 1000 people
So, how many halls we need to book = 1000/100 = 10 halls.
You just did database partitioning. Don’t believe me, just review the problem again.
You have databases with a capacity of 50 TB and you want to store 500 TB of data. How many databases do you need?
Number of database required = 500 TB/50 TB = 10 databases
One more example, your server can handle 10k requests per second and you need to handle a load of 500k requests per second. How many servers do you need?
Number of servers needed = 500k / 10k = 50 servers.
Each server in the above examples acts as a database partitioning.
Condition for data partitioning
A piece of data should exactly belong to one partition.
Skew and Hotspot
Back to our party example, you booked 10 halls for organizing parties for 1000 people.
At the time of the party, you observed that each hall doesn’t have exactly 100 people as you expected. In some halls, there are more than 100 people and in some, there are less than 100. This is called Skew.
You observe that all 1000 people came to the same hall and the remaining 9 halls are empty. What a wastage of resources and experience. This is known as Hot spot situation.
Skew and Hotspot Solution
You realized the hotspot problem is not good for business. So you tried thinking of various solutions.
Approach #1 ( Random Routing )
You set up a reception outside of the hall, receptionist randomly routes people to different halls.
This approach almost equally distributes the people across different halls. But the main drawback is if someone asks the receptionist in which hall a particular person is partying, the receptionist has no idea.
Approach #2 (Fixed Partitioning)
You ask the employee Id range from the HR of the organization which is partying with you. Then you divide like employee Id 101 to 200 will party in hall no. 1, employee Id 501 to 600 will party in hall no. 10, etc.
Databases big table and HBase use this approach.
Approach #3 (Hashing)
The hash function converts a given input to a number.
In our example, we take employee Id or Name. Our so talented receptionist applies the following formula to compute the hall no.
Hall No. = Hash(Employee Id) % Total number of Halls
Based you this formula, people are rounted to the corresponding hall.
The hash function used by Mongo DB and Cassandra is MD5.