July 16, 2019

System Design Interview: Consistent Hashing

What is Hashing?
Hashing is the process of mapping one piece of data to another piece of data of fixed size. A function is usually used for mapping objects to hash code (which is typically an integer) known as a hash function.

Let's say we have fixed buckets numbered from 0-N. We have X number of balls (which is typically greater than or equal to N), and now we want to distribute these balls in the different buckets.

We can write a hash function, which will generate a random number based on color, size, the weight of the ball. Let say it return a hashcode 'hc'.

Now we will do hc modulo N (where N is the number of the bucket), this will return a number which is the bucket id.

There can be many possible balls which will map to the same integer. This is called a collision. Common solutions for handling collision are Chaining and Open Addressing. There are two kinds of hash functions cryptographic and non-cryptographic which are used for different purpose.

Hash table or Hash Map is a common data structure in computer science which is used for constant time lookup.

What is Distributed Hashing?
Let's say we are storing employees in a table. And the number of employees kept growing with each passing day and it becomes difficult to store all employee information in just one table (which can fit on a single computer).

In this situation, we will try to distribute the table to multiple servers to avoid memory limitation of one server. Objects (and their keys) are distributed among several servers. This kind of setup is very common for in-memory caches like Memcached, Redis etc.

Now we have multiple servers, how are we going to determine which server is going to store the particular employee? The simplest solution for this is to take the hash modulo of the number of servers.

For example, server = hash(employee) modulo N, where N is the number of servers.

To store a employee, first hash the employee to get the hash code, then apply modulo of the number of server to get the server where we will be storing the employee.

Let's assume each server has its own cache. While retrieving the employees, the server first check if its present in cache, if yes it returns the details. Else it get the details from the table, put it in cache and return it.

What is Rehashing?
As the name suggests, Rehashing is the process of re-calculating the hash code of already stored entries (Key-Value pairs). This is usually done when we move the existing entries to another bigger size hashmap when the threshold is reached/crossed. Rehashing of a hash map is done when the number of elements in the map reaches the maximum threshold value

We stored our employees in a distributed environment of 3 servers (S1, S2 and S3). Now let's assume due to some reason one of the servers (S3) crashed, it’s no longer able to accept a request. Now we are only left with two servers (S1 and S2).

Now, since S3 is crashed, its cache is lost, and everything is rehased. Assume we somehow managed to take the backup of the table from S3, and now after the rehashing the data from the table is distributed on S1 and S2.

If there is a request for Employee E1, which was present in S3. But now after the rehashing details of E1 is stored in E2, since the cache is lost that employee will be again fetched from the origin and stored in the cache of S2.

Because of rehashing the server location for almost all employees is changed, not only for the employees stored in S3. Why? Because to get the server in which a particular employee was stored, we call the hash code and take module with the number of servers. The number of servers was 3, but now it's 2. This will increase the load on origin in case of caching servers as there will be a cache miss of keys and all of them needs to be rehashed. This is known as rehashing problem.

Explain Consistent Hashing. And how it is used for Load Balancing?
Load Balancing is a key concept for system design. One of the popular ways to balance load in a system is to use the concept of consistent hashing.

Consistent hashing solves the problem of rehashing by providing a distribution scheme which does not directly depend on the number of servers.

Consistent Hashing allows requests to be mapped into hash buckets while allowing the system to add and remove nodes(Servers) flexibly so as to maintain a good load factor on each machine.

Consistent Hashing operates independently of the number of servers in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.

Consistent hashing forms a keyspace, which is also called continuum. As a node joins the cluster, it picks a random number, and that number determines the data it's going to be responsible for. Everything between this number and one that's next in the ring and that has been picked by a different node previously, is now belong to this node. The resulting partition could be of any size theoretically. It could be a tiny slice, or a large one.

Suppose our hash function output range in between zero to INT_MAX, then this range is mapped onto the hash ring so that values are wrapped around. All keys and servers are hashed using the same hash function and placed on the edge of the circle. To find out which server to ask for a given key or store a given key, we need to first locate the key on the circle and move in a clockwise direction until we find a server.

e.g, let's say we have stored employees 1, 2, 3, 4, 5 and 6 on 3 servers. Before the crash, employees 4, 6, 3 were stored in S2, 5 and 1 in S3 and 2 in S1. All keys from server S3 will be moved to server S1 but keys stored on server S1 and S2 are not relocated. Now employees 4, 6, 3 were stored in S2, 5, 1 and 2 in S1.

In the consistent hashing when a server is removed or added then the only key from that server are relocated. But there is one problem when server S3 is removed then keys from S3 are not equally distributed among remaining servers S1 and S2. They were only assigned to server S1 which will increase the load on server S1.

How to solve the non-uniform distribution of data/keys?
Virtual nodes/replication is the solution to evenly distribute the load among servers when a server is added or removed. Instead of placing actual nodes on the ring, we placed virtual nodes there. Let’s assume we have 720 virtual nodes on a circle (thus placing them at an interval of 0.5 degrees).

Now we can map each of these virtual nodes to actual/real nodes such that virtual nodes at different locations within the ring map to the same real node.

We can use another hash function which randomly maps virtual nodes to real nodes. This way we get rid of hot spots due to uneven distribution of data

e.g: instead of server labels S1, S2 and S3, we will have S10 S11 .. S19, S20 S21 .. S29 and S30 S31 .. S39. The number of replicas (also known as weight), depends on the situation.
All keys which are mapped to replicas Sij are stored on server Si. To find a key we do the same thing, find the position of the key on the circle and then move forward until you find a server replica. If server replica is Sij then the key is stored in server Si.

Suppose server S3 is removed, then all S3 replicas with labels S30 S31 .. S39 will be removed. Now the object keys adjacent to S3x labels will be automatically re-assigned to S1x and S2x. At the same time, all keys originally assigned to S1 and S2 will not be moved.

Similar things happen if we add a server. Suppose we want to add a server S4 as a replacement of S3 then we need to add labels S40 S41 .. S49. In the ideal case, one-third of keys from S1 and S2 will be reassigned to S4.

Only the K/N number of keys are needed to remapped when a server is added or removed, where K is the number of keys and N is the number of servers.

-K Himaanshu Shuklaa..

No comments:

Post a Comment