July 14, 2017

System Design Interview: How to implement TinyURL?


TinyURL shortens the long URL. URL shortening will have below benefits:
  • Save space when displayed, printed or messaged.
  • Consume less space when tweeted.
  • Users are less likely to mistype the shorter URL's.
Normally during the interview, most of us draw below diagram, which is actually wrong:
Why that's the wrong answer? because the person who is taking the interview us not looking for a simple solution where you take a longer url and convert it into a shorter one, store it in a map and returning the long url from the map whenever required. 

The interviewer is interested ti test your knowledge on durability and scalability. The above diagram is correct, but sadly its neither durable nor scalable.

Let us started with the Database design.
In our case we will have a single relationship between long and short URL. We can use relational database (RDBMS), but NoSQL DB is better in this case as they're faster for writes and simple key-value reads.

Identify Core Features
Let's say we want to make our URL 7 characters long:
If we use a base10 system for our short URL, we would only have 10 million (10 ^ 7) possible shortened urls.

If we use a base62 system([A-Z a-z0-9]) for our short URL, we would only have 3.5 trillion (62 ^ 7) possible shortened urls.

Solution 1
In this approach we will compute a unique hash using MDF5 (we can use SHA256) for a given long url. Then we encode it using base 62.

MDF5 will produce 128-bit hash value, after base62 encoding we will get a long string, we will trim the string and take only first 7 characters for the shorturl.
MDF5(longURL)-- > base62encode(128-bit hash value) --> short_url --> trim (short_url, 7)

Above approach has one drawback as it could result in key duplication/ collisions in the DB. However this is unlikely.

Solution 2
In this approach we are not encoding the URl, we will encode a number which will in turn guarantee that no duplicates or collision will occur.

We will have a counter service which will generate random number, which is base62 encoded and a short URL is generated.

Counter (0-3.5 trillion) -- > base62encode() --> short_url

There are two drawbacks of this approach:
  • We can have multiple instances of our webservices which creates the short URL, but we need to have a single counter service. What if that counter service fail?
  • What if the requests spike and our counter service might not be able to handle it?
To solve the single point of failure issue we can use a distributed systems manager, such as Zookeeper which will give server unused ranges.

Lets say we have three servers S1, S2 and S3 (on which multiple instances of our webservices which creates the short URL are running). We can assign different counters to these servers like: S1 can generate the counter with value between 0-1 trillion, S2 with 1-2 trillion and S3 with 2-3 trillion.

There might be a chance that your Zookeeper is down, to avoid this we can run multiple instances of Zookeeper.

Also, if a new server is added we can assign a new unused range. Also if range runs out for any existing server, Zookeeper can assign a new unused range.

There is still one more drawback of solution 2, we are generating URLs in a sequential manner, which can be a security threat. To overcome this, we could add another random 10-15 bits after the counter number to increase the entropy(lack of order or predictability).

We can also use Base64 encoding, by which we can create 64 unique ids with a single character.

Scaling The System

Caching
We can speed up the database reads, by put the data in memory (like redis). When the cache is full, we can remove the some of the URLs by using Least Recently Used (LRU) cache.

Load Balancer
Initially we could use a simple Round Robin approach that distributes income requests equally among our backend servers. This is the easiest way to implement load balancing.

However, it does not take server load into consideration, so our LB would still forward requests to an overloaded/ slow server.

We can use least connection or least response or lease bandwidth load balancer.

-K Himaanshu Shuklaa..

No comments:

Post a Comment