URL Shortener System Design – Interview Question
Design a URL Shortener System. Let’s find out how to answer this question in a technical interview.
Since it’s a system design question, it’s very open ended. As a candidate, you need to ask lots of questions and have a good discussion about various approaches that you can take. You can even discuss similar pre-existing systems like bit.ly, tinyURL, etc.
To begin with, you can ask if you are designing a web page or just an API. I think it makes more sense here to just design a service/ API with two main endpoints:
i) Given a long URL -> generate a short URL and return it
ii) Given a short URL -> return the corresponding long URL
You can have extra features like user can pick the short URL or the short URLs expire after some time. You need to understand that these are secondary or extraneous features. The focus should be on two things: the core functionalities and scalability.
So, let’s look at the core functionality first…
How do you generate the short URL?
The first question that arises here is: what all are the acceptable characters in the generated short URL?
The acceptable characters for a URL are small a to z, capital A to Z and digits 0 to 9. You can discuss about special characters to be included or not. If yes, what all characters make sense? In the real world, there are certain characters which are allowed like dash (-), dot (.), etc. And there are certain special characters which are not allowed like space ( ), backslash (\), etc. And lastly there are certain reserved characters like question mark (?), forward slash (/), etc. These reserved characters cannot be used in your generated short URL because they are interpreted differently.
It’s perfectly acceptable to mention here that you can look it up on Wiki/ StackOverflow or the actual RFC for the list of acceptable characters in a URL.
For simplicity, let’s stick to 26 small characters, 26 capital characters and 10 digits.
So overall, it’s 26 + 26 + 10 i.e. 62 characters. If your short URL is 7 characters long, then you have 62^7 unique combinations possible. This is equal to a few trillions.
You can discuss if the range is good enough for the system you are designing. That depends on how many short URLs you expect to create/ store.
Coming back to the question of how to design the short URL –
The first solution that comes to mind is using a counter. The counter keeps incrementing every time a new request is received. That way it’s going to be unique as we don’t reuse the same integer. We can then do base 62 encoding and return the short URL. To help you understand what is base 62 encoding, let me show you an example:
When you convert a decimal number to a binary number, you basically do base 2 encoding.
Suppose, you have number 786. When you convert it to binary, you repeatedly divide it by 2 and
786%2 = 0 786/2 = 393
393%2 = 1 393/2 = 196
196%2 = 0 196/2 = 98
98%2 = 0 98/2 = 49
49%2 = 1 49/2 = 24
And so on, the final binary number result is:
Similarly, to do base 62 encoding, you will have to repeatedly divide your number by 62 and use a representation like this:
Numbers 0-9 are represented by 0-9.
Small a-z are represented by 10 to 35.
Capital A- Z are represented by 36 to 61
To convert 786 to base 62:
786%62 = 42 786/62 = 12
12%62 = 12 12/62 = 0
786 with base 62 encoding will be “cG”.
Since this is shorter than 7 characters, we can either append zeros to the generated short URL or just keep it like this and make sure our algorithm handles it properly.
The concern with the counter-based approach is that your next short URL can be predictable. This may or may not be of concern to the system design. You can always bring it up. And then see if the interviewer cares about it or not. There are two possible ways that I can think of right now to make it more unpredictable. First is, we can add a few extra random bits at the end or front of the counter result. This will make it difficult to guess but will also make the short URL a little longer.
Another approach can be instead of incrementing the counter by 1, we can get a random number picked by a random number generator. Once the number is picked, we will have to check if its previously used or not. You keep generating these random numbers until you find the one which has not been used previously. This approach can create delays on generating the short URL, if the random number generator keeps generating used numbers. The probability of this will go high as more and more numbers get used. So this might not be the best approach but still worth mentioning during the interview and ruling it out together.
Hashing the Long URL:
Another approach that comes to mind is what if you hash the long URL into a short string and return the encoded hash value as short URL. With a hash function, you can certainly get shorter outputs. You can then encode the hash value to base 62 and return it as short URL. Using MD5 algorithm, we get a 128-bit hash value. Upon encoding our result will be 21 characters long, we can choose only first 7 characters.
The problem with that approach can be possible hash collision. Since you want the short URL to be unique and no two long URLs should have the same short URL, this might be tricky here. You must check that if the short URL has been previously used or not. If previously used, you should keep recalculating the hash until it’s unique or use some other approach like choosing different characters from the encoded hash value.
This may not be the best approach but it’s still worth mentioning during the interview. These design questions are more about discussing the pros and cons of different approaches and have a good conversation with the interviewer and not about the perfect design.
So far the counter based approach looks most promising, so let’s stick to that.
The next question that the interviewer will most probably ask you will be around scalability. What if you are receiving thousands of requests per second? How will you scale it? This is a very common question asked during system design interviews.
To scale the system, the first option is, you can add a load balancer in front of it and have multiple processes take care of it. It will look something like this:
The issue is if you have a common counter process, that can be the bottle neck. And if you have multiple counter processes, how do you make sure the different counters are not returning the same number. If they are somehow in sync with each other then they can ensure that their results are unique, but the syncing can be cumbersome and error prone.
The way you can combat collision among multiple processes is, you can divide the range among them. For e.g. Counter 1 can take 1 to 1 million, Counter 2 can take from 1 mllion 1 to 2 million and counter 3 can take 2 million 1 to 3 million. That way each of them has a unique range and that ensures that their output will be unique. You can have a coordination service which can take care of assigning the range to each of them, once some counter exhausts it’s range.
Another common design requirement for URL shortener design question is that the system should be extremely fast. Given a short URL, you need to return the corresponding long URL quick so that the customer can be redirected quickly. You can discuss about caching this data in the server. You can use Redis/ memcache, etc. for this purpose. If the short URL is in cache, bingo. If not, we can hit the DB, get the info from DB and store it in the cache and return it. Next time, if request for same URL comes again it should be there in cache. If while inserting into cache, cache is full, we can remove the one which has not been used in a long time. Such a policy is called LRU (Least Recently Used) policy for cache eviction. Imagine one of your short URL is a big hit one day and you are receiving a lot of requests for the same short URL.
Next question will be what kind of DB you can use? You need a fast DB so you can get the long URL based on short URL quickly as well as check if the short URL is previously used or not. We have broadly 2 options:
SQL DBs are usually good when you have complex data with joins and relationships. In our case, it’s just one single entity of relationship between long URL and short URL. Also, SQL DBs have ACID properties and if we have proper constraints in place like the short URL has to be unique then we can be sure that they will always be unique irrespective of the number of workers accessing the DB simultaneously.
If we use a NoSQL DB, then we can use the number or the short URL as the key. This is simpler and can have better performance but do mention that the data here can have eventual consistency. Since data is not complex and horizontal scaling is much better in this scenario, NoSQL DB should be our choice here.
Next question that might come to you is how would you scale the database component?
Well, the answer here is sharding/ partitioning the database. How will you partition the data? You can divide the data based on range. For e.g. all the short URLs that start from a to j are stored in one shard and from k to t in the second shard and so on. Depending on how many shards and how much we expect each shard to be loaded. Another way of data partitioning can be Hash Based partitioning where you calculate the hash value of the short URL and that decides which shard the data will be stored in.
Towards the end, if you still have time, now you can discuss those extraneous features like if you want to allow users to be able to choose the short URL or some part of it. Or if you want to keep those links forever or expire them after some time. You can also discuss about the analytics functionalities that you can add on top of it which might be useful to companies who want to know how their short URLs are performing. Just FYI, that’s how these URL shortener companies make money.
Again, you will have various pros and cons for each of these features as well as their implementations.
Lastly, I will highly recommend you go through the real architecture of various systems at different organizations. That will give you good ideas on how these companies solve these common problems with scalability, security, cost while keeping it fast and easy to maintain. All these requirements are sometimes hard and may contradict each other and need a good balance/ trade off. The more knowledge you have, the better discussions you can have during the interview.
Hope this helps with this type of interview question. I will be adding more such questions to my YouTube channel. So please subscribe.
Until next time, happy coding 😊