How to build a Tiny URL service that scales to billions?

One of the design constraints of Twitter is that every message is limited to 140 characters but if you like to post URLs in your Twitter messages like I do, those 140 characters become very dear. That’s probably why Twitter automatically converts any URLs over about 30 characters to short URLs using the TinyUrl service or any other service . Later on many url shortening services cropped up in the market namely Bitly, Google Url Shortner, Rebrandly etc. Many business also started using them in their marketing campaigns, ad campaigns etc on various channels.

Creating a unique, repeatable identifier for a URL

I think a lot of people’s first instinct might be to go for hashing the URL string — this isn’t a good idea for a few reasons though:

  • Length : Most normal hashing algorithms (say md5) produce long strings, which kind of goes against the point of a url shortener.

Today we will discuss how to build and develop a tiny url service. First lets break down into 3 parts , algorithm , design and how to scale and its intricacies.

Algorithm:

We have 62 alpha numeric chars i.e. [a-z 0–9 A-Z], though hyphen(-) and underscore(_) are allowed in a url still we want to avoid them as it would be a bad looking url like http://abc.com/c0--rw_ or http://abc.com/______-.
Following is the simple implementation of base10 to base62 converter, that’s all we need to shorten a url.

With 62 chars and a unique string 7,8,9,10,11 char long we can shorten:

62⁷ = 3,521,614,606,208 urls

62⁸ = 218,340,105,584,896 urls

62⁹ = 13,537,086,546,263,552 urls

62¹⁰ = 839,299,365,868,340,224 urls

62¹¹ = 52,036,560,683,837,093,888 urls

So you can see from above we can generate these 62⁶ = ~5 billion possible strings & 62⁸ = ~218 trillion possible strings and much more depending on need.

So lets say we decide we want to generate shorten url for below link

https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html

so we will generate a unique id using base 62 , append it to our hosted domain lets say generated id is qa12WS4 and our hypothetical hosted domain is http://short.io so my tiny url becomes http://short.io/qa12WS4

Now we just have to make sure this link is unique and not assigned again, we will discuss the strategies on how to store in later part of this blog.

Below are the two functions which can be used to generate unique strings:

private static final char[] corpus   = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();/*
* Note if seed is unique then generated base62 number will be unique as well under load balance make sure this value is not same.
*/
public static final String getBase62From10(final long seed)
{
String number = seed + "";
char[] buf = new char[number.length()];
int charPos = number.length() - 1;
BigInteger bigIntegerNumber = new BigInteger(number);
BigInteger radix = BigInteger.valueOf(62);

while (bigIntegerNumber.compareTo(radix) >= 0)
{
buf[charPos--] = corpus[bigIntegerNumber.mod(radix).intValue()];
bigIntegerNumber = bigIntegerNumber.divide(radix);
}
buf[charPos] = corpus[bigIntegerNumber.intValue()];
return new String(buf, charPos, (number.length() - charPos));
}
/**
* @param longNumber
* a positive number in base 62
* @return the same number, in base 10
*/
public static final String getBase10From62(final long longNumber)
{
String number = longNumber + "";
BigInteger value = BigInteger.ZERO;
for (char c : number.toCharArray())
{
value = value.multiply(BigInteger.valueOf(62));
if ('0' <= c && c <= '9')
{
value = value.add(BigInteger.valueOf(c - '0'));
}
if ('a' <= c && c <= 'z')
{
value = value.add(BigInteger.valueOf(c - 'a' + 10));
}
if ('A' <= c && c <= 'Z')
{
value = value.add(BigInteger.valueOf(c - 'A' + 36));
}
}
return value.toString();
}

Design:

Now Lets move on to the design part of the application. Here can be multiple approaches depending on the load of the system or if the application is behind the load balancer etc. First the simplest and second an improvement over the first.

In the simplest case we could probably get by with below columns:

  • id (DB generated sequence ID)

Generating the identifier from a DB

  1. Now, if you provide a String URL value, your code just needs to insert it into the table along with creation date, this will create the row and the unique ID.

This approach has 2 drawbacks:

  • First, we are doing two DB operations , insert and update , under heavy load it will not scale.

Now let’s discuss the improvements, we can improve two things Firstly the DB structure and secondly instead of inserts and update we can do single inserts only. DB structure as follows:

  • id_hash (base 62 generated string as primary key)

Now to have id_hash as primary key we need to have a centralized service which gives me unique seed tokens say for example we use Redis auto-increment feature(as it is atomic in nature) we can get a seed number and generate a base 62 string , this will work behind two instances under load balances as well. A number of approaches can be build around to get unique seed depends on the requirement.

Scale & Intricacies of this system:

Traffic estimates: For 500M new URL shortenings per month, we can expect (100 * 500M => 50B) redirections during that same period. What would be Queries Per Second (QPS) for our system?

New URLs shortenings per second:

a) 500 million / (30 days * 24 hours * 3600 seconds) = ~ 192 URLs/s

b) 1000 million / (30 days * 24 hours * 3600 seconds) = ~386 URLs/s

URLs redirections per second, considering 100:1 read/write ratio:

a) 100 * 500M = 50 Billion Redirections

50 billion / (30 days * 24 * 3600 ) = ~ 19K/s

b) 100 * 1000 Million = 100 billion Redirections

100 billion / (30 days * 24 hours * 3600 sec) = ~38K/s

Storage estimates: Let’s assume we store every URL shortening request (and associated shortened link) for 2years. Since we expect to have 1 Billion new URLs every month, the total number of objects we expect to store will be 30 billion:

1000 million * 2 years * 12 months = 24 billion

Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate–we will dig into it later). We will need 15TB of total storage:

24 billion * 500 bytes = 12 TB

Bandwidth estimates: For write requests, since we expect 386 new URLs every second, total incoming data for our service will be 100KB per second:

386 * 500 bytes = ~200 KB/s

For read requests, since every second we expect ~19K URLs redirections, total outgoing data for our service would be 9MB per second.

38K * 500 bytes = ~18 MB/s

Memory estimates: If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80–20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs.

Since we have 38 K requests per second, we will be getting 3.4 billion requests per day:

38 K * 3600 seconds * 24 hours = ~3.4 billion

To cache 20% of these requests, we will need 340GB of memory.

0.2 * 3.4 billion * 500 bytes = ~340 GB

Now since we have the brief idea of scale, we can design constraints to build the system say we can limit users to a certain number of URL creations and redirections per some time period.

And we need to handle load on the DB(SQL or NoSQL) as well. Now lets move to intricacies relating to scale. For DB scale we will require :

a. Range Based Partitioning: We can store URLs in separate partitions based on the first letter of the URL or the hash key or based on the creation date or expiration date. This approach is called range-based partitioning.

The main problem with this approach is that it can lead to unbalanced partitions.

b. Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash.

Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any key to a number between [1…256]), and this number would represent the partition in which we store our object.

Purging the DB data:

Should entries stick around forever or should they be purged? If a user-specified expiration time is reached, what should happen to the link?

If we chose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will make sure that only expired links will be deleted, although some expired links can live longer but will never be returned to users.

  • Whenever a user tries to access an expired link, we can delete the link and return an error to the user.

Caching is another intricate aspect for frequent access url’s.

How much cache should we have?

We can start with 20% of daily traffic and, based on clients’ usage pattern, we can adjust how many cache servers we need. As estimated above, we need 340 GB memory to cache 20% of daily traffic.

Which cache eviction policy would best fit our needs?

volatile — ttl , LRU etc are multiple options.

References https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904

There could be many more scaling issues , I have tried to cover the basic minimum requirements while building a scalable shortening url service.


What am I missing here ? Let me know in comments section and I'll add in!
What’s next? Subscribe Learn INQuiZitively to be the first to read my stories.