One of the design constraints of Twitter is that every message is limited to 140 characters but if you like to post URLs in your Twitter messages like I do, those 140 characters become very dear. That’s probably why Twitter automatically converts any URLs over about 30 characters to short URLs using the TinyUrl service or any other service . Later on many url shortening services cropped up in the market namely Bitly, Google Url Shortner, Rebrandly etc. Many business also started using them in their marketing campaigns, ad campaigns etc on various channels.

Creating a unique, repeatable identifier for a URL

I think a lot of people’s first instinct might be to go for hashing the URL string — this isn’t a good idea for a few reasons though:

Length : Most normal hashing algorithms (say md5) produce long strings, which kind of goes against the point of a url shortener.
Unique-ness : Obviously, if this is going to be a URL identifier then it needs to be unique, and hashes by their very nature are not unique — which means you would need to handle the scenario where a URL creates an already used hash and has an alternative.
Look-up : Most of the hashes are not (easily) reversible, you would need to look up the URL using the hash as the db key — which may not be ideal given a very large set of URLs (imagine billions)

Today we will discuss how to build and develop a tiny url service. First lets break down into 3 parts , algorithm , design and how to scale and its intricacies.

Algorithm:

We have 62 alpha numeric chars i.e. [a-z 0–9 A-Z], though hyphen(-) and underscore(_) are allowed in a url still we want to avoid them as it would be a bad looking url like http://abc.com/c0--rw_ or http://abc.com/______-.
Following is the simple implementation of base10 to base62 converter, that’s all we need to shorten a url.

With 62 chars and a unique string 7,8,9,10,11 char long we can shorten:

62⁷ = 3,521,614,606,208 urls

62⁸ = 218,340,105,584,896 urls

62⁹ = 13,537,086,546,263,552 urls

62¹⁰ = 839,299,365,868,340,224 urls

62¹¹ = 52,036,560,683,837,093,888 urls

So you can see from above we can generate these 62⁶ = ~5 billion possible strings & 62⁸ = ~218 trillion possible strings and much more depending on need.

So lets say we decide we want to generate shorten url for below link

https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html

so we will generate a unique id using base 62 , append it to our hosted domain lets say generated id is qa12WS4 and our hypothetical hosted domain is http://short.io so my tiny url becomes http://short.io/qa12WS4

Now we just have to make sure this link is unique and not assigned again, we will discuss the strategies on how to store in later part of this blog.

Below are the two functions which can be used to generate unique strings:

private static final char[] corpus   = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();/*
* Note if seed is unique then generated base62 number will be unique as well under load balance make sure this value is not same.
*/
public static final String getBase62From10(final long seed)
{
  String number = seed + "";
  char[] buf = new char[number.length()];
  int charPos = number.length() - 1;
  BigInteger bigIntegerNumber = new BigInteger(number);
  BigInteger radix = BigInteger.valueOf(62);
 
  while (bigIntegerNumber.compareTo(radix) >= 0)
  {
   buf[charPos--] = corpus[bigIntegerNumber.mod(radix).intValue()];
   bigIntegerNumber = bigIntegerNumber.divide(radix);
  }
  buf[charPos] = corpus[bigIntegerNumber.intValue()];
  return new String(buf, charPos, (number.length() - charPos));
}
/**
* @param longNumber
* a positive number in base 62
* @return the same number, in base 10
*/
public static final String getBase10From62(final long longNumber)
{
  String number = longNumber + "";
  BigInteger value = BigInteger.ZERO;
  for (char c : number.toCharArray())
  {
    value = value.multiply(BigInteger.valueOf(62)); 
    if ('0' <= c && c <= '9')
    {
      value = value.add(BigInteger.valueOf(c - '0'));
    }
    if ('a' <= c && c <= 'z')
    {
      value = value.add(BigInteger.valueOf(c - 'a' + 10));
    }
    if ('A' <= c && c <= 'Z')
    {
      value = value.add(BigInteger.valueOf(c - 'A' + 36));
    }
   }
   return value.toString();
}

Design:

Now Lets move on to the design part of the application. Here can be multiple approaches depending on the load of the system or if the application is behind the load balancer etc. First the simplest and second an improvement over the first.

In the simplest case we could probably get by with below columns:

id (DB generated sequence ID)
original_url — original URL value
shorten_url
creation_date
expiration_date

Generating the identifier from a DB

Now, if you provide a String URL value, your code just needs to insert it into the table along with creation date, this will create the row and the unique ID.
Next, fetch that unique numeric ID, and convert it to base-62 (this will convert the numeric value into the base-62 representation (rather than normal base10, it will allow 0–9, a-z, A-Z as characters. This gives you an identifier in the form of qa12WS4 .
Now append the base62 string into your base url of your short domain http://short.io and voila you get a shorten url as http://short.io/qa12WS4 and update it to the short url column with expiration date.
Now You have to just write redirection logic as who ever hits this shorten url you fetch the details from Db and redirect it to original Url.

This approach has 2 drawbacks:

First, we are doing two DB operations , insert and update , under heavy load it will not scale.
Secondly Incase of Database migration , the sequence ids can not be merged as we might have same sequence id generated into two tables.

Now let’s discuss the improvements, we can improve two things Firstly the DB structure and secondly instead of inserts and update we can do single inserts only. DB structure as follows:

id_hash (base 62 generated string as primary key)
original_url
shorten_url
creation_date
expiration_date

Now to have id_hash as primary key we need to have a centralized service which gives me unique seed tokens say for example we use Redis auto-increment feature(as it is atomic in nature) we can get a seed number and generate a base 62 string , this will work behind two instances under load balances as well. A number of approaches can be build around to get unique seed depends on the requirement.

Scale & Intricacies of this system:

Traffic estimates: For 500M new URL shortenings per month, we can expect (100 * 500M => 50B) redirections during that same period. What would be Queries Per Second (QPS) for our system?

New URLs shortenings per second:

a) 500 million / (30 days * 24 hours * 3600 seconds) = ~ 192 URLs/s

b) 1000 million / (30 days * 24 hours * 3600 seconds) = ~386 URLs/s

URLs redirections per second, considering 100:1 read/write ratio:

a) 100 * 500M = 50 Billion Redirections

50 billion / (30 days * 24 * 3600 ) = ~ 19K/s

b) 100 * 1000 Million = 100 billion Redirections

100 billion / (30 days * 24 hours * 3600 sec) = ~38K/s

Storage estimates: Let’s assume we store every URL shortening request (and associated shortened link) for 2years. Since we expect to have 1 Billion new URLs every month, the total number of objects we expect to store will be 30 billion:

1000 million * 2 years * 12 months = 24 billion

Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate–we will dig into it later). We will need 15TB of total storage:

24 billion * 500 bytes = 12 TB

Bandwidth estimates: For write requests, since we expect 386 new URLs every second, total incoming data for our service will be 100KB per second:

386 * 500 bytes = ~200 KB/s

For read requests, since every second we expect ~19K URLs redirections, total outgoing data for our service would be 9MB per second.

38K * 500 bytes = ~18 MB/s

Memory estimates: If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80–20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs.

Since we have 38 K requests per second, we will be getting 3.4 billion requests per day:

38 K * 3600 seconds * 24 hours = ~3.4 billion

To cache 20% of these requests, we will need 340GB of memory.

0.2 * 3.4 billion * 500 bytes = ~340 GB

Now since we have the brief idea of scale, we can design constraints to build the system say we can limit users to a certain number of URL creations and redirections per some time period.

And we need to handle load on the DB(SQL or NoSQL) as well. Now lets move to intricacies relating to scale. For DB scale we will require :

a. Range Based Partitioning: We can store URLs in separate partitions based on the first letter of the URL or the hash key or based on the creation date or expiration date. This approach is called range-based partitioning.

The main problem with this approach is that it can lead to unbalanced partitions.

b. Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash.

Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any key to a number between [1…256]), and this number would represent the partition in which we store our object.

Purging the DB data:

Should entries stick around forever or should they be purged? If a user-specified expiration time is reached, what should happen to the link?

If we chose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will make sure that only expired links will be deleted, although some expired links can live longer but will never be returned to users.

Whenever a user tries to access an expired link, we can delete the link and return an error to the user.
A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low.
We can have a default expiration time for each link (e.g., two years).
Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever.

Caching is another intricate aspect for frequent access url’s.

How much cache should we have?

We can start with 20% of daily traffic and, based on clients’ usage pattern, we can adjust how many cache servers we need. As estimated above, we need 340 GB memory to cache 20% of daily traffic.

Which cache eviction policy would best fit our needs?

volatile — ttl , LRU etc are multiple options.

References https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904

There could be many more scaling issues , I have tried to cover the basic minimum requirements while building a scalable shortening url service.

What am I missing here ? Let me know in comments section and I'll add in!

What’s next? Subscribe Learn INQuiZitively to be the first to read my stories.

How to build a Tiny URL service that scales to billions?

Algorithm:

Design:

Scale & Intricacies of this system:

About Author

Book Search

Featured Post

Parsing Huge XLSX files ins fastest possible way

Popular Posts

Categories

How to build a Tiny URL service that scales to billions?

Algorithm:

Design:

Scale & Intricacies of this system:

About Author

You might also like

Book Search

Featured Post

Parsing Huge XLSX files ins fastest possible way

Popular Posts

Categories