What is Cache Stampede?

Cache stampede is often not considered a big problem for small-scale systems, but it is a common headache for developers whose applications have to serve millions of requests. Most applications tend to ignore this problem, and it often brings the whole system down in a matter of seconds.

Server-side caching is usually done to keep unchanged (or least frequently changing) data close at hand, easing system load and minimizing expensive database read operations.


HTTP Request → Application Cache (local to the application) → Caching Server (Memcached/Redis) → DB (RDBMS/NoSQL)

In this setup, on an application cache miss the system checks the caching server, and if the value is not found there, it hits the database. Caching in the local application cache or in the caching server is always limited to a period, meaning the data is put into the cache with an expiry time.
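In code, the lookup chain might look roughly like this (a minimal sketch; the class name, readFromDB helper, and the in-memory maps standing in for the local and remote caches are illustrative assumptions, not a specific client API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TieredLookup
{
    // Illustrative stand-ins: a real system would back these with an
    // in-process cache and a Memcached/Redis client respectively.
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final Map<String, String> remoteCache = new ConcurrentHashMap<>();

    public String get(String key)
    {
        String value = localCache.get(key);    // 1. application cache
        if (value == null)
        {
            value = remoteCache.get(key);      // 2. caching server
            if (value == null)
            {
                value = readFromDB(key);       // 3. expensive DB read
                remoteCache.put(key, value);   // (stored with a TTL in practice)
            }
            localCache.put(key, value);        // warm the faster tier
        }
        return value;
    }

    private String readFromDB(String key)
    {
        return "value-for-" + key; // hypothetical placeholder
    }
}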

Let's take an example: assume traffic of 1000 rps (requests per second), and that the local cache as well as the caching server expire the data at the same moment. We then have 1000 requests hitting the database simultaneously; this herd effect may result in congestion collapse of the application system.

How can we mitigate this Cache Stampede problem?

Let me list the possible ways of solving this problem:

  1. Locking.
  2. External Computation.
  3. Internal Computation using Probabilistic Early Expiration.

Let's look at these approaches one by one:

Locking

One request worker acquires the lock for the cache key, recomputes the value, and puts it in the cache while serving the response, so that subsequent request workers get the value from the cache.

But the developer has to implement a strategy for the remaining workers that were not able to acquire the lock:

  • Either wait until the value is recomputed (this may lead to starvation and, in turn, performance degradation through hung requests), or
  • Keep a stale item in the cache to be used while the new value is recomputed (see the sketch after this list).
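
A minimal sketch of the second option, serving stale while one worker revalidates (the Entry shape, TTL value, and getFromDB helper are illustrative assumptions):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class StaleWhileRevalidate
{
    // Each entry carries a soft expiry; the value stays readable after it.
    static class Entry
    {
        final String value;
        final long softExpiryMillis;

        Entry(String value, long softExpiryMillis)
        {
            this.value = value;
            this.softExpiryMillis = softExpiryMillis;
        }
    }

    private static final long TTL_MILLIS = 60_000; // illustrative TTL
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    public String get(String key)
    {
        Entry e = cache.get(key);
        if (e != null && System.currentTimeMillis() < e.softExpiryMillis)
        {
            return e.value; // still fresh, serve directly
        }
        // Entry is stale or missing: only the worker that wins the
        // non-blocking tryLock recomputes; everyone else moves on.
        if (lock.tryLock())
        {
            try
            {
                String fresh = getFromDB(key);
                cache.put(key, new Entry(fresh,
                        System.currentTimeMillis() + TTL_MILLIS));
                return fresh;
            }
            finally
            {
                lock.unlock();
            }
        }
        // Losers serve the stale value; on a cold miss there is nothing
        // stale to serve, so they fall through to the database.
        return e != null ? e.value : getFromDB(key);
    }

    private String getFromDB(String key)
    {
        return "value-for-" + key; // hypothetical placeholder
    }
}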

If implemented properly, locking can prevent stampedes altogether, but the main drawback is that the developer has to take care of tuning the time-to-live for the lock, handling race conditions, and so on. Sample code is as follows:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class MyCache<K, V>
{
    private final Lock lock = new ReentrantLock();
    // ConcurrentHashMap makes the lock-free read below thread-safe.
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    public V get(K key)
    {
        // Fast path: serve from the cache without taking the lock.
        V value = cache.get(key);
        if (value != null)
        {
            return value;
        }
        // Slow path: only the worker holding the lock recomputes.
        lock.lock();
        try
        {
            // Double-check: another worker may have repopulated the
            // cache while we were waiting for the lock.
            value = cache.get(key);
            if (value != null)
            {
                return value;
            }
            value = getFromDB(key);
            put(key, value);
        }
        finally
        {
            lock.unlock();
        }
        return value;
    }

    private V getFromDB(K key)
    {
        // Do the database call(s) and return the value for key K.
        return null; // placeholder for the real value V
    }

    private boolean put(K key, V value)
    {
        boolean success = true;
        try
        {
            cache.put(key, value);
        }
        catch (Exception e)
        {
            success = false;
            // Log the error message.
        }
        return success;
    }
}
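
Note that this sketch serializes every cache miss behind one global lock, so a miss on one key blocks misses on all other keys. On the JVM, ConcurrentHashMap.computeIfAbsent already gives roughly per-key locking; a minimal alternative inside the same class (reusing the hypothetical getFromDB helper):

// computeIfAbsent blocks only workers asking for the same key, so a single
// worker recomputes while misses on other keys proceed in parallel.
V value = cache.computeIfAbsent(key, k -> getFromDB(k));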

External Computation

Use an external process (for example, a cron job) to recompute the value; request workers never compute the value themselves.

  • The disadvantage is memory bloat, as there is always the possibility that a bulk of the data held in the cache is never read.
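
A minimal sketch of this approach using a scheduled in-process refresher (the key list, refresh interval, and loadFromDB helper are illustrative assumptions; a real deployment might instead use a cron job hitting a refresh endpoint):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExternalRefresher
{
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Hypothetical: the fixed set of keys this job keeps warm.
    private final List<String> hotKeys = List.of("configA", "configB");

    public void start()
    {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Recompute every key well before its TTL would elapse, so
        // request workers only ever read, never compute.
        scheduler.scheduleAtFixedRate(this::refreshAll, 0, 30, TimeUnit.SECONDS);
    }

    private void refreshAll()
    {
        for (String key : hotKeys)
        {
            cache.put(key, loadFromDB(key)); // overwrite, never evict first
        }
    }

    private String loadFromDB(String key)
    {
        return "value-for-" + key; // hypothetical placeholder
    }
}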

Internal Computation using Probabilistic Early Expiration

As the expiration time nears, one of the workers volunteers to prefetch the recomputed value without evicting the old cache value. Just before expiration, this volunteer worker writes the new value and extends the TTL (time to live) of the cache key.
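
A minimal sketch of this idea, following the well-known probabilistic early recomputation formula (recompute when now - delta * beta * log(rand()) crosses the expiry time); the Entry shape, BETA value, and method names here are illustrative assumptions:

import java.util.concurrent.ThreadLocalRandom;

public class ProbabilisticExpiry
{
    // A cached entry remembers how long its last recomputation took
    // (delta) and the absolute time at which it expires.
    static class Entry<V>
    {
        final V value;
        final long deltaMillis;  // duration of the last recomputation
        final long expiryMillis; // absolute expiry timestamp

        Entry(V value, long deltaMillis, long expiryMillis)
        {
            this.value = value;
            this.deltaMillis = deltaMillis;
            this.expiryMillis = expiryMillis;
        }
    }

    // beta > 1 makes early recomputation more aggressive; beta = 1 is
    // the usual default.
    private static final double BETA = 1.0;

    // Returns true if this worker should volunteer to recompute now,
    // even though the entry has not expired yet. The log of a uniform
    // (0,1) sample is negative, so subtracting it shifts "now" forward;
    // the closer we are to expiry (and the longer the recomputation
    // takes), the more likely the shifted time crosses the expiry.
    static <V> boolean shouldRecompute(Entry<V> e, long nowMillis)
    {
        double gap = e.deltaMillis * BETA
                * Math.log(ThreadLocalRandom.current().nextDouble());
        return nowMillis - gap >= e.expiryMillis;
    }
}

Because each worker draws its own random sample, the early trigger fires only rarely and only close to expiry, so typically a single volunteer recomputes while everyone else keeps serving the cached value.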



What am I missing here? Let me know in the comments section and I'll add it in!
