Java HashCode Collision: How uniform is it’s distribution?

Messaging systems (be it SMS, Voice, Email , Whatsapp etc ) often send similar message content to its users. Hashing technique is often used to generate various ids in these systems. Problem arises when these hashing start causing collisions, There are various ways of resolving hash collision as mentioned in the Wikipedia article, but usage of perfect hash function makes handling much easier.


Basic Hashing Function mechanism

Well this article is on java hashcode collision generated on on more or less similar message contents.

Following program will explain this :

public static void main(String[] args){
System.out.println("I have a common prefixDB".hashCode()); //generates 310265957 as hashcode
System.out.println("I have a common prefixCa".hashCode());
//generates 310265957 as hashcode
}

Seeing the above message content we can easily deduce more similiar the message content , possibilities of hash collision increase considerably.

It can be easily deduced :

If String a and String b have a common prefix and the same length — if n and the statement 31*(b[n-2] — a[n-2]) == (a[n-1] — b[n-1]) is true — it means that first and second strings have the same hashcode.

Now lets dig into Java’s HashCode implementation :

public class Employee 
{
int employeeId = -1;
long salary = -1L;
String name = null;
Department deptObj = null;
// other methods would be in here
@Override
public int hashCode()
{
final int prime = 31;
int hash = 1;
hash = hash * prime + employeeId;
hash = hash * prime + name.hashCode();
hash = hash * prime + (deptObj == null ? 0 : deptObj.hashCode());
hash = hash * prime + (int)( salary ^ (salary >>> 32));
return hash;
}
}

String class implements its own hashCode() using a product sum algorithm over
the entire text of the string. Its defined as
h(s)=∑(s[index] * 31 ^(n-1-index)) where index varies from index =0 to (n-1)
where terms are summed using Java 32-bit int addition, s[index] denotes the index’th character of the input string, and n is the length of input string s.

This approach of using 31 as prime no to generate hashcode was done as it was cheapest to calculate on a RISC machine as mentioned in the Java Bugs by Joshua Bloch(Primes of form p = (2^n-1) lend themselves to optimization of x * p = (p << n) - p which the compiler typically does.). Abstract of that is here :

The table below summarizes the performance of the various hash functions described above, for three data sets:1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).2) All of the strings in /bin/, /usr/bin/, /usr/lib/, /usr/ucb/ and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters).The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element).Webster's   Code Strings    URLs
--------- ------------ ----
Current Java Fn. 1.2509 1.2738 13.2560
P(37) [Java] 1.2508 1.2481 1.2454
P(65599) [Aho et al] 1.2490 1.2510 1.2450
P(31) [K+R] 1.2500 1.2488 1.2425
P(33) [Torek] 1.2500 1.2500 1.2453
Vo's Fn 1.2487 1.2471 1.2462
WAIS Fn 1.2497 1.2519 1.2452
Weinberger's Fn(MatPak) 6.5169 7.2142 30.6864
Weinberger's Fn(24) 1.3222 1.2791 1.9732
Weinberger's Fn(28) 1.2530 1.2506 1.2439

Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function.
I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.Josh

Researchers found that using a prime of 31 gives a better distribution to the keys, and lesser no of collisions.

Google Guava provides a much better hashing class with true distribution , check out the link here.


But in real world Hash Collisions occur quite frequently and are easy to generate , So for cases mentioned above where the message content is very similar using a single prime (31 which java uses in its hashcode Generation) will not provide uniform distribution, using two primes numbers like 17 and 37 will give you better distribution and lesser collisions.

public static void main(String[] args)
{
System.out.println(myHashCode("I have a common prefixDB"));
// generates 177503352
System.out.println(myHashCode("I have a common prefixCa"));
// generates 177503346
}
private static int myHashCode(String input)
{
final int prime = 37;
int result = 17;
for (char character : input.toCharArray())
{
result = prime * result + character;
}
return result;
}

For Instance Project Lombok has given examples where they use 59 & 43 in to generate hashes (this is part of class having name and value as member variables, Full example link is here).

@Override public int hashCode() 
{
final int PRIME = 59;
int result = 1;
result = (result*PRIME) + (this.getName() == null ? 43 : this.getName().hashCode());

result = (result*PRIME) + (this.getValue() == null ? 43 : this.getValue().hashCode());
return result;
}

Please Note: I understand hash functions are much debatable topic over the years and much better hashing functions are available which provide fewer collisions and uniform distribution.

This article is my take on hashcode generation.


What am I missing here ? Let me know in comments section and I'll add in!

What’s next? Subscribe Learn INQuiZitively to be the first to read my stories.