GetHashCode vs. Compare vs. SortKey

by Michael S. Kaplan, published on 2005/02/18 05:36 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/18/376009.aspx


On a regular basis, someone asks the question about what is faster here of the three methods of comparing two strings (this is the latest of them, asked just yesterday):

I’m trying to find out if the gethashcode() method provided with the framework guarantees uniqueness. From as far as I know, It won’t guarantee uniqueness in all cases but it will in the case of strings because the gethascode() method was already overwritten in the string class.

Does this mean I can always use:

Somestring.gethashcode()

to compare two strings since string compares are so expansive?

Now this is not at all a dumb question, and is in fact one of the smarter ways of framing it (I'll let you, the reader, imagine some of the dumb ways one could make such an inquiry -- I have some doozies in my archives!).

Some people take the technical road on answering this question. They note that a string can be 2gb long and a hash code is a single 32-bit number -- so of course it cannot uniquely represent all numbers. They then proceed to calculate the collision probability. But their prob & stats memory is clearly less rusty than mine (or maybe they just enjoyed it more than I did), so I will stick to what I can remember well and not try to calculate odds on the chances of a duplicate here....

There are only two built-in ways of creating a hash code in Whidbey that will return results that have any kind of culturally correct meaning (and there are none prior to Whidbey1):

  1. Use the static StringComparer.Create override that takes a CultureInfo (or choose the appropriate one to use the CurrentCulture or InvariantCulture), and call the GetHashCode override that takes a string.
  2. Use CompareInfo.GetSortKey to create a sort key and then call GetHashCode on it.

From a performance standpoint, #2 seems like it would be more expensive (it is, but not for the reasons you might think). But they both do use the same technique of getting the sort key and calculating a deterministic hash on that.

They both have to run over the entire string to get the sort key and then the entire sort key to build the hash code. Now contrast that with the Compare() case which looks at two strings and returns as soon as it possibly can tell which comes first. The only time it will run over the entire string is if the two strings are almost identical, like

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyZ

vs.

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz

or something. In the real world situation of comparing two strings, only testers of comparison methods will find themselves with a large number of such strings, and that is after they specially construct them!

(The fact that more often people test comparisons by comparing "AAA" to "BBB" shows that they do not understand how comparisons work -- such string comparisons are incredibly fast since one finds the difference in the very first character!)

So, lets look at the three methods, and with our new knowledge of the relative speed of each, judge the performance of each (assumes a random distribution of strings, rather than strings constructed to prove me wrong!):

  1. Using Compare on the two strings directly;
  2. Creating sort keys for the two strings and comparing the sort keys;
  3. Creating hash codes for the two strings and comparing the hash codes, then when they are equal falling back on #1 or #2 to break the tie.

If one is doing many comparisons (like in a database) then it does not take very long for #2 to start getting better than #1 -- after all, Compare is only assured to be faster for an individual string comparison. If you are regularly doing comparisons then building an index based on sort key makes a whole lot of sense (and this is precisely how database indexes are built!).

One more note on the hash codes (if you still think it might make sense to cache that System.Int32 value for a great "first pass" on the string)....

There is no meaning that can be derived from one hash code being greater than or less than another -- it certainly does not suggest which string would sort first in a list. The only meaningful comparison of hash codes you can make is whether they are equal (in which case you must use another method to find out the order).


1 - If there is any interest, I can perhaps give an example about creating such a hash code algorithm another day.

This post brought to you by "ۍ" (U+06cd, a.k.a. ARABIC LETTER YEH WITH TAIL)


# KJK::Hyperion on 18 Feb 2005 4:08 AM:

Don't forget strings are immutable: they can be interned (i.e. added to the global pool) safely so that an identity test is a mere pointer comparison (IIRC this is true for OLE BSTRings too). The C# compiler, AFAIK, automatically interns all strings constructed from string literals

# Michael Kaplan on 18 Feb 2005 4:20 AM:

Yes, this is very true. But the number of actual dupliates is probably pretty small, even as compared to strings that would be "duplicates" by collation criteria (different normalized form, different case in a case insensitive situation, etc.).

Plus I don't think the Compare code is really checking this case, which means it has to walk the whole string to find differences -- meaning it will be the slowest possible comparison? :-)

# Tim Smith on 18 Feb 2005 6:53 AM:

This reminds me of some code I once saw.

if (strncmp (pKey ->szName, szTest, 16) == 0 && pKey ->ulHash = ulTestHash)


:O

# Michael Kaplan on 18 Feb 2005 9:25 AM:

Heh heh heh -- that code is *really* silly, especially given the binary nature or strncmp. :-)

I won't claim that the "hash code" test prior to using a sort key would be my first choice, but I guess there are times it would make sense. And the bulk of the expense would be in record insertion, not at lookup time, which could be argued well in many cases.

# AC on 18 Feb 2005 11:45 AM:

You don't like probability and statistics? What were the odds?

# Tim Smith on 18 Feb 2005 12:31 PM:

The application in question was lookup heavy and didn't care about sorting.

Doh, and I just noticed my example is missing a '='. LOL. Classic C/C++ typo. Warnings must be turned off on the MB. *geek joke*

# Michael Kaplan on 27 Feb 2005 5:42 AM:

Someone pointed out to me that the String intern stuff only is done on Equals(), not on Compare(), which explains why I was not seeing that particular optimization in comparisons....

Alexander Savin on 1 Dec 2011 4:59 PM:

Michael, with respect to your comment "The fact that more often people test comparisons by comparing "AAA" to "BBB" shows ..." I would say quite the opposite. It is much more important that a comparison function returns as fast as possible if two input strings mismatch not too far from the beginning because most comparisons in the world compare such unequal strings.

I have no alternative to CompareString but I was able to test performance of CompareStringOrdinal (with IgnoreCase = false) vs. wcscmp that I believe behave the same (binary comparison).

For strings having identical prefix not less than 20-30 characters CompareStringOrdinal beats wcscmp. For longer prefix (>100) it is twice as fast (on my i7 machine) if I pass string lengths instead of -1. (Amazingly, the documentation states that CompareString is optimized for the case of cch == -1 and some other restrictions, but no word it is not the case for CompareStringOrdinal.) However, if two strings differ in the very first character or have a short equal prefix CompareStringOrdinal is up to 3 times slower than wcscmp.

I know wcscmp is just a straight loop (even if intrinsic) and CompareStringOrdinal seems to have a "preparation preamble" and manual loop unrolling. I wonder why CompareStringOrdinal is optimized for non-typical case. I understand there still are some cases where CompareStringOrdinal would be beneficial (like sorting path names for the files located in the same deep directory).


referenced by

2011/12/06 CompareString != CompareStringOrdinal. Even if we want it to be.

go to newer or older post, or back to index or month or day