by Michael S. Kaplan, published on 2005/02/18 05:36 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/18/376009.aspx
On a regular basis, someone asks which of the three methods of comparing two strings is fastest (this is the latest such question, asked just yesterday):
I'm trying to find out if the GetHashCode() method provided with the framework guarantees uniqueness. As far as I know, it won't guarantee uniqueness in all cases, but it will in the case of strings, because GetHashCode() is already overridden in the String class.
Does this mean I can always use:
someString.GetHashCode()
to compare two strings, since string compares are so expensive?
Now this is not at all a dumb question, and is in fact one of the smarter ways of framing it (I'll let you, the reader, imagine some of the dumb ways one could make such an inquiry -- I have some doozies in my archives!).
Some people take the technical road on answering this question. They note that a string can be 2GB long while a hash code is a single 32-bit number -- so of course it cannot uniquely represent all strings. They then proceed to calculate the collision probability. But their prob & stats memory is clearly less rusty than mine (or maybe they just enjoyed the subject more than I did), so I will stick to what I remember well and not try to calculate the odds of a duplicate here....
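To make that pigeonhole argument concrete without any probability theory, here is a small brute-force sketch (the loop bound and the generated strings are arbitrary choices of mine) that hunts for two distinct strings sharing a hash code:

    using System;
    using System.Collections.Generic;

    class HashCollisionDemo
    {
        static void Main()
        {
            // Map each hash code to the first string that produced it.
            Dictionary<int, string> seen = new Dictionary<int, string>();

            // With ~200,000 distinct strings and only 2^32 possible hash
            // codes, the birthday bound makes a collision very likely.
            for (int i = 0; i < 200000; i++)
            {
                string s = "string #" + i;
                int hash = s.GetHashCode();

                string other;
                if (seen.TryGetValue(hash, out other))
                {
                    Console.WriteLine("Collision: \"{0}\" and \"{1}\" both hash to {2}",
                                      other, s, hash);
                    return;
                }
                seen.Add(hash, s);
            }
            Console.WriteLine("No collision in this sample; try a larger one.");
        }
    }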
There are only two built-in ways of creating a hash code in Whidbey that will return results that have any kind of culturally correct meaning (and there are none prior to Whidbey1):
1. The GetHashCode(string) method on the StringComparer class;
2. The GetHashCode() method on the SortKey class (the key itself coming from CompareInfo.GetSortKey).
From a performance standpoint, #2 seems like it would be more expensive (it is, but not for the reasons you might think). But they both do use the same technique of getting the sort key and calculating a deterministic hash on that.
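As a rough sketch of the two approaches (the sample string and the choice of the current culture are mine, purely for illustration):

    using System;
    using System.Globalization;

    class CulturallyCorrectHashes
    {
        static void Main()
        {
            string s = "résumé";
            CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;

            // #1: StringComparer hashes the string directly; the sort key
            //     is computed internally and never surfaces as an object.
            int hash1 = StringComparer.CurrentCulture.GetHashCode(s);

            // #2: Build the SortKey explicitly -- which allocates the key's
            //     byte data -- and then hash the resulting object.
            SortKey key = ci.GetSortKey(s);
            int hash2 = key.GetHashCode();

            Console.WriteLine("StringComparer hash: {0}", hash1);
            Console.WriteLine("SortKey hash:        {0}", hash2);
        }
    }

The point is not that the two values match each other; it is that each is derived from the culturally correct sort key rather than from the raw characters.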
They both have to run over the entire string to get the sort key, and then over the entire sort key to build the hash code. Now contrast that with the Compare() case, which looks at two strings and returns as soon as it can tell which comes first. The only time it will run over the entire string is when the two strings are almost identical, like
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyZ
vs.
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
or something. In real-world string comparisons, only testers of comparison methods will find themselves with a large number of such strings -- and that is after they specially construct them!
(The fact that people more often test comparisons by comparing "AAA" to "BBB" shows that they do not understand how comparisons work -- such string comparisons are incredibly fast, since the difference is found in the very first character!)
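A crude micro-benchmark sketch of that early-exit behavior (string lengths and iteration counts are arbitrary, and the absolute numbers will vary from machine to machine):

    using System;
    using System.Diagnostics;

    class EarlyExitDemo
    {
        static long Time(string a, string b)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < 10000; i++)
                String.Compare(a, b, StringComparison.CurrentCulture);
            sw.Stop();
            return sw.ElapsedTicks;
        }

        static void Main()
        {
            // Two pairs of equal-length strings: one pair differs in the
            // very first character, the other only in the very last.
            string filler = new string('x', 10000);
            string earlyA = "A" + filler, earlyB = "B" + filler;
            string lateA = filler + "A", lateB = filler + "B";

            Console.WriteLine("First char differs: {0} ticks", Time(earlyA, earlyB));
            Console.WriteLine("Last char differs:  {0} ticks", Time(lateA, lateB));
        }
    }

The first pair should finish dramatically faster, since Compare can bail out at the very first character.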
So, let's look at the three methods and, armed with our new knowledge of the relative speed of each, judge their performance (this assumes a random distribution of strings, rather than strings constructed to prove me wrong!):
1. Compare() -- fastest for a single comparison, since it can return the moment a difference is found;
2. Sort keys -- expensive to create, since the entire string must be processed, but once created, comparisons between them are simple byte comparisons;
3. Hash codes -- just as expensive to create, and able to establish only inequality, never order.
If one is doing many comparisons (as in a database), then it does not take very long for #2 to start beating #1 -- after all, Compare is only assured to be faster for an individual comparison. If you are regularly doing comparisons, then building an index based on sort keys makes a whole lot of sense (and that is precisely how database indexes are built!).
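A minimal sketch of that amortization (the word list is arbitrary, and SortKeyComparer is a little helper of my own, not a framework class):

    using System;
    using System.Collections.Generic;
    using System.Globalization;

    class SortKeyIndexDemo
    {
        // Orders SortKey objects via the static, byte-wise SortKey.Compare.
        class SortKeyComparer : IComparer<SortKey>
        {
            public int Compare(SortKey x, SortKey y)
            {
                return SortKey.Compare(x, y);
            }
        }

        static void Main()
        {
            CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;
            string[] words = { "llama", "chair", "lint", "churro", "cot" };

            // Pay the full cost of building each sort key exactly once...
            SortKey[] keys = new SortKey[words.Length];
            for (int i = 0; i < words.Length; i++)
                keys[i] = ci.GetSortKey(words[i]);

            // ...after which every comparison made during the sort is a
            // cheap byte-wise comparison of the precomputed keys.
            Array.Sort(keys, words, new SortKeyComparer());

            foreach (string w in words)
                Console.WriteLine(w);
        }
    }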
One more note on the hash codes (if you still think it might make sense to cache that System.Int32 value for a great "first pass" on the string)....
There is no meaning that can be derived from one hash code being greater than or less than another -- it certainly does not suggest which string would sort first in a list. The only meaningful comparison of hash codes you can make is whether they are equal (in which case you must use another method to find out the order).
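For illustration, a sketch of what such a cached "first pass" could look like; HashedString and EqualsTo are hypothetical names of mine, not framework APIs:

    using System;

    // Illustrative only: a string paired with its cached, culturally
    // correct hash code, used strictly as an equality pre-filter.
    class HashedString
    {
        static readonly StringComparer comparer = StringComparer.CurrentCulture;

        public readonly string Value;
        public readonly int Hash;   // the cached System.Int32 "first pass"

        public HashedString(string value)
        {
            Value = value;
            Hash = comparer.GetHashCode(value);
        }

        public bool EqualsTo(HashedString other)
        {
            // Different hashes prove the strings are unequal...
            if (Hash != other.Hash)
                return false;

            // ...but equal hashes prove nothing by themselves, so fall
            // back to a real comparison. (And never order by Hash; for
            // ordering you must use Compare, not < or >.)
            return comparer.Compare(Value, other.Value) == 0;
        }
    }

The cached hash buys you only a fast "not equal"; equality and ordering still require a real comparison.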
1 - If there is any interest, I can perhaps give an example of creating such a hash code algorithm another day.
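In the meantime, one possible shape such an algorithm could take -- my own sketch following the post's recipe of "get the sort key, then hash it deterministically", not the promised example: run any deterministic hash (FNV-1a here) over the sort key's byte data.

    using System.Globalization;

    static class SortKeyHash
    {
        // Hash the sort key's bytes with FNV-1a, so that strings which
        // sort identically in the given culture also hash identically.
        public static int GetCulturalHashCode(string s, CompareInfo ci)
        {
            byte[] keyData = ci.GetSortKey(s).KeyData;

            unchecked
            {
                uint hash = 2166136261u;          // FNV offset basis
                foreach (byte b in keyData)
                {
                    hash ^= b;
                    hash *= 16777619u;            // FNV prime
                }
                return (int)hash;
            }
        }
    }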
This post brought to you by "ۍ" (U+06cd, a.k.a. ARABIC LETTER YEH WITH TAIL)
Alexander Savin on 1 Dec 2011 4:59 PM:
Michael, with respect to your comment "The fact that more often people test comparisons by comparing 'AAA' to 'BBB' shows ...", I would say quite the opposite: it is much more important that a comparison function return as quickly as possible when the two input strings differ not far from the beginning, because most comparisons in the world are between such unequal strings.
I have no alternative to CompareString, but I was able to test the performance of CompareStringOrdinal (with IgnoreCase = false) vs. wcscmp, which I believe behaves the same way (binary comparison).
For strings sharing an identical prefix of at least 20-30 characters, CompareStringOrdinal beats wcscmp. For longer prefixes (>100 characters) it is twice as fast (on my i7 machine) if I pass explicit string lengths instead of -1. (Amazingly, the documentation states that CompareString is optimized for the case of cch == -1 under some other restrictions, yet says nothing about this not applying to CompareStringOrdinal.) However, if the two strings differ in the very first character, or share only a short prefix, CompareStringOrdinal is up to 3 times slower than wcscmp.
I know wcscmp is just a straight loop (even if intrinsic), while CompareStringOrdinal seems to have a "preparation preamble" and manual loop unrolling. I wonder why CompareStringOrdinal is optimized for the non-typical case. I understand there are still cases where CompareStringOrdinal would be beneficial (like sorting path names for files located in the same deep directory).