by Michael S. Kaplan, published on 2006/03/25 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/25/560416.aspx
(No, this is not a post about anyone breaking up with me and telling me that they need their space)
In Microsoft's implementation of collation, we have several different categories of characters, and rules for dealing with each category.
One of the interesting categories is the SYMBOL category. All of the miscellaneous odd symbols show up here. And they all come before the various letters and numbers.
Of course, most of the symbols do not really have any linguistic meaning that would foster a set of rules for how to sort them. So as I pointed out in "Not all characters are created equal: take SYMBOLS, for example", there must be some order within the symbols "block", and the order is usually arbitrary.
And that gets us back on topic, to U+0020 (a.k.a. SPACE). It is a symbol, too.
"But wait, Michael!" you may be crying now. "The space actually represents an absence of symbols, or numbers, or letters, or anything. So it should not be a symbol!"
Well, this gets us kind of existential, which collation usually tries to avoid. It is based on expected behavior. So let's try some thought experiments to see where expected behavior leads us.
If you were comparing the strings "Microsoft" and "Micro soft", would you expect them to be equal?
Probably not.
But if SPACE were given no weight in collation, then they would always compare as equal. And what is more, the name "Ray Mond" would show up in the Exchange global address book after "Raye". And all kinds of other weirdnesses.
So, it has to have some weight.
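To make the thought experiment concrete, here is a minimal sketch in Python. It is a toy model, not the Windows collation algorithm: the two key functions and their names are made up for illustration, and "weight" here is nothing more than code point order after case folding.

```python
def key_ignoring_space(name):
    """Toy key (hypothetical): SPACE carries no weight at all."""
    return name.replace(" ", "").casefold()


def key_keeping_space(name):
    """Toy key (hypothetical): SPACE is kept and sorts ahead of every letter."""
    return name.casefold()


names = ["Raye", "Ray Mond"]

# With a weightless SPACE the two spellings cannot be told apart.
print(key_ignoring_space("Microsoft") == key_ignoring_space("Micro soft"))

# With a weightless SPACE, "Ray Mond" lands after "Raye".
print(sorted(names, key=key_ignoring_space))

# Once SPACE has some (small) weight, "Ray Mond" comes first.
print(sorted(names, key=key_keeping_space))
```

The first line printed is True, and the two sorted lists come out in opposite orders, which is the whole argument in three lines of output.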
As perhaps a psychic nod to those who are philosophically against treating SPACE as a symbol, it is the very lightest of the true symbols. And from a behavior standpoint everything works, as long as you do not pass the NORM_IGNORESYMBOLS flag to CompareString or LCMapString.
This last paragraph may make some people wonder what I meant when I mentioned "true symbols" -- what are the symbols that are not true to us? Am I actually talking about relationships at this point, even though I said I was not?
I did not change my mind on the subject, I promise. :-) I am simply talking about a subcategory of symbols that are treated specially which weigh even less than the space -- the punctuation. They are the ones affected by word sort vs. string sort decisions (as I discuss here), and will weigh either less than the regular symbols (in the case of string sort) or less than even the difference between uppercase and lowercase letters (in the case of word sort).
Let's see some of this in action. If we look at the sort keys for several of these situations, what is happening underneath becomes more obvious:
Microsoft
0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00
Micro-soft (word sort)
0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 80 1B 06 82 00
Micro-soft (string sort)
0E 51 0E 32 0E 0A 0E 8A 0E 7C 06 82 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00
Micro soft
0E 51 0E 32 0E 0A 0E 8A 0E 7C 07 02 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00
Microsoft / Micro-soft / Micro soft (NORM_IGNORESYMBOLS)
0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00
If you ignore symbols, they are all the same; otherwise, the specific issues with the space, the hyphen, and word/string sort come into play.
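Since sort keys are meant to be compared byte by byte, the ordering can be checked without calling any Windows API. The sketch below is Python working only on the hex dumps listed above; the dictionary labels and helper names are my own, and it assumes nothing about what the individual bytes mean beyond their position.

```python
# Sort keys copied verbatim from the listing above; the labels are illustrative.
KEYS_HEX = {
    "Microsoft":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00",
    "Micro-soft (word sort)":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 80 1B 06 82 00",
    "Micro-soft (string sort)":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 06 82 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00",
    "Micro soft":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 07 02 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00",
}
KEYS = {label: bytes.fromhex(text) for label, text in KEYS_HEX.items()}


def in_key_order(labels):
    """Sort labels by their sort keys, compared bytewise (memcmp-style)."""
    return sorted(labels, key=KEYS.__getitem__)


def first_difference(a, b):
    """Offset of the first byte where two keys differ, or None if they are equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    return None if len(a) == len(b) else min(len(a), len(b))


word = in_key_order(["Microsoft", "Micro-soft (word sort)", "Micro soft"])
string = in_key_order(["Microsoft", "Micro-soft (string sort)", "Micro soft"])
print(word)
print(string)

# Where does the hyphenated key first part ways with plain "Microsoft"?
base = KEYS["Microsoft"]
print(first_difference(base, KEYS["Micro-soft (word sort)"]))
print(first_difference(base, KEYS["Micro-soft (string sort)"]))
```

With these keys, word sort gives "Micro soft", "Microsoft", "Micro-soft", while string sort gives "Micro-soft", "Micro soft", "Microsoft". The first differing byte tells the same story: in string sort the hyphen shows up at offset 10, in the middle of the letters, whereas in word sort the keys only diverge at offset 23, at the very tail.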
Perhaps SPACE could have been a part of some bold new category that is not a symbol, but things are as they are -- and as it stands this returns intuitive results in most cases....
This post brought to you by " " (U+0020, a.k.a. SPACE)
# Susan Morehouse on 30 May 2008 1:26 PM:
I am trying to set up "myspace.com" and the program continually tells me that the password doesn't match.