I need my SPACE, symbolically speaking

by Michael S. Kaplan, published on 2006/03/25 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/25/560416.aspx


(No, this is not a post about anyone breaking up with me and telling me that they need their space)

In Microsoft's implementation of collation, we have several different categories of characters, and rules for dealing with each category.

One of the interesting categories is the SYMBOL category. All of the miscellaneous odd symbols show up here. And they all come before the various letters and numbers.

Of course, most of the symbols do not really have any linguistic meaning that would foster a set of rules for how to sort them. So, as I pointed out in "Not all characters are created equal: take SYMBOLS, for example", there must be some order within the symbols "block", and the order is usually arbitrary.

And that gets us back on topic, to U+0020 (a.k.a. SPACE). It is a symbol, too.

"But wait, Michael!" you may be crying now. "The space actually represents an absence of symbols, or numbers, or letters, or anything. So it should not be a symbol!"

Well, this gets us kind of existential, which collation usually tries to avoid. It is based on expected behavior. So let's try some thought experiments to see where expected behavior leads us.

If you were comparing the strings "Microsoft" and "Micro soft", would you expect them to be equal?

Probably not.

But if SPACE were given no weight in collation, then they would always be identical. And what is more, the name Ray Mond would show up in the Exchange global address book after Raye. And all kinds of other weirdnesses.

So, it has to have some weight.
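
To make that thought experiment concrete, here is a minimal sketch in Python. It is a toy model, not the actual Windows collation tables: the two key functions (`key_space_ignored` and `key_space_light`) are hypothetical names invented for illustration, and letters are simply compared by case-folded code point.

```python
# Toy model only: these are NOT Microsoft's collation tables.
# Letters compare case-insensitively; SPACE is either dropped
# (zero weight) or kept, in which case it sorts before every letter.

def key_space_ignored(name: str) -> str:
    """Hypothetical key in which SPACE carries no weight at all."""
    return name.casefold().replace(" ", "")


def key_space_light(name: str) -> str:
    """Hypothetical key in which SPACE is lighter than any letter."""
    # U+0020 is below 'a'..'z' in code point order, so keeping it is enough.
    return name.casefold()


names = ["Raye", "Ray Mond"]

# Order of the two names under each toy key.
print(sorted(names, key=key_space_ignored))
print(sorted(names, key=key_space_light))

# Do "Microsoft" and "Micro soft" collapse into the same key?
print(key_space_ignored("Microsoft") == key_space_ignored("Micro soft"))
print(key_space_light("Microsoft") == key_space_light("Micro soft"))
```

With the zero-weight key, "Microsoft" and "Micro soft" collapse into the same key and Ray Mond lands after Raye. Once SPACE has even the lightest weight, both problems go away.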

As perhaps a psychic nod to those who are philosophically against treating SPACE as a symbol, it is the very lightest of the true symbols. And from a behavior standpoint everything works, as long as you do not pass the NORM_IGNORESYMBOLS flag to CompareString or LCMapString.

This last paragraph may make some people wonder what I meant when I mentioned "true symbols" -- what are the symbols that are not true to us? Am I actually talking about relationships at this point, even though I said I was not?

I did not change my mind on the subject, I promise. :-)  I am simply talking about a subcategory of symbols that is treated specially and weighs even less than the space -- the punctuation. These are the characters affected by word sort vs. string sort decisions (as I discuss here), and they will weigh either less than the regular symbols (in the case of string sort) or less than even the difference between uppercase and lowercase letters (in the case of word sort).

Let's see some of this in action. If we look at the sort keys for several of these situations, what is happening underneath becomes more obvious:

Microsoft

0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

Micro-soft (word sort)

0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 80 1B 06 82 00

Micro-soft (string sort)

0E 51 0E 32 0E 0A 0E 8A 0E 7C 06 82 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

Micro soft

0E 51 0E 32 0E 0A 0E 8A 0E 7C 07 02 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

Microsoft / Micro-soft / Micro soft (NORM_IGNORESYMBOLS)

0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

If you ignore symbols, they are all the same. Otherwise, the specific issues with the space, the hyphen, and word/string sort come into play.
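
For anyone who wants to poke at those bytes, the following Python sketch compares the sort keys listed above one byte at a time, which is how sort keys are generally meant to be compared. The hex values are copied from this post; the helper names (`to_bytes`, `first_difference`) and the dictionary labels are assumptions made for illustration, and nothing here calls the actual Windows APIs.

```python
# Sort keys copied from the listing in this post (hex, space separated).
KEYS = {
    "Microsoft":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 "
        "01 01 12 01 01 00",
    "Micro-soft (word sort)":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 "
        "01 01 12 01 01 80 1B 06 82 00",
    "Micro-soft (string sort)":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 06 82 0E 91 0E 7C 0E 23 0E 99 "
        "01 01 12 01 01 00",
    "Micro soft":
        "0E 51 0E 32 0E 0A 0E 8A 0E 7C 07 02 0E 91 0E 7C 0E 23 0E 99 "
        "01 01 12 01 01 00",
}

# The single key listed for all three strings under NORM_IGNORESYMBOLS.
IGNORE_SYMBOLS_KEY = (
    "0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 "
    "01 01 12 01 01 00"
)


def to_bytes(hex_text: str) -> bytes:
    """Turn a space-separated hex dump into a bytes object."""
    return bytes.fromhex(hex_text)


def first_difference(a: bytes, b: bytes):
    """Index of the first differing byte, or None if the keys are equal."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    return None if len(a) == len(b) else min(len(a), len(b))


# Python compares bytes objects lexicographically, byte by byte.
ordered = sorted(KEYS, key=lambda label: to_bytes(KEYS[label]))
for label in ordered:
    print(label)

# Where does each key first diverge from the key for "Microsoft"?
base = to_bytes(KEYS["Microsoft"])
for label, hex_text in KEYS.items():
    print(label, first_difference(base, to_bytes(hex_text)))

# With symbols ignored, the key is byte-for-byte the "Microsoft" key.
print(to_bytes(IGNORE_SYMBOLS_KEY) == base)
```

Run as written, the keys sort as Micro-soft (string sort), Micro soft, Microsoft, Micro-soft (word sort). The string-sort hyphen (06 82) is lighter than the space (07 02), while the word-sort hyphen only shows up at the very end of the key, after everything else has been compared.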

Perhaps SPACE could have been a part of some bold new category that is not a symbol, but things are as they are -- and as it stands this returns intuitive results in most cases....

 

This post brought to you by " " (U+0020, a.k.a. SPACE)

 

 

 


# Maurits [MSFT] on 26 Mar 2006 1:47 AM:

> some bold new category that is not a symbol

"whitespace," perhaps ;-)

# Michael S. Kaplan on 26 Mar 2006 1:54 AM:

Ah, but that gets us into a whole new thing -- how do you describe the expectations of collation of text with "whitespace" as opposed to text with "symbols", really?

In the end, there is not all that much that is different. What was that expression? "A difference that makes no difference, make no difference?" :-)

# Mihai on 27 Mar 2006 12:29 PM:

Some guys in marketing should be reading this post.
Then we might stop seeing products called "C#" and ".NET" that mess up all the search engines :-)

# Michael S. Kaplan on 27 Mar 2006 2:19 PM:

The person who I believe came up with both of those things is no longer at Microsoft (he left shortly after the new names were announced). The previous name (NGWS) was not so much better, though at least more searchable!

# Susan Morehouse on 30 May 2008 1:26 PM:

I am trying to set up "myspace.com" and the program continually tells me that the password doesn't match.


