Not all characters are created equal: take SYMBOLS, for example

by Michael S. Kaplan, published on 2005/01/19 14:02 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/19/356280.aspx


Although collation on Windows gives a weight to every single code point [1], there are times when this does not really have an intuitive meaning.

What I mean to say is that there are times when the question "does string1 equal string2?" may have meaning but "does string1 come before string2?" really does not. Sometimes the order is entirely arbitrary and has no meaning other than the fact that the architecture requires it.

Take symbols, for example, like the Miscellaneous Symbols block in Unicode. I mean, in the greater scheme of things, does it really matter if U+2601 (CLOUD) comes before or after U+2602 (UMBRELLA)? One could say that we are planners (so we have the umbrella in case it rains one day, so the umbrella comes first) or spontaneous (so when we see the clouds we put on our best Christopher Robin voice and say 'tut tut, looks like rain' and go to buy the umbrella, meaning clouds come first).

Or maybe I should take U+2620, U+2621, U+2622, and U+2623 and call it the Coldplay block in honor of their song Warning Sign (which is one of the two Coldplay songs I am willing to play until my ears bleed, and no, the other song is not Yellow, thank you very much!).

Or if I am sane (no promises about me!) I realize that there are so many of these characters to handle that I just say "screw it". It really does not matter; the key is that clouds are not rain. And that SKULL AND CROSSBONES may not be CAUTION SIGN and is probably not RADIOACTIVE SIGN or BIOHAZARD SIGN. So they are put in code point order and the dreams of silly Easter eggs like the Coldplay block stay unrealized (except unintentionally when they happen to be in that order in Unicode).

Because the order really does not matter.
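As a small illustration of "distinct but arbitrarily ordered", the following Python sketch sorts the four symbols from the would-be Coldplay block by code point. It is only a sketch: Python's default string ordering is used here as a stand-in for "code point order", and it says nothing about the actual weights Windows assigns.

```python
import unicodedata

# The four symbols mentioned above, deliberately shuffled.
symbols = ["\u2623", "\u2620", "\u2622", "\u2621"]

# Python compares single-character strings by code point,
# so sorted() gives plain code point order.
ordered = sorted(symbols)
names = [unicodedata.name(ch) for ch in ordered]

for ch, name in zip(ordered, names):
    print(f"U+{ord(ch):04X} {name}")

# Equality is the part that carries meaning: no two of them are the same.
all_distinct = len(set(symbols)) == len(symbols)
```

The order that comes out is just the order Unicode happened to assign; the only thing a user is likely to care about is that the four characters never compare as equal.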

Of course that is not always true of all symbols. If you are using CompareString, LCMapString, or comparisons with the CompareInfo class, you have a choice between string sorting and word sorting. This is documented nowhere more clearly than in the Platform SDK, in winnls.h:

//
//  Sorting Flags.
//
//    WORD Sort:    culturally correct sort
//                  hyphen and apostrophe are special cased
//                  example: "coop" and "co-op" will sort together in a list
//
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        co-op     <-------  hyphen (punctuation)
//                        cork
//                        went
//                        were
//                        we're     <-------  apostrophe (punctuation)
//
//
//    STRING Sort:  hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <-------  hyphen (punctuation)
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we're     <-------  apostrophe (punctuation)
//                        went
//                        were
//
#define SORT_STRINGSORT           0x00001000  // use string sort method

So you could think of SORT_STRINGSORT or CompareOptions.StringSort as the "treat hyphen and apostrophe like the other symbols" flag. SORT_STRINGSORT may not be very descriptive but SORT_HYPHENANDAPOSTROPHEAREACTUALLYSYMBOLSDAMMIT is too much of a pain to type. And there is no SORT_WORDSORT or CompareOptions.WordSort, since that is the default anyway if you do not pass such a flag there.
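To make the two orderings in the header comment concrete, here is a toy model in Python. It is only a sketch and not how CompareString actually works: the names word_sort_key and string_sort_key are made up for illustration, and the tie-break rule (the string with fewer hyphens and apostrophes goes first) is an assumption chosen so that the sample list comes out in the order the comment shows. Plain code point order stands in for string sort, which happens to work for this lowercase ASCII sample.

```python
# Toy model only: this is NOT the Windows collation algorithm.
SPECIAL = "-'"  # the two characters the header comment says are special-cased


def string_sort_key(s):
    # Hyphen and apostrophe are treated like any other symbol. For this
    # ASCII sample, code point order already puts symbols before letters.
    return s


def word_sort_key(s):
    # Compare first with hyphen and apostrophe removed, then break ties
    # by putting the string with fewer special characters first (assumption).
    stripped = "".join(ch for ch in s if ch not in SPECIAL)
    specials = sum(ch in SPECIAL for ch in s)
    return (stripped, specials, s)


words = ["coop", "we're", "co-op", "went", "cork",
         "co_op", "were", "comb", "coat"]

word_sorted = sorted(words, key=word_sort_key)
string_sorted = sorted(words, key=string_sort_key)

print("word sort:  ", word_sorted)
print("string sort:", string_sorted)
```

Sorting the same nine words both ways reproduces the two lists from winnls.h, with "co-op" landing next to "coop" only under the word sort key.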

But if you pass the NORM_IGNORESYMBOLS or the CompareOptions.IgnoreSymbols flags to their respective APIs, the hyphen and apostrophe get ignored along with all of the other symbols. So I guess they were not too special after all. Whether they are U+002d or U+ff0c or U+0027 or whatever, they are all symbols. All easily ignorable.
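The effect of an ignore-symbols comparison such as CompareOptions.IgnoreSymbols can be sketched in Python as well. This is a rough model under a stated assumption: it treats every character whose Unicode general category starts with P (punctuation) or S (symbol) as ignorable, which is not necessarily the same set Windows uses. The helper names strip_symbols and equal_ignoring_symbols are hypothetical.

```python
import unicodedata


def strip_symbols(s):
    # Assumption: "symbol" means Unicode general category P* or S*.
    # Windows keeps its own tables, so the real set may differ.
    return "".join(ch for ch in s
                   if unicodedata.category(ch)[0] not in ("P", "S"))


def equal_ignoring_symbols(a, b):
    # Two strings are "equal" once the ignorable characters are dropped.
    return strip_symbols(a) == strip_symbols(b)


print(equal_ignoring_symbols("co-op", "coop"))   # U+002D dropped
print(equal_ignoring_symbols("we're", "were"))   # U+0027 dropped
print(equal_ignoring_symbols("a\uff0cb", "ab"))  # U+FF0C dropped
print(equal_ignoring_symbols("C++", "C#"))       # both reduce to "C"
```

The last call hints at why ignoring symbols wholesale is awkward for search: under this model "C++" and "C#" become indistinguishable.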

Almost all search engines do something like this, though of course most are smart enough to include the index of the plus sign for "C++" and there was a huge rush to index the pound sign when "C#" suddenly became something to search for. But heaven help those who try to search for the backslash like RTF tags -- thank goodness most of them are pretty unique anyway, so we can find most of them without the special indexing of the symbols. :-)

But one thing is definitely true -- there is a serious LETTER bias around these parts. If character types could be considered a protected class under anti-discrimination law, I think symbols would have a pretty compelling case against a lot of companies with deep pockets....

 

[1] Well, ignoring the stuff I pointed out in my 'The jury will give this string no weight' post, of course!

 

This post sponsored by "," (U+ff0c, FULLWIDTH COMMA)
A character whose class action appeal to be considered punctuation for sorting purposes (commas v. Microsoft) will be heard by the 13th Circuit Court of Appeals on the 12th of Never.

