by Michael S. Kaplan, published on 2007/09/17 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/17/4950008.aspx
Here at Microsoft, there are a whole bunch of people who do the 20/20 program to help them lose a lot of weight. And a lot of people finish up the program a lot thinner than when they started. Of course even if you become tons thinner you are still the same person (it is almost like one of those Comcast phone service commercials everyone hates!), and everyone can still recognize you. This post in the series is dedicated to all the people I know who have gone through that program....
Prior to Unicode, the concept of the width of a character was, at least in the sense of the amount of space needed to store the character, pretty fundamental -- wide characters literally took twice as much space to store as their narrower counterparts (2 bytes vs. 1 byte).
Now over time the difference also tended to show visual distinction as well, thus
would be about twice as wide (if the same font was used) as
Though of course the magic of font linking can make the two look less distinct depending on your browser and OS language settings. :-)
Now although the two sets of characters are far apart in Unicode, no one ever expected anyone to not sort them together -- there is something fundamentally A-like [1] about both Ａ and A, and there are very good reasons to expect both of them to sort before B.
But obviously they are not completely equivalent to each other.
And thus the notion of character width was brought to sorting -- it used to affect the storage size (but no longer does), it still affects display if you use the same font, and it affects collation through the "width" piece of the sort weight.
Our sample strings are just going to be plain old ASCII compared to the fullwidth versions of the same characters (future posts will get into Korean, Japanese, and Chinese examples that also deal with width, never fear!).
You can choose to ignore width differences via the NORM_IGNOREWIDTH flag, which literally just removes the bit in the sort key that indicates the character is full width. This has the additional benefit of shrinking the sort key if wide characters are present.
Also, in some cases but not all, we move the full width versions of characters when the half-width versions are moved -- as the examples below will show. There is some debate as to whether what we do here is in fact correct, since the main difference between them in the eyes of most people is display; thus they are either not used in a language at all, or they are used in a way where the correct sorting behavior would be expected.
I tend to believe that what we do is incorrect any time we don't move them (in other words I believe moving is the correct thing to do), but no one is really complaining loudly enough at this point to make the change worthwhile. It is a similar point to whether one should move every letter with a particular base if one moves the base, and I am inclined to think we ought to, consistently.
I'll show examples where it gets weird just so you see what I am talking about....
And here are some samples:
U+ff21 Ａ 0e 02 01 01 13 01 01 00
U+ff41 ａ 0e 02 01 01 03 01 01 00
U+ff21 Ａ 0e 02 01 01 12 01 01 00 (w/NORM_IGNOREWIDTH)
U+ff21 Ａ 0e 02 01 01 03 01 01 00 (w/NORM_IGNORECASE)
U+0041 A 0e 02 01 01 12 01 01 00
Now a few things are immediately obvious -- like that WIDTH is stored in the CASE weight, but NORM_IGNORECASE has no effect on it. And also that it just adds 01 to the case weight any place one has a full width character.
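To make that observation concrete, here is a minimal Python sketch that works only on the sample bytes above; it does not call the real collation API. The field layout (weights separated by 01, key terminated by 00, case weights in the third field) and the WIDTH_BIT value are assumptions read off these samples, and split_sort_key and ignore_width are hypothetical helper names.

```python
# Sort keys copied from the samples above, as printed hex bytes.
FULLWIDTH_A = "0e 02 01 01 13 01 01 00"  # U+FF21
PLAIN_A = "0e 02 01 01 12 01 01 00"      # U+0041

# Assumption: the width marker is the low bit of the case weight,
# inferred from 0x13 (fullwidth) vs 0x12 (plain) in the samples.
WIDTH_BIT = 0x01
# Assumption: fields run unicode, diacritic, case, ... in that order.
CASE_FIELD = 2


def split_sort_key(hex_text):
    """Split a sort key into its 01-separated weight fields."""
    raw = bytes.fromhex(hex_text)
    if not raw or raw[-1] != 0x00:
        raise ValueError("expected a sort key ending in 00")
    return raw[:-1].split(b"\x01")


def ignore_width(hex_text):
    """Clear the assumed width bit in every case weight and rebuild the key."""
    fields = split_sort_key(hex_text)
    fields[CASE_FIELD] = bytes(w & ~WIDTH_BIT for w in fields[CASE_FIELD])
    return b"\x01".join(fields) + b"\x00"


if __name__ == "__main__":
    wide = bytes.fromhex(FULLWIDTH_A)
    plain = bytes.fromhex(PLAIN_A)
    # Case weight of U+FF21 as hex.
    print(split_sort_key(FULLWIDTH_A)[CASE_FIELD].hex())
    # Bytewise comparison: plain A sorts before fullwidth A.
    print(plain < wide)
    # With the width bit cleared the two keys are identical.
    print(ignore_width(FULLWIDTH_A) == plain)
```

For U+FF21 this reproduces the NORM_IGNOREWIDTH key from the samples byte for byte. It makes no attempt to trim weights afterwards, so it says nothing about how much smaller a real key would get.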
How about with different languages? Well here is a Danish example:
en-US U+ff21 U+ff21 ＡＡ 0e 02 0e 02 01 01 13 13 01 01 00
en-US U+0041 U+0041 AA 0e 02 0e 02 01 01 12 12 01 01 00
da-DK U+ff21 U+ff21 ＡＡ 0e 02 0e 02 01 01 13 13 01 01 00
da-DK U+0041 U+0041 AA 0e b1 01 03 01 1a 01 01 00
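One way to read these four keys is to compare only the leading field, the bytes before the first 01. The Python sketch below does that for the sample data; treating that leading field as the primary weights is an assumption based on the samples, and primary_weights and same_primary are hypothetical helper names.

```python
# Sort keys copied from the Danish example above, as printed hex bytes.
KEYS = {
    ("en-US", "fullwidth AA"): "0e 02 0e 02 01 01 13 13 01 01 00",
    ("en-US", "AA"): "0e 02 0e 02 01 01 12 12 01 01 00",
    ("da-DK", "fullwidth AA"): "0e 02 0e 02 01 01 13 13 01 01 00",
    ("da-DK", "AA"): "0e b1 01 03 01 1a 01 01 00",
}


def primary_weights(hex_text):
    """Return the bytes before the first 01 separator (assumed primary weights)."""
    return bytes.fromhex(hex_text).split(b"\x01", 1)[0]


def same_primary(locale):
    """True if the fullwidth and plain strings share their leading weights."""
    wide = primary_weights(KEYS[(locale, "fullwidth AA")])
    plain = primary_weights(KEYS[(locale, "AA")])
    return wide == plain


if __name__ == "__main__":
    for locale in ("en-US", "da-DK"):
        print(locale, same_primary(locale))
    # Prints:
    # en-US True
    # da-DK False
```

In en-US the two strings differ only in the case field, while in da-DK the plain AA gets the single weight 0e b1 and the fullwidth pair keeps 0e 02 0e 02, so the two no longer sort next to each other.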
And here is a Lithuanian example:
en-US U+ff29 Ｉ 0e 32 01 01 13 01 01 00
en-US U+0049 I 0e 32 01 01 12 01 01 00
en-US U+ff38 Ｘ 0e a6 01 01 13 01 01 00
en-US U+0058 X 0e a6 01 01 12 01 01 00
en-US U+ff39 Ｙ 0e a7 01 01 13 01 01 00
en-US U+0059 Y 0e a7 01 01 12 01 01 00
lt-LT U+ff29 Ｉ 0e 32 01 01 13 01 01 00
lt-LT U+0049 I 0e 32 01 01 12 01 01 00
lt-LT U+ff38 Ｘ 0e a6 01 01 13 01 01 00
lt-LT U+0058 X 0e a6 01 01 12 01 01 00
lt-LT U+ff39 Ｙ 0e 33 01 01 13 01 01 00
lt-LT U+0059 Y 0e 33 01 01 12 01 01 00
See how in the case of Lithuanian the fullwidth Y was moved right along with the regular Y (both now sit next to I), while for Danish the fullwidth AA stayed put when the regular AA was moved?
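The same comparison can be run mechanically over the Lithuanian samples. This Python sketch checks which letters changed their leading weights between en-US and lt-LT and what order a plain bytewise sort of the keys gives; the helper names primary_weights, moved, and sort_order are hypothetical, and reading the bytes before the first 01 as the primary weights is an assumption based on the samples.

```python
# Sort keys copied from the Lithuanian example above, as printed hex bytes.
KEYS = {
    "en-US": {
        "fullwidth I": "0e 32 01 01 13 01 01 00",
        "I": "0e 32 01 01 12 01 01 00",
        "fullwidth X": "0e a6 01 01 13 01 01 00",
        "X": "0e a6 01 01 12 01 01 00",
        "fullwidth Y": "0e a7 01 01 13 01 01 00",
        "Y": "0e a7 01 01 12 01 01 00",
    },
    "lt-LT": {
        "fullwidth I": "0e 32 01 01 13 01 01 00",
        "I": "0e 32 01 01 12 01 01 00",
        "fullwidth X": "0e a6 01 01 13 01 01 00",
        "X": "0e a6 01 01 12 01 01 00",
        "fullwidth Y": "0e 33 01 01 13 01 01 00",
        "Y": "0e 33 01 01 12 01 01 00",
    },
}


def primary_weights(hex_text):
    """Return the bytes before the first 01 separator (assumed primary weights)."""
    return bytes.fromhex(hex_text).split(b"\x01", 1)[0]


def moved(name):
    """True if the letter's leading weights differ between en-US and lt-LT."""
    return primary_weights(KEYS["en-US"][name]) != primary_weights(KEYS["lt-LT"][name])


def sort_order(locale):
    """Letter names ordered by bytewise comparison of their sort keys."""
    keys = KEYS[locale]
    return sorted(keys, key=lambda name: bytes.fromhex(keys[name]))


if __name__ == "__main__":
    print([name for name in KEYS["lt-LT"] if moved(name)])
    print(sort_order("en-US"))
    print(sort_order("lt-LT"))
```

On this data only the two Y entries count as moved, and in the lt-LT ordering both of them land between the I pair and the X pair, which is the "moved together" behavior the Danish fullwidth AA does not get.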
Now I know this is part of a bigger philosophical issue of what to do with letters that are generally not used in a language but which look a bit or more than a bit like ones that are. In general, when it comes to width, diacritic, or alternate case forms we have no consistent story -- some we move and some we do not.
Is it a bug? Well, maybe not, but it seems like it ought to be, since we kind of halfway do it. I doubt that the fullwidth Y moving was done at the behest of the Lithuanian Microsoft subsidiary. :-)
1 - I have been encouraged to stop using the term A-ness in public to avoid what I will henceforth refer to as the 'Beavis Effect', preferring instead the term A-like.
This post brought to you by 6 and 7 (U+0036 and U+0037, a.k.a. DIGIT SIX and DIGIT SEVEN)