by Michael S. Kaplan, published on 2006/03/07 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/07/545097.aspx
(computerized apologies to Ray Charles for the title of the post!)
Will anyone forget when I asked the question "What do you get when you combine a base character with a buttload of diacritics?"
I was of course talking about fonts there. This time I am going to take a slightly different approach, and talk about collation.
I will give the string, the code points, and the sort key. We'll start simply, with one letter:
e
U+0065
0e 21 01 01 01 01 00
Now we will go with something a little more complicated (the difference from above marked in RED):
ẽ
U+1ebd
0e 21 01 19 01 01 01 00
or its alter ego in normalization form D:
ẽ
Hmmm... let's look at another diacritic:
ê
U+00ea
ê
U+0065 U+0302
0e 21 01 12 01 01 01 00
Ok, and now for the kicker:
ễ
U+1ec5
0e 21 01 29 01 01 01 00
ễ
U+0065 U+0302 U+0303
0e 21 01 29 01 01 01 00
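The two strings above are the same text in normalization forms C and D, which is why the same sort key shows up twice. Here is a small sketch using Python's standard `unicodedata` module to confirm the decomposition. It says nothing about how Windows builds the sort key; it only shows that the two inputs are canonically equivalent. The helper name `code_points` is made up for this illustration.

```python
import unicodedata

def code_points(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04x}" for ch in s)

precomposed = "\u1ec5"  # LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
decomposed = unicodedata.normalize("NFD", precomposed)

print(code_points(precomposed))  # U+1ec5
print(code_points(decomposed))   # U+0065 U+0302 U+0303

# Recomposing with NFC gets back to the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```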
But wait -- where did the 29 come from? I mean the first one had no DW (diacritic weight), and the next two had 19 and 12, respectively.
In previous posts about sort keys I talked about how the minimal weight is 2, but that this weight would only be seen when it was needed as a placeholder, e.g. in the following string:
eễ
U+0065 U+1ec5
0e 21 0e 21 01 02 29 01 01 01 00
So, if you take that (sometimes invisible) 2 that is always there on the 'e' and combine it with the 17 on the tilde and the 10 on the circumflex, you get 29 (all of these values are hexadecimal, like the sort key bytes).
Easy.
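For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The three weights are the ones quoted in this post; the helper name `combine_diacritic_weights` is hypothetical and is not anything from the actual collation code.

```python
# Diacritic weights as quoted in the post (hexadecimal).
MIN_WEIGHT = 0x02   # the "sometimes invisible" weight on the base letter
TILDE = 0x17        # U+0303 COMBINING TILDE
CIRCUMFLEX = 0x10   # U+0302 COMBINING CIRCUMFLEX ACCENT

def combine_diacritic_weights(*mark_weights):
    """Hypothetical helper: add each mark's weight onto the base minimum."""
    return MIN_WEIGHT + sum(mark_weights)

print(f"{combine_diacritic_weights(TILDE):02x}")              # 19
print(f"{combine_diacritic_weights(CIRCUMFLEX):02x}")         # 12
print(f"{combine_diacritic_weights(CIRCUMFLEX, TILDE):02x}")  # 29
```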
Now what happens when you get that buttload of diacritics? Let's add them one at a time:
U+0065
0e 21 01 01 01 01 00
U+0065 U+0300
0e 21 01 0f 01 01 01 00
U+0065 U+0300 U+0301
0e 21 01 1b 01 01 01 00
U+0065 U+0300 U+0301 U+0302
0e 21 01 2b 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303
0e 21 01 42 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304
0e 21 01 57 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305
0e 21 01 95 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306
0e 21 01 a8 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307
0e 21 01 b6 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308
0e 21 01 c7 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309
0e 21 01 06 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a
0e 21 01 1e 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a U+030b
0e 21 01 39 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a U+030b U+030c
0e 21 01 4b 01 01 01 00
Uh oh! Eventually we wrap....
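To see exactly where the wrap happens, here is a rough sketch that works backwards from the diacritic weight bytes listed above. It assumes the weights are purely additive and that no single mark adds 0x100 or more, so the per-mark increments it infers are a guess from the observed bytes and not values taken from any sorting table. The function name `find_first_overflow` is made up for this illustration.

```python
# Diacritic weight byte from each sort key listed above, in order:
# e + U+0300, then + U+0301, and so on up to + U+030c.
OBSERVED = [0x0f, 0x1b, 0x2b, 0x42, 0x57, 0x95, 0xa8,
            0xb6, 0xc7, 0x06, 0x1e, 0x39, 0x4b]
MIN_WEIGHT = 0x02  # weight on the bare 'e', per the post

def find_first_overflow(observed, first_mark=0x0300):
    """Return (mark, untruncated_sum) for the first mark whose sum exceeds one byte."""
    total = MIN_WEIGHT
    previous = MIN_WEIGHT
    for offset, byte in enumerate(observed):
        # Inferred increment for this mark (assumption: simple addition mod 256).
        increment = (byte - previous) % 256
        total += increment
        previous = byte
        if total > 0xff:
            return first_mark + offset, total
    return None

mark, total = find_first_overflow(OBSERVED)
print(f"U+{mark:04x}: sum {total:#x}, stored byte {total % 256:02x}")
```

Under those assumptions the first overflow lands on U+0309, where 0xc7 plus an inferred 0x3f gives 0x106 and only the 06 is kept.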
We only have one byte of space to store that diacritic weight (any more than a byte would run into the next character's byte), and when we run out there were really only three choices:
The problem with #2 is that it pretty sharply limits what one could do in a potentially unpredictable way, and the problem with #1 is that all such strings would be equal. With option #3 there is a good chance that there will be a difference between the strings being compared, though it will unfortunately sometimes make a string that is clearly greater than another string look like it is less than it -- a cure that may be worse than the disease....
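The "greater string looks lesser" effect can be shown with two of the sort keys listed earlier. This sketch assumes sort keys are compared byte by byte as unsigned values, which is how Python compares `bytes` objects; the variable names are just for illustration.

```python
# Sort keys copied from the listing above.
nine_marks = bytes.fromhex("0e 21 01 c7 01 01 01 00")  # e + U+0300..U+0308
ten_marks = bytes.fromhex("0e 21 01 06 01 01 01 00")   # e + U+0300..U+0309

# The string with one more mark wrapped past 0xff, so its key compares lower.
print(ten_marks < nine_marks)  # True

# Its key even compares lower than the key for e + U+0300 alone.
one_mark = bytes.fromhex("0e 21 01 0f 01 01 01 00")
print(ten_marks < one_mark)    # True
```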
Well, I won't argue whether one of the other choices might have been better; we are kind of stuck with it now (there are technically a few cases that wrap that are less theoretical than the case above, lest you try to dismiss the example as being a bit too unrealistic!).
But at least that answers the question about what happens when you try to collate a buttload of diacritics....
This post brought to you by "e" (U+0065, a.k.a. LATIN SMALL LETTER E)