by Michael S. Kaplan, published on 2007/09/13 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/13/4889199.aspx
Previous posts in this series:
I had someone ask me what the A&P stood for in the titles of the posts in the series. This made me feel good since I don't think I had any previous proof that people were paying attention to them! :-)
A&P stands for Anatomy and Physiology, just a holdover from my days when my career had more of a medical aim....
So, let's move into today's post, shall we?
The topic is going to be about those items that fill up the case weight section of the sort key.
Our sample strings will be:
yyyyyyyy
yýÿỳỵỷỹʏ
YÝŸỲỴỶỸʏ
You might wonder Y I am using these strings, but it will become clear soon. The first row is obvious, but the second and third rows are the LATIN SMALL LETTER and LATIN CAPITAL LETTER versions of:
Y, Y WITH ACUTE, Y WITH DIAERESIS, Y WITH GRAVE, Y WITH DOT BELOW, Y WITH HOOK ABOVE, Y WITH TILDE, and then a LATIN LETTER SMALL CAPITAL Y at the end of both, a.k.a.
U+0079 U+00fd U+00ff U+1ef3 U+1ef5 U+1ef7 U+1ef9 U+028f
U+0059 U+00dd U+0178 U+1ef2 U+1ef4 U+1ef6 U+1ef8 U+028f
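If you want to double-check which characters are in each string, here is a minimal Python sketch that lists the code points and character names. The `describe` helper and the variable names are made up for illustration. The strings are built from escapes so that nothing depends on how the text was normalized on its way to you.

```python
import unicodedata

# The second and third sample strings, spelled out by code point.
small = "\u0079\u00fd\u00ff\u1ef3\u1ef5\u1ef7\u1ef9\u028f"
capital = "\u0059\u00dd\u0178\u1ef2\u1ef4\u1ef6\u1ef8\u028f"


def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04x}", unicodedata.name(ch)) for ch in text]


for label, text in (("small", small), ("capital", capital)):
    print(label)
    for code_point, name in describe(text):
        print(f"  {code_point}  {name}")
```

Both strings end with the same U+028f, which is the point of including it: it is neither the SMALL nor the CAPITAL form.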
I am sure you are as curious as me what the sort keys look like, right? :-)
The case weights, when there are any, are the bytes between the second and third 01 separators:
0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 01 01 01 01 00
0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 01 02 0e 13 0f 58 44 19 01 02 02 02 02 02 02 02 0a 01 01 00
0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 01 02 0e 13 0f 58 44 19 01 12 12 12 12 12 12 12 0a 01 01 00
So, similar to diacritic weight, the cost is one byte of case weight any time a character has case weight. Of course, just as with diacritics, placeholder minimal 02 bytes are needed for the characters that come before it.
You may notice that the SMALL letters have no weight (just that minimal 02 byte), that CAPITAL letters have a 12 byte (0x10 on top of the minimal 02), and, most interestingly, that the SMALL CAPITAL letter has a 0a byte (0x08 on top of the minimal 02).
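To pick the case weights out of the three sort keys above without squinting, here is a small Python sketch. It assumes, based only on the keys shown in this post, that 01 never appears inside a weight, so splitting on 01 is safe, and that the third field is the case weight section. `split_sort_key`, `case_weights`, and `CASE_FIELD` are names invented for this example, not part of any Windows API.

```python
# Sort keys copied from the post, as hex text.
KEY_PLAIN = "0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 01 01 01 01 00"
KEY_SMALL = ("0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 "
             "01 02 0e 13 0f 58 44 19 01 02 02 02 02 02 02 02 0a 01 01 00")
KEY_CAPITAL = ("0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 0e a7 "
               "01 02 0e 13 0f 58 44 19 01 12 12 12 12 12 12 12 0a 01 01 00")

CASE_FIELD = 2  # assumption: third 01-delimited field holds the case weights


def split_sort_key(hex_text):
    """Split a hex-dumped sort key into its 01-delimited fields."""
    key = bytes.fromhex(hex_text)
    if key.endswith(b"\x00"):
        key = key[:-1]  # drop the 00 terminator
    return [list(field) for field in key.split(b"\x01")]


def case_weights(hex_text):
    """Return the case weight bytes of a hex-dumped sort key."""
    return split_sort_key(hex_text)[CASE_FIELD]


for name, key in (("plain", KEY_PLAIN), ("small", KEY_SMALL), ("capital", KEY_CAPITAL)):
    print(name, [f"{w:02x}" for w in case_weights(key)])
```

The first key has an empty case section because every character carries only the minimal weight, so nothing needs to be written.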
Makes you wonder what other weights might be there for other characters, doesn't it?
Let's try something fun, like:
+₊﹢⁺+
a.k.a. U+002b (PLUS SIGN), U+208a (SUBSCRIPT PLUS SIGN), U+fe62 (SMALL PLUS SIGN), U+207a (SUPERSCRIPT PLUS SIGN), and U+002b (PLUS SIGN).
The sort key will look like this (again on Vista):
08 03 08 03 08 03 08 03 08 03 01 01 02 04 08 0e 01 01 00
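So there we have it, reading the case weight section in order: PLUS SIGN carries just the minimal 02, SUBSCRIPT PLUS SIGN carries 04, SMALL PLUS SIGN carries 08, and SUPERSCRIPT PLUS SIGN carries 0e. (The final PLUS SIGN would only add a trailing minimal 02, so nothing is emitted for it.)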
But luckily, there is probably a good compression story for these pieces of the sort key since in practice these weights will be pretty rare in text.
Another interesting case -- what if you pass NORM_IGNORECASE intending to strip some of this excess weight?
08 03 08 03 08 03 08 03 08 03 01 01 02 04 00 06 01 01 00
Aha, these aren't all case weights, but it looks like some of the weight is case related....
Plus, does anyone notice that unexpected 00 in there? Looks like it really was supposed to be 0a and not 08, now doesn't it?
Is there an NLS tester in the house who would like to put in a bug? :-)
If you look at the values, you'll see that the case weight seems to have a bit of a bitwise component in it -- we'll talk more about this later....
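As a rough illustration of that bitwise flavor, the Python sketch below prints the bit patterns of the case weight bytes seen in this post and checks one guess: that NORM_IGNORECASE clears the 0x08 bit. That guess is only a hypothesis that happens to fit the two plus-sign keys above; it is not documented behavior, and `clear_bit_3` is a made-up helper name.

```python
# Case weight bytes for the plus-sign string, copied from the two keys above.
CASE_DEFAULT = [0x02, 0x04, 0x08, 0x0e]     # no flags
CASE_IGNORECASE = [0x02, 0x04, 0x00, 0x06]  # with NORM_IGNORECASE


def clear_bit_3(weight):
    """Hypothetical model of NORM_IGNORECASE: clear the 0x08 bit."""
    return weight & ~0x08


# Show the bit patterns of every case weight byte that appears in the post.
for weight in (0x02, 0x04, 0x08, 0x0a, 0x0e, 0x12):
    print(f"{weight:02x}  {weight:08b}")

# Does the guess reproduce the NORM_IGNORECASE key?
model_matches = [clear_bit_3(w) for w in CASE_DEFAULT] == CASE_IGNORECASE
print("model matches:", model_matches)

# Under the same guess, a 0a weight would have fallen back to the minimal 02
# rather than to 00.
print(f"0a becomes {clear_bit_3(0x0a):02x}")
```

Two keys are not much evidence, so treat this as a way to look at the numbers rather than as a description of what the function does internally.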
Anyway, that's probably enough for this episode. Stay tuned tomorrow for us to start making things a little more complicated!
This post brought to you by 3 (U+0033, DIGIT THREE)
# Zooba on 13 Sep 2007 5:42 AM:
Hi Michael,
Just letting you know that I'm very much enjoying this series so far. It's all completely new to me (like most of your blog) and very interesting. Right up there with some of Raymond Chen's series (I hope that doesn't jinx you...)
# Kyle M Cowan on 13 Sep 2007 8:39 AM:
Michael,
You may confuse people with this post slightly, because you go in and out of how you display numbers, and I think you might have also confused yourself a little.
"You may notice that the SMALL letters have no weight (just that minimal 02 weight), and that CAPITAL letters have a 12 weight, and that most interestingly, that SMALL CAPITAL letter has a 10 weight."
It should actually be 2 weight for small, 18 weight for capitals, and 10 weight for small capitals, or simply left in hex might have been an easier read. Referring to the 0x02 0x12 and 0x0a.
"Plus, does anyone notice that unexpected 00 in there? Looks like it really was supposed to be 10 and not 08, now doesn't it?"
If you submit the bug for this, you may want to clarify that it's 10 in Decimal, or 0x0A in hex. It looks like NORM_IGNORECASE toggles the '8' bit off. (or in more complicated text : Subtracts 8 if the number is greater than or equal to 8).
# Michael Dunn_ on 13 Sep 2007 1:01 PM:
Just to clear things up for myself (not nitpicking too much I hope): I think you swapped the 2nd and 3rd lines in the first group of sort keys. And sometimes you refer to ten as "10" and sometimes "0a"
# Michael S. Kaplan on 13 Sep 2007 2:06 PM:
No worries, that was not nitpicking -- and now fixed!
# Maurits [MSFT] on 14 Sep 2007 12:35 PM:
> CAPITAL letters have a 0c weight
? According to the sort key they have an 0x10 weight.
# Michael S. Kaplan on 14 Sep 2007 1:18 PM:
Indeed, you are correct (some over-correcting for a spot that had a decimal 10 weight)....
This post will be correct, some day!