by Michael S. Kaplan, published on 2010/06/29 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/06/29/10031096.aspx
So, the other day I was asked a strange question about behavior.
In sorting, not my personal behavior!
As I have mentioned, most people forget that I haven’t owned the collation code in Windows for years now. Though I suppose the fact that I keep answering the questions anyway kind of means it’s my fault….
In my personal behavior I get people questioning it all the time (the most recent one was just days prior to this blog being posted) but that isn’t something I am blogging about now. Besides, that was just flirting and she started it (plus she suggested the complainer join the current century so I’m probably okay on this one).
Anyway, back to the topic.
It had to do with the fact that in Kazakh the following pairs of letters were not being considered equal in a case-insensitive comparison:
Ё (U+0401, aka CYRILLIC CAPITAL LETTER IO)
ё (U+0451, aka CYRILLIC SMALL LETTER IO)
Е (U+0415, aka CYRILLIC CAPITAL LETTER IE)
е (U+0435, aka CYRILLIC SMALL LETTER IE)
Intrigued, I looked into it a bit.
And I found the following Kazakh exception table entries, used in every version of Windows since XP:
0x0435 16 25 2 2 ;Cyrillic Small Ie (The small versions of Ie come after the capital as unique AW)
0x04bd 16 25 5 2 ;Cyrillic Small Ie Hook
0x0451 16 25 19 2 ;Cyrillic Small Io
0x04bf 16 25 23 2 ;Cyrillic Small Ie Hook Ogonek
with the following relevant default table entries (the four lowercase entries are the ones the exception table overrides):
0x0435 16 24 2 2 ;Cyrillic Small Ie
0x0415 16 24 2 18 ;Cyrillic Capital Ie
0x04bd 16 24 5 2 ;Cyrillic Small Ie Hook
0x04bc 16 24 5 18 ;Cyrillic Capital Ie Hook
0x0451 16 24 19 2 ;Cyrillic Small Io
0x0401 16 24 19 18 ;Cyrillic Capital Io
0x04bf 16 24 23 2 ;Cyrillic Small Ie Hook Ogonek
0x04be 16 24 23 18 ;Cyrillic Capital Ie Hook Ogonek
Note the clear case pair relationship in the default table, which the Kazakh exception table essentially removes by moving the four lowercase letters to an alphabetic weight of their own (25 rather than 24). Interestingly, ignoring the secondary/diacritic weight with NORM_IGNORENONSPACE will cause all four lowercase characters to be treated as equal to each other, and all four uppercase characters to be treated as equal to each other!
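To make the effect of those weights concrete, here is a minimal sketch in Python. It is not the Windows implementation. It assumes the four numeric columns in the entries above are script member, alphabetic weight, diacritic weight, and case weight, and that a single-character comparison boils down to comparing whichever of those weights are not being ignored. The names `key`, `equal`, `ignore_case`, and `ignore_nonspace` are made up for illustration; they only stand in for the NORM_IGNORECASE and NORM_IGNORENONSPACE flags.

```python
# Weights copied from the table entries quoted in the post.
# ASSUMPTION: columns are (script member, alphabetic weight,
# diacritic weight, case weight).
DEFAULT = {
    0x0435: (16, 24, 2, 2),    # Cyrillic Small Ie
    0x0415: (16, 24, 2, 18),   # Cyrillic Capital Ie
    0x04BD: (16, 24, 5, 2),    # Cyrillic Small Ie Hook
    0x04BC: (16, 24, 5, 18),   # Cyrillic Capital Ie Hook
    0x0451: (16, 24, 19, 2),   # Cyrillic Small Io
    0x0401: (16, 24, 19, 18),  # Cyrillic Capital Io
    0x04BF: (16, 24, 23, 2),   # Cyrillic Small Ie Hook Ogonek
    0x04BE: (16, 24, 23, 18),  # Cyrillic Capital Ie Hook Ogonek
}

KAZAKH_EXCEPTIONS = {
    0x0435: (16, 25, 2, 2),
    0x04BD: (16, 25, 5, 2),
    0x0451: (16, 25, 19, 2),
    0x04BF: (16, 25, 23, 2),
}

# Exception entries override the default ones; everything else is inherited.
KAZAKH = {**DEFAULT, **KAZAKH_EXCEPTIONS}


def key(table, ch, ignore_case=False, ignore_nonspace=False):
    """Hypothetical single-character comparison key (illustration only)."""
    sm, aw, dw, cw = table[ord(ch)]
    return (
        sm,
        aw,
        None if ignore_nonspace else dw,  # drop the diacritic weight
        None if ignore_case else cw,      # drop the case weight
    )


def equal(table, a, b, **flags):
    """True when the two characters get the same key under the given flags."""
    return key(table, a, **flags) == key(table, b, **flags)


if __name__ == "__main__":
    # Capital Ie vs. small Ie, case-insensitive
    print(equal(DEFAULT, "\u0415", "\u0435", ignore_case=True))  # True
    print(equal(KAZAKH, "\u0415", "\u0435", ignore_case=True))   # False

    # All four lowercase letters collapse once diacritic weights are ignored
    lower = ["\u0435", "\u04bd", "\u0451", "\u04bf"]
    print(len({key(KAZAKH, c, ignore_nonspace=True) for c in lower}) == 1)  # True
```

Under those assumptions, dropping the case weight is enough to make Е and е equal with the default weights, but not with the Kazakh ones, because the alphabetic weights (24 versus 25) still differ.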
In any case, the data and the comment explain the behavior that was being seen, though not the reason for it.
(The small versions of Ie come after the capital as unique AW)
is a strange comment by pretty much any definition of strangeness, if you ask me, since it does not really tell you the source of the information or how reliable it is. Without the source, it is hard to tell why such a thing would be expected. (I have been unable to verify it as of yet, though I have only looked for about as long as a journalist does before claiming someone was not available for comment. I never believe they tried hard enough, so I probably haven’t tried hard enough yet either!)
I often talk about how CASE != COLLATION (or CASE <> COLLATION for you VB types!) but it is usually in the other direction; these EIGHT characters may be the only four case pairs whose relationship exists in the casing table but not in the collation table (only for Kazakh). If it is really true maybe there should be a Kazakh-specific entry in the linguistic casing tables too!
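For the casing side of that claim, a quick check in Python is sketched below. It uses Python's built-in Unicode case mappings, not the Windows casing table, so it only suggests what a casing table would be expected to say about these eight characters. The helper name `is_case_pair` is made up for illustration.

```python
# The four case pairs discussed in the post, as (uppercase, lowercase).
PAIRS = [
    ("\u0415", "\u0435"),  # Ie
    ("\u04bc", "\u04bd"),  # "Ie Hook" in the table comments
    ("\u0401", "\u0451"),  # Io
    ("\u04be", "\u04bf"),  # "Ie Hook Ogonek" in the table comments
]


def is_case_pair(upper, lower):
    """True when Python's Unicode case mapping links the two characters both ways."""
    return upper.lower() == lower and lower.upper() == upper


if __name__ == "__main__":
    for upper, lower in PAIRS:
        print(f"U+{ord(upper):04X} / U+{ord(lower):04X}:", is_case_pair(upper, lower))
```

All four pairs round-trip through the case mapping, which is the relationship the Kazakh exception weights do not preserve.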
The entries exist in the RTM version of XP and later, which will of course narrow down the “who” but that won’t really help either. I mean like from 10 years ago? Who are we kidding?
But is it inaccurate? Or is there really something in Kazakh that has called for this all along, as farfetched as that may seem?
I mean no one has complained up until now in the last ten years, sure. But how often would the problem really get noticed if it is in fact a problem?
So, any Kazakh speakers have light to shed here on whether there is a reason for this?
Evan on 29 Jun 2010 7:40 AM:
My question is - Why are they referring to:
0x04bd [Cyrillic Small Ie Hook]
0x04bc [Cyrillic Capital Ie Hook]
0x04bf [Cyrillic Small Ie Hook Ogonek]
0x04be [Cyrillic Capital Ie Hook Ogonek]
as Ie-s [How do I pluralise this?!] with hooks, when they have always [to the best of my knowledge] been referred to as Abkhasian Che-s with/without a descender.
Michael S. Kaplan on 29 Jun 2010 7:53 AM:
Those are just comments, they aren't the official character names...
John Cowan on 29 Jun 2010 8:34 AM:
My guess is that the difference between YE and YO with or without hook/ogonek/descender is only interesting at the beginning of a proper name.
Michael S. Kaplan on 29 Jun 2010 12:30 PM:
Not sure I understand, John -- the "differences" are the same for both lowercase and uppercase, just no longer the same between them.
Random832 on 30 Jun 2010 5:10 AM:
Those comments had to come from somewhere, though. They are, in fact, the Unicode 1.0 names of those characters [which also had to come from somewhere, I suppose, but the story is probably a lot less interesting than that of the "Caron"].
Michael S. Kaplan on 30 Jun 2010 10:57 AM:
Remembering blogs like Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask), the original sorting tables were done so early that they were likely based on pre-Unicode 1.0. The later table for Kazakh (added in XP) probably just copied the code points and comments, just changing weights....
MAKKAM on 14 Nov 2010 4:14 AM:
As a native Kazakh speaker I have to say that there isn't any reason for this weird behavior. There are no special rules for those exceptions in the Kazakh language. So, uppercase and lowercase ie-s and io-s officially correspond to each other. Likewise, there is no reason to compare uppercase and lowercase ie-s in a different way.
Io is used very rarely in the Kazakh language, mostly in words of foreign origin.
Maybe it's too late to comment on this post, but I hope the answer makes sense here.
John Cowan on 26 Nov 2010 10:48 PM:
Unfortunately, I can't figure out what I meant either. Musta been a bad day at the skunk works.