Kazakh it to me, aka On being small and unique

by Michael S. Kaplan, published on 2010/06/29 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/06/29/10031096.aspx

As I have mentioned, most people forget that I haven’t owned the collation code in Windows for years now. Though I suppose the fact that I keep answering the questions anyway kind of means it’s my fault….

In my personal behavior I get people questioning it all the time (the most recent one was just days prior to this blog being posted) but that isn’t something I am blogging about now. Besides, that was just flirting and she started it (plus she suggested the complainer join the current century so I’m probably okay on this one).

It had to do with the fact that in Kazakh the following pairs of letters were not being considered equal in a case-insensitive comparison:

And I found the following Kazakh exception table entries, used in every version of Windows since XP:

with the following relevant default table entries (overridden entries marked appropriately):

Note the clear case-pair relationship in the default table that is essentially removed in the Kazakh exception table, by making each of the four lowercase letters a single new letter with a unique alphabetic weight. (Interestingly, ignoring the secondary/diacritic weight – NORM_IGNORENONSPACE – will cause all four lowercase characters to be treated as equal and all four uppercase characters to be treated as equal!)
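To make that a little more concrete, here is a toy model in Python of one possible reading of the paragraph above. Everything in it is hypothetical: the weight numbers, the placeholder names (`lower1`, `UPPER1`, and so on), and the `key`/`equal` helpers are invented for illustration and are not the actual Windows table values or APIs. Only the shape follows the description: in the default table each pair differs by case weight alone, while in the exception table the lowercase letters share one alphabetic weight and the uppercase letters share another.

```python
from typing import NamedTuple


class Weight(NamedTuple):
    alphabetic: int  # primary weight
    diacritic: int   # secondary weight (what NORM_IGNORENONSPACE skips)
    case: int        # tertiary weight (what a case-insensitive compare skips)


LOWER, UPPER = 0, 1

# Hypothetical default table: each case pair shares its alphabetic and
# diacritic weights and differs only in the case weight.
DEFAULT = {}
for i in range(1, 5):
    DEFAULT[f"lower{i}"] = Weight(100 + i, 0, LOWER)
    DEFAULT[f"UPPER{i}"] = Weight(100 + i, 0, UPPER)

# Hypothetical exception table: the lowercase letters share one alphabetic
# weight, the uppercase letters share a different one, and the letters
# within each group are told apart only by the diacritic weight.
KAZAKH = {}
for i in range(1, 5):
    KAZAKH[f"lower{i}"] = Weight(200, i, LOWER)
    KAZAKH[f"UPPER{i}"] = Weight(201, i, UPPER)


def key(table, ch, ignore_case=False, ignore_nonspace=False):
    """Build a comparison key, zeroing out any weight level being ignored."""
    w = table[ch]
    return (
        w.alphabetic,
        0 if ignore_nonspace else w.diacritic,
        0 if ignore_case else w.case,
    )


def equal(table, a, b, **flags):
    return key(table, a, **flags) == key(table, b, **flags)


print(equal(DEFAULT, "lower1", "UPPER1", ignore_case=True))
print(equal(KAZAKH, "lower1", "UPPER1", ignore_case=True))
print(equal(KAZAKH, "lower1", "lower4", ignore_nonspace=True))
print(equal(KAZAKH, "UPPER1", "UPPER4", ignore_nonspace=True))
```

In this model, ignoring case is enough to equate a pair under the default table but not under the exception table, because the pair no longer shares an alphabetic weight.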

In any case, the data and the comment explain the behavior that was being seen, though not the reason for it.

is a strange comment by pretty much most definitions of strangeness if you ask me, since it does not really tell you the source of the information, its veracity, or its reliability. How to tell why such a thing would be expected without the source becomes rather complicated (I have been unable to verify it as of yet, though I have only looked for the same amount of time that a journalist claims someone was not available for comment, and I never believe they tried hard enough, so I probably haven’t tried hard enough yet either!).

I often talk about how CASE != COLLATION (or CASE <> COLLATION for you VB types!) but it is usually in the other direction; these EIGHT characters may be the only four case pairs whose relationship exists in the casing table but not in the collation table (only for Kazakh). If it is really true maybe there should be a Kazakh-specific entry in the linguistic casing tables too!
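The casing half of that claim is easy to check outside Windows. The sketch below uses Python, whose `str.lower()` and `str.upper()` follow the Unicode character database rather than any Windows table, so it says nothing about collation. It assumes that the four code points quoted in the comments on this post (U+04BC through U+04BF) are among the eight characters in question; the `PAIRS` list and the `is_case_pair` helper are names made up for this example.

```python
import unicodedata

# (uppercase, lowercase) code points as quoted in the comments on this post.
# Treating them as two of the four case pairs is an assumption.
PAIRS = [(0x04BC, 0x04BD), (0x04BE, 0x04BF)]


def is_case_pair(upper_cp, lower_cp):
    """True if the two code points map to each other under simple casing."""
    upper, lower = chr(upper_cp), chr(lower_cp)
    return upper.lower() == lower and lower.upper() == upper


for upper_cp, lower_cp in PAIRS:
    print(f"U+{upper_cp:04X} {unicodedata.name(chr(upper_cp))}")
    print(f"U+{lower_cp:04X} {unicodedata.name(chr(lower_cp))}")
    print("  case pair in Unicode data:", is_case_pair(upper_cp, lower_cp))
```

The names printed are the current Unicode names, which may differ from the comments in the Windows tables.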

The entries exist in the RTM version of XP and later, which will of course narrow down the “who” but that won’t really help either. I mean like from 10 years ago? Who are we kidding?

But is it inaccurate? Or is there really something in Kazakh that was meant to work this way all along – as far-fetched as that may seem?

I mean no one has complained up until now in the last ten years, sure. But how often would the problem really get noticed if it is in fact a problem?

So, any Kazakh speakers have light to shed here on whether there is a reason for this?

My question is - Why are they referring to:

0x04bd [Cyrillic Small Ie Hook]
0x04bc [Cyrillic Capital Ie Hook]
0x04bf [Cyrillic Small Ie Hook Ogonek]
0x04be [Cyrillic Capital Ie Hook Ogonek]

as Ie-s [How do I pluralise this?!] with hooks, when they have always [to the best of my knowledge] been referred to as Abkhasian Che-s with/without a descender.

My guess is that the difference between YE and YO with or without hook/ogonek/descender is only interesting at the beginning of a proper name.

Not sure I understand, John -- the "differences" are the same for both lowercase and uppercase, just no longer the same between them.

Those comments had to come from somewhere, though. They are, in fact, the Unicode 1.0 names of those characters [which also had to come from somewhere, I suppose, but the story is probably a lot less interesting than that of the "Caron"].

Remembering blogs like Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask), the original sorting tables were done so early that they were likely based on pre-Unicode 1.0. The later table for Kazakh (added in XP) probably just copied the code points and comments, just changing weights....

Hello, Michael!

As a native Kazakh speaker, I have to say that there is no reason for this weird behavior. There are no special rules for those exceptions in the Kazakh language. So, uppercase and lowercase ie-s and io-s officially correspond to each other. Likewise, there is no reason to compare uppercase and lowercase ie-s in a different way.

Io is used very rarely in the Kazakh language, mostly in words of foreign origin.

Maybe it's too late to comment on this post, but I hope the answer makes sense here.

Unfortunately, I can't figure out what I meant either. Musta been a bad day at the skunk works.
