by Michael S. Kaplan, published on 2005/09/12 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/09/12/463483.aspx
A few days ago, in response to my post about the silly NLS question, reader Gabe posted the following comment:
In all honesty, I can imagine somebody reading your blog, seeing you expound on various parts of Cyrillic, Chinese, and Sanskrit, and thinking that you actually know the languages written in those scripts. Or more likely, they might think that those are languages.
Much the way computer novices think that somebody who uses keyboard shortcuts must be a computer expert, it's pretty easy to assume that you must know some language because you know intricate details of its sort order. Hell, I once had an old Russian lady convinced that I was a native speaker because I was able to use all six words of the Russian vocabulary I know appropriately in conversation (with an accent I learned from movies).
And he makes an excellent point here. For the most part I do not speak these languages, but I have learned a lot about their scripts, the Unicode properties of their characters (when I say characters I mean in both the user sense and the code unit sense), and their various orderings across many locales. And although I am not one of the linguists who does much of the actual work of reverse engineering dictionaries and sorted word lists to determine what the collations are, I do work with them and have been the one checking in a lot of their work and the code that makes use of it.
Plus occasionally I have done a few (though by no means all!) of the orderings myself in XP SP2 and Vista, for languages using Han (Hanzi/Kanji/Hanja), Hangul, Arabic, and other parts of the Unicode code space that someone with mere delusions of linguistic aptitude like myself can handle, with data assistance from others. :-)
So after I read Gabe's response, I looked back in email and found the following question sent to me on the contact link during a period when the suggestion box was temporarily unavailable (problems on the MSDN Blogs site), by a developer named Susan:
I can't make the suggestion box submit my question, so hopefully you will not mind me contacting you directly.
I was wondering how your team actually decides what weight you assign in the default table that you mention in your post at http://blogs.msdn.com/michkap/archive/2004/12/08/278170.aspx.
It seems like there are times that the results do not match a particular language. That may be just that I do not know all the languages that the default table supports. But I think it would be an interesting post to geeks like me for you to explain how the decision is made!!!
Well, anything for a fellow geek, Susan.... And sorry I took so long to get to the question! :-)
To start with, it is an understatement to call it a decision -- it is actually a huge series of decisions, made over a long period of time, and the reasons are many and varied:
All of this has been done over the last 10+ years by many different people (seven that I know of, including myself!). It is definitely a situation where you are guaranteed to be consistent with some prior additions and inconsistent with some others, if you know what I mean.
So, let us look at some languages. There are the Cyrillic characters used in Russian:
А а Б б В в Г г Д д Е е Ё ё Ж ж З з И и Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ Ы ы Ь ь Э э Ю ю Я я
Now compare that with the characters used in other languages that use the Cyrillic script some or all of the time, like Ukrainian:
А а Б б В в Г г Ґ ґ Д д Е е Є є Ж ж З з И и І і Ї ї Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ю ю Я я Ь ь
Or Belarusian (a.k.a. Byelorussian):
А а Б б В в Г г Д д Е е Ё ё Ж ж З з І і Й й К к Л л М м Н н О о П п Р р С с Т т У у Ў ў Ф ф Х х Ц ц Ч ч Ш ш Ы ы Ь ь Э э Ю ю Я я
Or Bulgarian:
А а Б б В в Г г Д д Е е Ж ж З з И и Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ Ь ь Ю ю Я я
Or Macedonian:
А а Б б В в Г г Д д Ѓ ѓ Е е Ж ж З з Ѕ ѕ И и Ј ј К к Л л Љ љ М м Н н Њ њ О о П п Р р С с Т т Ќ ќ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш
Or Serbian:
А а Б б В в Г г Д д Ђ ђ Е е Ж ж З з И и Ј ј К к Л л Љ љ М м Н н Њ њ О о П п Р р С с Т т Ћ ћ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш
Or Kazakh:
А а Ә ә Б б В в Г г Ғ ғ Д д Е е Ё ё Ж ж З з И и Й й К к Қ қ Л л М м Н н Ң ң О о П п Ө ө Р р С с Т т У у Ұ ұ Ү ү Ф ф Х х Һ һ Ц ц Ч ч Ш ш Щ щ Ъ ъ Ы ы І і Ь ь Э э Ю ю Я я
Or Kyrgyz:
А а Б б Г г Д д Е е Ё ё Ж ж З з И и Й й К к Л л М м Н н Ң ң О о Ө ө П п Р р С с Т т У у Ү ү Х х Ч ч Ш ш Ы ы Э э Ю ю Я я
Or Mongolian:
А а Б б В в Г г Д д Е е Ё ё Ж ж З з И и Й й К к Л л М м Н н О о Ө ө П п Р р С с Т т У у Ү ү Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ Ы ы Ь ь Э э Ю ю Я я
There are a lot of differences here, some of which are immediately apparent like more/fewer/different characters, and many others of which are described in the Wikipedia article about the Cyrillic script. And many of these differences are supported in the various locales on Windows.
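To make the "more/fewer/different characters" point concrete, here is a minimal Python sketch that diffs two of the alphabets listed above and then sorts a few letters by alphabet position rather than by code point. The helper names `only_in` and `alphabet_key` are hypothetical, invented for this illustration, and ranking letters by their position in a string is only a toy stand-in for a real weight table; it is not how Windows collation is implemented.

```python
# Uppercase alphabets, copied from the lists in this post.
RUSSIAN = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
UKRAINIAN = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЮЯЬ"


def only_in(a, b):
    """Letters of alphabet a that are absent from alphabet b, in a's order."""
    return [ch for ch in a if ch not in b]


def alphabet_key(alphabet):
    """Build a sort key that ranks each letter by its position in alphabet.

    Hypothetical helper: raises KeyError for letters outside the alphabet.
    """
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        return [rank[ch] for ch in word.upper()]

    return key


print(only_in(UKRAINIAN, RUSSIAN))  # ['Ґ', 'Є', 'І', 'Ї']
print(only_in(RUSSIAN, UKRAINIAN))  # ['Ё', 'Ъ', 'Ы', 'Э']

# Code point order puts Ё (U+0401) ahead of the whole А..Я block.
print(sorted(["Ж", "Ё", "Е"]))                             # ['Ё', 'Е', 'Ж']
print(sorted(["Ж", "Ё", "Е"], key=alphabet_key(RUSSIAN)))  # ['Е', 'Ё', 'Ж']

# The same letters land in different places depending on the alphabet used.
print(sorted(["Ь", "Я", "Ю"], key=alphabet_key(RUSSIAN)))    # ['Ь', 'Ю', 'Я']
print(sorted(["Ь", "Я", "Ю"], key=alphabet_key(UKRAINIAN)))  # ['Ю', 'Я', 'Ь']
```

The Ё case is the reason raw code point order is never good enough on its own, and the Ь case shows why one ordering cannot serve every language that shares the script.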
Although to be perfectly honest, a few of the differences are not there yet, despite the fact that the locale is there. Occasionally (to give an example) one of those seven people would be looking at a character not used in an existing collation, whose appearance and name (which has a 'with descender' or 'with upturn' in it) suggested it might have a secondary or diacritic difference, and would weight it that way, despite the fact that it actually is a separate letter that should have a primary weight (we were occasionally spoiled by typical usage in the Latin script!).
These are the kinds of things that can be considered bugs to fix in a future version of Windows, for obvious reasons.
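The difference between a secondary (diacritic) weight and a primary weight is easier to see with a toy example. The Python sketch below uses two made-up weight tables for a five-letter alphabet: one treats ґ (GHE WITH UPTURN) as a variant of г, the other treats it as a letter in its own right. The weights, the table names, and the `sort_key` helper are all assumptions for illustration; none of them come from the Windows default table, and real collation has more levels than the two shown here.

```python
# Hypothetical (primary, secondary) weights for a tiny alphabet.
# AS_VARIANT: ґ shares г's primary weight and differs only at the secondary level.
AS_VARIANT = {"а": (1, 0), "б": (2, 0), "г": (3, 0), "ґ": (3, 1), "д": (4, 0)}
# AS_LETTER: ґ gets its own primary weight, right after г.
AS_LETTER = {"а": (1, 0), "б": (2, 0), "г": (3, 0), "ґ": (4, 0), "д": (5, 0)}


def sort_key(word, table):
    """Two-level key: all primary weights first, secondary weights as tie-breaker."""
    primary = tuple(table[ch][0] for ch in word)
    secondary = tuple(table[ch][1] for ch in word)
    return (primary, secondary)


# Made-up two-letter strings, just enough to expose the difference.
words = ["ґа", "гб", "га"]

by_variant = sorted(words, key=lambda w: sort_key(w, AS_VARIANT))
by_letter = sorted(words, key=lambda w: sort_key(w, AS_LETTER))

print(by_variant)  # ['га', 'ґа', 'гб']
print(by_letter)   # ['га', 'гб', 'ґа']
```

With the variant table, ґ and г tie at the primary level, so the second letter decides and strings starting with ґ end up interleaved with those starting with г. With the letter table, everything starting with ґ sorts after everything starting with г, which is what a speaker who considers it a separate letter would expect.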
The same thing can be said of some of the many languages that use the Arabic script (for example, proper Farsi collation support was not added until Windows 2000 SP1/XP SP1/Server 2003 SP1).
There is a serious effort to clean up such problems in Vista, because as 'minor' as such problems may appear to be when looking at the 50,000+ code points in the default table, they are obviously major if they are happening in a language that is your own. If you know what I mean. And this weight 'fixing' is happening in Vista for languages in many scripts across the Unicode space....
This post brought to you by "А" (U+0410, a.k.a. CYRILLIC CAPITAL LETTER A)
(A letter that is quite proud to be at the very beginning of all Cyrillic scripts!)