How does Microsoft assign new collation weights?

by Michael S. Kaplan, published on 2005/09/12 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/09/12/463483.aspx

A few days ago, in response to my post about the silly NLS question, reader Gabe posted the following comment:

In all honesty, I can imagine somebody reading your blog, seeing you expound on various parts of Cyrillic, Chinese, and Sanskrit, and thinking that you actually know the languages written in those scripts. Or more likely, they might think that those are languages.

Much the way computer novices think that somebody who uses keyboard shortcuts must be a computer expert, it's pretty easy to assume that you must know some language because you know intricate details of its sort order. Hell, I once had an old Russian lady convinced that I was a native speaker because I was able to use all six words of the Russian vocabulary I know appropriately in conversation (with an accent I learned from movies).

And he makes an excellent point here. For the most part I do not speak these languages, but I have learned a lot about their scripts, the Unicode properties of their characters (when I say characters I mean in both the user sense and the code unit sense), and their various orderings across many locales. And although I am not one of the linguists who does much of the actual work of reverse engineering dictionaries and sorted word lists to determine what the collations are, I do work with them and have been the one checking in a lot of their work and the code that makes use of it.

Plus occasionally I have done a few (though by no means all!) of the orderings myself in XP SP2 and Vista, for languages using Han (Hanzi/Kanji/Hanja), Hangul, Arabic, and other parts of the Unicode code space that someone with mere delusions of linguistic aptitude like myself can handle, with the data assistance from others. :-)

So after I read Gabe's response, I looked back in email and found the following question sent to me on the contact link during a period when the suggestion box was temporarily unavailable (problems on the MSDN Blogs site), by a developer named Susan:

I can't make the suggestion box submit my question, so hopefully you will not mind me contacting you directly.

I was wondering how your team actually decides what weight you assign in the default table that you mention in your post at http://blogs.msdn.com/michkap/archive/2004/12/08/278170.aspx.

It seems like there are times that the results do not match a particular language. That may be just that I do not know all the languages that the default table supports. But I think it would be an interesting post to geeks like me for you to explain how the decision is made!!!

Well, anything for a fellow geek, Susan.... And sorry I took so long to get to the question! :-)

To start with, it is an understatement to call it a decision -- it is actually a huge series of decisions, made over a long period of time, and the reasons are many and varied:

Sometimes, there is an actual ordering for a specific language we support and it does not conflict with any of the weights that are already there. When that happens, the new characters can simply be inserted, using existing space in the weight table.
Other times, there is an actual ordering for a specific language we support and it does conflict with weights that are already there. In those cases, we put it in an exception table. But of course we have to add it somewhere in the default table too, so we end up doing one of a few different things with code points not already there:
- We may add it in the place that one of those default table languages might expect it due to its appearance;
- We may add it in a place consistent with how other characters have been added in (apparently) similar situations;
- We may add it to the end of the list of characters in the script.
Still other times, we may not have a specific language that needs the script but are trying to fill out a subrange of things in Unicode, in which case any of those previous three mechanisms might be used.
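
The relationship between the default table and an exception table can be sketched roughly as follows. This is only an illustration in Python: the weight values, the `xx-demo` locale name, and the helper functions are all hypothetical, and real weights are more than a single number per character.

```python
# Hypothetical weights for illustration only; these are not actual
# Windows collation values.
DEFAULT_TABLE = {
    "\u0433": 10,  # г
    "\u0491": 11,  # ґ, placed next to г in the default table because of its appearance
    "\u0434": 20,  # д
    "\u0435": 30,  # е
}

# An exception table holds only the weights that differ from the default
# table for one locale. "xx-demo" is a made-up locale name.
EXCEPTION_TABLES = {
    "xx-demo": {"\u0491": 25},  # this pretend locale wants ґ after д
}


def weight(ch, locale=None):
    """Return the exception weight if the locale has one, else the default."""
    exceptions = EXCEPTION_TABLES.get(locale, {})
    if ch in exceptions:
        return exceptions[ch]
    return DEFAULT_TABLE[ch]


def sort_key(word, locale=None):
    """Build a sort key as the list of per-character weights."""
    return [weight(ch, locale) for ch in word]


words = ["\u0434\u0435", "\u0491\u0435", "\u0433\u0435"]  # де, ґе, ге
default_order = sorted(words, key=sort_key)
demo_order = sorted(words, key=lambda w: sort_key(w, "xx-demo"))
```

With these made-up numbers the default table sorts ґе between ге and де, while the pretend locale moves it after де, without the default table changing at all.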

All of this has been done over the course of the last 10+ years by many different people (seven that I know of including myself!). It is definitely a situation where you are guaranteed to be consistent with some prior additions and inconsistent with some others, if you know what I mean.

So, let us look at some languages. There are the Cyrillic characters used in Russian:

А а Б б В в Г г Д д Е е Ё ё Ж ж З з И и Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ Ы ы Ь ь Э э Ю ю Я я

Now compare that with some of the ones used in other languages that make use of the Cyrillic script some or all of the time, like Ukrainian:

А а Б б В в Г г Ґ ґ Д д Е е Є є Ж ж З з И и І і Ї ї Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ю ю Я я Ь ь

Or Belarusian (a.k.a. Byelorussian):

А а Б б В в Г г Д д Е е Ё ё Ж ж З з І і Й й К к Л л М м Н н О о П п Р р С с Т т У у Ў ў Ф ф Х х Ц ц Ч ч Ш ш Ы ы Ь ь Э э Ю ю Я я

or Bulgarian:

А а Б б В в Г г Д д Е е Ж ж З з И и Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ Ь ь Ю ю Я я

Or Macedonian:

А а Б б В в Г г Д д Ѓ ѓ Е е Ж ж З з Ѕ ѕ И и Ј ј К к Л л Љ љ М м Н н Њ њ О о П п Р р С с Т т Ќ ќ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш

Or Serbian:

А а Б б В в Г г Д д Ђ ђ Е е Ж ж З з И и Ј ј К к Л л Љ љ М м Н н Њ њ О о П п Р р С с Т т Ћ ћ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш

Or Kazakh:

А а Ә ә Б б В в Г г Ғ ғ Д д Е е Ё ё Ж ж З з И и Й й К к Қ қ Л л М м Н н Ң ң О о П п Ө ө Р р С с Т т У у Ұ ұ Ү ү Ф ф Х х Һ һ Ц ц Ч ч Ш ш Щ щ Ъ ъ Ы ы І і Ь ь Э э Ю ю Я я

Or Kyrgyz:

А а Б б Г г Д д Е е Ё ё Ж ж З з И и Й й К к Л л М м Н н Ң ң О о Ө ө П п Р р С с Т т У у Ү ү Х х Ч ч Ш ш Ы ы Э э Ю ю Я я

Or Mongolian:

А а Б б В в Г г Д д Е е Ё ё Ж ж З з И и Й й К к Л л М м Н н О о Ө ө П п Р р С с Т т У у Ү ү Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ Ы ы Ь ь Э э Ю ю Я я

There are a lot of differences here, some of which are immediately apparent like more/fewer/different characters, and many others of which are described in the Wikipedia article about the Cyrillic script. And many of these differences are supported in the various locales on Windows.

Although to be perfectly honest, a few of the differences are not there yet, despite the fact that the locale is there. To give an example: occasionally one of those seven people was looking at a character not used in an existing collation, and its appearance and name (which has a 'with descender' or 'with upturn' in it) suggested it might have a secondary or diacritic difference. So that is the weight it was given, despite the fact that it actually is a separate letter that should have a primary weight (we were occasionally spoiled by typical usage in the Latin script!).
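
To see why this matters, here is a small Python sketch of the difference between giving a letter only a diacritic weight and giving it its own primary weight. The two-level key and all of the numbers are assumptions made for the illustration, not the actual table layout.

```python
def make_key(table):
    """Return a key function that compares all primary weights first,
    and only then the diacritic (secondary) weights."""
    def key(word):
        primary = tuple(table[ch][0] for ch in word)
        secondary = tuple(table[ch][1] for ch in word)
        return (primary, secondary)
    return key


# Hypothetical (primary, diacritic) weights.
# Here ґ is treated as г plus a diacritic difference:
AS_DIACRITIC = {
    "\u0430": (1, 0),  # а
    "\u0433": (2, 0),  # г
    "\u0491": (2, 1),  # ґ
    "\u044f": (3, 0),  # я
}
# Here ґ is treated as a separate letter with its own primary weight:
AS_LETTER = {
    "\u0430": (1, 0),  # а
    "\u0433": (2, 0),  # г
    "\u0491": (3, 0),  # ґ
    "\u044f": (4, 0),  # я
}

words = ["\u0491\u0430", "\u0433\u044f", "\u0433\u0430"]  # ґа, гя, га

interleaved = sorted(words, key=make_key(AS_DIACRITIC))
separate = sorted(words, key=make_key(AS_LETTER))
```

With the diacritic weighting, ґа lands between га and гя, mixed in with the г words. With the primary weighting, every г word comes first and ґа follows them, which is what a speaker who considers it a separate letter would expect.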

These are the kinds of things that can be considered bugs to fix in a future version of Windows, for obvious reasons.

The same thing can be said of some of the many languages that use the Arabic script (for example, proper Farsi collation support was not added until Windows 2000 SP1/XP SP1/Server 2003 SP1).

There is a serious effort to clean up such problems in Vista, because as 'minor' as such problems may appear to be when looking at the 50,000+ code points in the default table, they are obviously major if they are happening in a language that is your own. If you know what I mean. And this weight 'fixing' is happening in Vista for languages in many scripts across the Unicode space....

This post brought to you by "А" (U+0410, a.k.a. CYRILLIC CAPITAL LETTER A)
(A letter that is quite proud to be at the very beginning of all Cyrillic scripts!)

# Ivan Petrov on Thursday, September 15, 2005 5:46 PM:

Hi Michael,

I want to post here something very interesting, in my opinion, partly in connection with the post above - I mean the Bulgarian script (if I've understood you correctly).

So the post:

Alphabet of the modern Bulgarian written language
-------------------------------------------------

The present Bulgarian alphabet has 30 letters:

Аа Бб Вв Гг Дд Ее Жж Зз Ии Йй
Кк Лл Мм Нн Оо Пп Рр Сс Тт Уу
Фф Хх Цц Чч Шш Щщ Ъъ ь* Юю Яя

* there is NO word in Bulgarian, that begins with the capital letter ‘Ь’!

Script of the modern Bulgarian written language
-----------------------------------------------

Аа (А̀а̀) Бб Вв Гг Дд Ее (Ѐѐ) Жж Зз Ии (Ѝѝ) Йй
Кк Лл Мм Нн Оо (О̀о̀) Пп Рр Сс Тт Уу (У̀у̀)
Фф Хх Цц Чч Шш Щщ Ъъ (Ъ̀ъ̀) Ьь Юю (Ю̀ю̀) Яя (Я̀я̀)

Stressed vowels with grave accent in Bulgarian written language & UNICODE
-------------------------------------------------------------------------

1. Stressed vowels with grave accent in Bulgarian written language:

А̀а̀ Ѐѐ Ѝѝ О̀о̀ У̀у̀ Ъ̀ъ̀ Ю̀ю̀ Я̀я̀ (16)

2. UNICODE situation:

Today in UNICODE we can find in precomposed form only 4 of all the 16 needed stressed vowels with grave accent used in the modern Bulgarian written language:

Ѐ – CYRILLIC CAPITAL LETTER IE WITH GRAVE - http://www.fileformat.info/info/unicode/char/0400/index.htm
Ѝ – CYRILLIC CAPITAL LETTER I WITH GRAVE - http://www.fileformat.info/info/unicode/char/040D/index.htm
ѐ – CYRILLIC SMALL LETTER IE WITH GRAVE - http://www.fileformat.info/info/unicode/char/0450/index.htm
ѝ – CYRILLIC SMALL LETTER I WITH GRAVE - http://www.fileformat.info/info/unicode/char/045d/index.htm

Q: Why do the Bulgarians need them (the stressed vowels with grave accent)?
A: Because of the shifting stress!
------------------------------------------------------------------------

As we can see at http://en.wikipedia.org/wiki/Bulgarian_language#Word_stress the Bulgarian written language has a distinctive stress: for example, въ̀лна /v'əlna/ ("wool") and вълна̀ /vəln'a/ ("wave") are only differentiated by stress.

Regards,
Ivan.

# Michael S. Kaplan on Thursday, September 15, 2005 8:30 PM:

Hi Ivan -- Precomposed forms are not required for representation, and never have been. I am hoping you will accept this some day. :-)
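
One way to check this is with Unicode normalization. The sketch below uses Python's standard `unicodedata` module; the `compose` helper is a name made up for the illustration.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def compose(base):
    """Attach a combining grave to a base letter and normalize to NFC."""
    return unicodedata.normalize("NFC", base + GRAVE)


# Е + combining grave composes to the precomposed U+0400.
ie_with_grave = compose("\u0415")

# А + combining grave has no precomposed form, so it stays as two code
# points, and it is still a perfectly valid representation.
a_with_grave = compose("\u0410")

# Going the other way, the precomposed letter decomposes to the same
# base + combining mark sequence, so the two spellings are equivalent.
decomposed = unicodedata.normalize("NFD", "\u0400")
```

Both forms are the same text as far as Unicode is concerned; the precomposed one is just a shorter spelling of the sequence.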

# Ivan Petrov on Friday, September 16, 2005 3:38 AM:

Hi Michael,
I remember that you have dedicated a whole post as an answer to my question:

Can I get my characters into Unicode?
which can be found on http://blogs.msdn.com/michkap/archive/2005/02/06/367985.aspx

So, I accept and understood that "Precomposed forms are not required for representation" clearly!

But I'm wondering in this case, why we have all of the Latin vowels with grave and acute accent in precomposed form in UNICODE? ... And this is why I think that the stressed vowels used in the Bulgarian written language must be in UNICODE in precomposed form!

Basically, my post above was mainly about the Bulgarian script, which consists of 76 characters, not of 60!

Regards,
Ivan.

# Michael S. Kaplan on Friday, September 16, 2005 4:01 AM:

Everything you need can be represented with what is currently encoded.

# Ivan Petrov on Friday, September 16, 2005 4:11 AM:

Hi again,

Ok, you're the techlead ;-) ... so I agree with you ;-)

... One of these days soon I will post in the Suggestion box a very interesting question about the support of UNICODE in Command Prompt/Shell ...

... I saw that the MSKLC file size issue was now corrected! ... Good! ;-)

See you.

Regards,
Ivan.

# Michael S. Kaplan on Sunday, September 18, 2005 10:23 AM:

Ivan -- I wanted to point out that for all of the collation frameworks out there, providing single weights for multiple code points is trivial, as is providing keyboard options to type the data -- so precomposed forms are not required.
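
As a rough sketch of what a single weight for multiple code points can look like, here is a tiny Python lookup that tries a two code point sequence before falling back to a single code point. The table contents, the weight values, and the `collation_elements` name are assumptions for the illustration, not how any particular framework stores its data.

```python
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

# Hypothetical (primary, diacritic) weights. A key may be a single code
# point or a base letter followed by a combining mark.
WEIGHTS = {
    "\u0430": (1, 0),          # а
    "\u0430" + GRAVE: (1, 1),  # а + combining grave, no precomposed form
    "\u0435": (2, 0),          # е
    "\u0435" + GRAVE: (2, 1),  # е + combining grave
    "\u0450": (2, 1),          # precomposed ѐ, same weight as the sequence
}


def collation_elements(text):
    """Map text to weights, preferring the two code point match."""
    elements = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in WEIGHTS:
            elements.append(WEIGHTS[pair])
            i += 2
        else:
            elements.append(WEIGHTS[text[i]])
            i += 1
    return elements


from_sequence = collation_elements("\u0435" + GRAVE)
from_precomposed = collation_elements("\u0450")
```

Both spellings of ѐ produce the same single weight here, and а with a combining grave gets one as well even though it has no precomposed code point.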

But I was curious about the collation side. When you compare the stressed to the nonstressed part:

Аа (А̀а̀)
Ее (Ѐѐ)
Ии (Ѝѝ)
Оо (О̀о̀)
Уу (У̀у̀)
Ъъ (Ъ̀ъ̀)
Юю (Ю̀ю̀)
Яя (Я̀я̀)

(note that for font support to work best here, the font just needs the hints in it for attachment points, something that should get better in Vista but I will forward this to some of the people in Typography to be sure)

Anyway, when comparing stressed to unstressed, are the stressed letters considered to be completely separate letters, like 'Ä' in Swedish, or do they just have a diacritic difference, like in German?

# Ivan Petrov on Monday, September 26, 2005 2:52 PM:

Hi Michael.

Sorry for the late answer, but I've had work to do ...

... Anyway, ... when comparing stressed to unstressed letters the difference is diacritic, I mean that any stressed letter is the same as the unstressed letter, but with a grave accent, which in this case is a diacritic sign. As I explained above, the grave accent is used because of the shifting stress, which means that some words may have two different meanings depending on where I put the stress when I pronounce them. So, when I write down one of these words, grammatically I'm obligated to put a grave accent on the vowel of that word which gives it the sense that I really mean.
The example above, which I will now repeat, is very indicative! So:

въ̀лна /v'əlna/ ("wool") and вълна̀ /vəln'a/ ("wave")

Here in this example we have the word "вълна", which may have two different meanings depending on where we put the stress. If I put the stress on the first vowel "ъ" the word means "wool", but if I put the stress on the second vowel - in this case the word means "wave".

I hope now that I was clear ;-)
If any questions, you're welcome.

... By the way on the "Suggest a Topic!" page disappeared my 'Bulgarian Locale' questions and suggestions. Does that mean that you're not interested in that or you'll post soon an answer of that topic?

Regards,
Ivan.

# Michael S. Kaplan on Monday, September 26, 2005 3:14 PM:

Thanks for the info, Ivan!

About the Bulgarian locale info you posted, I forwarded the info to the owner of locale data to look at what should and what could be updated. Once I did that, it did not make sense to leave it there....

# Ivan Petrov on Monday, September 26, 2005 3:37 PM:

Hi again ;-)

... As I said above: You're always welcome! ;-) ... It's always a pleasure to communicate with you! ... So, I'm happy I have this possibility ....

By the way, is Shawn Steele the locale data owner guy? .....

# Michael S. Kaplan on Monday, September 26, 2005 4:32 PM:

No, Shawn is the dev owner of the code that uses the data, but it is a PM, with the help of SPMs, who owns the actual data decisions....

# Ivan Petrov on Monday, September 26, 2005 4:39 PM:

Ok, thanks Michael.

See you.
