Vietnamese is a complex language on Windows

by Michael S. Kaplan, published on 2005/08/27 22:50 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/27/457224.aspx

Back in May of 2004, Quan Nguyen sent a message to Dr. International about Vietnamese collation in Windows and the .NET Framework:

This was not the only place where the question was asked -- Quan had asked this same question on several newsgroups and in other places. We requested some more details, did the investigation, and were able to report on the claim -- he was right, there were a few letters that did not sort properly. In the end, the problem basically consisted of the uppercase and lowercase versions of the following letters:

Of course since these letters are in Unicode and are used by several other languages, they have some default weights -- but they are not in the Vietnamese exception table. And their weights in the default table are not completely correct....

Now no one had reported this problem before, so hopefully these are letters that are not used often in Vietnamese in situations where the small but definite differences in collation would be noticed.

Which is not to say it is not a bug or that it should not be fixed -- it definitely is.

But it is to perhaps explain why it took so long for someone to report to Microsoft a bug that has been in the code page and sorting tables since the very first Vietnamese-enabled versions of Windows....

Now Windows code page 1258 has its own set of problems here, because the above characters are not in cp1258, either. Well, they sort of are, as combining sequences, since the code page includes U+0300, U+0301, and U+0303 -- but converting the above characters to and from Unicode can be quite nightmarish, for the reasons I mentioned when I pointed out a few of the gotchas of MultiByteToWideChar. To avoid that, we would have had to include them in the precomposed form listed above, and there are not enough free slots to do so (even if we were able to modify code pages, which we are not, as I explained when I wrote about why we cannot change the code pages).
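To make the cp1258 point concrete, here is a minimal sketch in Python. It assumes that Python's built-in "cp1258" codec is a reasonable stand-in for the Windows code page table; it does not call MultiByteToWideChar or WideCharToMultiByte, whose flags and behavior may differ. The helper names cp1258_status and round_trip are made up for this illustration.

```python
import unicodedata

def cp1258_status(ch):
    """Report how a single character can be stored in cp1258, if at all."""
    try:
        ch.encode("cp1258")
        return "direct"            # the precomposed letter has its own slot
    except UnicodeEncodeError:
        pass
    decomposed = unicodedata.normalize("NFD", ch)
    try:
        decomposed.encode("cp1258")
        return "decomposed"        # only base letter + combining mark fits
    except UnicodeEncodeError:
        return "unencodable"

def round_trip(ch):
    """Encode the decomposed form to cp1258 and decode it again."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed.encode("cp1258").decode("cp1258")

if __name__ == "__main__":
    # U+00E1 (a with acute) and U+00DD (Y with acute) behave differently.
    for ch in ("\u00e1", "\u00dd"):
        print(f"U+{ord(ch):04X}", cp1258_status(ch))
    # What comes back is not the code point that went in.
    print([f"U+{ord(c):04X}" for c in round_trip("\u00dd")])
```

The round trip hands back a base letter plus a combining mark rather than the original precomposed code point, so any code that compares strings code point by code point has to normalize first.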

So let's just assume that cp1258 is about as limited in use as all of the other (at last count 42!) attempts at 8-bit encodings of Vietnamese (they all have problems because there are too many characters or not enough slots to put them in) and stick with Unicode....

Getting back to collation, this particular problem that Quan Nguyen reported is fixed in the updated sorting tables in ~~Longhorn~~Vista Beta 1. It could not be fixed in earlier versions of Windows or the .NET Framework, as it requires a major version change for Vietnamese to change the weights of code points that already have weights defined, so Vista is our first chance to make the fix (Whidbey's sorting tables are not being updated, so the fix could not be made in .NET 2.0).

On a happier note, the font story for Vietnamese has been really good on Windows for a while now, for all of these various letters.

It just took a little while for the NLS side of GIFT to catch up with everyone else, that's all. :-)

This post brought to you by "Ý" (U+00dd, a.k.a. LATIN CAPITAL LETTER Y WITH ACUTE)

Hi Michael,

Not until today did I get a chance to test the new collation on Vista. It seems that those bugs have been fixed for the Unicode composite format but not for the precomposed one. Moreover, the fixes seem to have introduced new bugs. The list below includes the Vietnamese characters in question.

Reference : aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ

Composite : aAàÀảẢãÃáÁạẠâÂầẦẩẨẫăĂẪằấẤẰẳậẲẬẵẴắẮặẶiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ

Precomposed: aAàÀảẢáÁạẠãÃăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIỉỈĩĨíÍịỊìÌoOỏỎóÓọỌòÒõÕyYỳỲỷỶỹỸỵỴýÝ

Given the latest environment (in terms of OS, .NET framework), how can I get the correct sort?

Quan

Kind of unfortunate that you just got the chance to look at it, I mean since it was available to be looked at in beta form literally for years before Vista was released. :-(

But I will forward it on to the team to take a look at the issue for consideration in a future version.

There is no way to get "the correct sort" until the correct sort is added, though....

It is unfortunate that no one else had verified the changes while Vista was still in beta. It may have been assumed that Microsoft would do it correctly.

I also notice that the current Vietnamese collation implementation in both Windows XP and Vista has erroneously included several letter contractions. The Vietnamese modern collation does not have any contractions, as opposed to the traditional one. As such, "chó" should collate before "có".

http://developer.mimer.com/collations/charts/vietnamese.htm

http://developer.mimer.com/collations/charts/vietnamese_traditional.htm

Hope all the collation problems mentioned will be fixed for both precomposed and composite forms in the near future.

Thanks.
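The contraction point in the comment above can be illustrated with a toy model. This is only a sketch: the unit ranks are invented for the example and are not the Windows weights or any published Vietnamese table, the tone-marked letter is treated as its own unit purely to keep the code short, and the names tokenize and make_key are hypothetical.

```python
def tokenize(word, contractions):
    """Split a word into collation units, treating contractions as one unit."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in contractions:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

def make_key(alphabet, contractions=()):
    """Build a sort-key function from an ordered list of collation units."""
    rank = {unit: n for n, unit in enumerate(alphabet)}
    def key(word):
        return [rank[unit] for unit in tokenize(word, contractions)]
    return key

O_ACUTE = "\u00f3"
CHO, CO = "ch" + O_ACUTE, "c" + O_ACUTE

# Invented toy alphabets: one with "ch" as a letter of its own, one without.
with_contraction = make_key(["c", "ch", "h", "o", O_ACUTE], contractions={"ch"})
without_contraction = make_key(["c", "h", "o", O_ACUTE])

if __name__ == "__main__":
    print(sorted([CHO, CO], key=with_contraction))
    print(sorted([CHO, CO], key=without_contraction))
```

With "ch" as a contraction, every word starting with it sorts after all the plain "c" words. Without the contraction the comparison falls through to "h" versus the vowel, which gives the order the comment asks for.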

These changes can only happen in major versions of Windows (so "near future" is pretty unlikely without a version that is even in beta), and fixes are not (and cannot be) based on other companies providing their own sorting tables from their products -- so the mimer SQL tables are not something that we could or would use as source material.

But as I said, I have forwarded the information for consideration in a future version.

FWIW, please note that providing a list like you did is not specifically helpful since it gives no information about primary vs. secondary vs. tertiary vs. quaternary weights, and one could therefore produce that order and still have wrong results in actual real world words in the language....
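A small sketch of why an ordered list of letters underdetermines the weights. The two tables below are invented for the illustration (they are not Windows or Vietnamese data): one treats the accent as a secondary difference, the other treats the accented letter as a separate primary letter. Both sort the individual letters identically, yet they disagree on words.

```python
A_ACUTE = "\u00e1"

# Hypothetical weights: (primary, secondary) per letter.
ACCENT_IS_SECONDARY = {"a": (1, 0), A_ACUTE: (1, 1), "b": (2, 0)}
ACCENT_IS_PRIMARY = {"a": (1, 0), A_ACUTE: (2, 0), "b": (3, 0)}

def sort_key(word, table):
    """Compare all primary weights first, then all secondary weights."""
    weights = [table[ch] for ch in word]
    primary = tuple(w[0] for w in weights)
    secondary = tuple(w[1] for w in weights)
    return (primary, secondary)

def ordered(items, table):
    return sorted(items, key=lambda item: sort_key(item, table))

letters = ["b", A_ACUTE, "a"]
words = ["ab", A_ACUTE + "a"]

if __name__ == "__main__":
    for name, table in (("secondary", ACCENT_IS_SECONDARY),
                        ("primary", ACCENT_IS_PRIMARY)):
        print(name, ordered(letters, table), ordered(words, table))
```

Matching a reference list of single letters is therefore a necessary check but not a sufficient one; real words are needed to tell the two tables apart.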

Those tables only serve as graphical illustrations of what currently is in the Unicode Collation Algorithm. The DUCET data at UCA should provide complete reference information for the collation implementation.

Btw, do I understand correctly that SQL Server 2000/2005 also rely on Windows' Collation Table to sort and compare character strings, or they have their own collation table?

Microsoft does not use the UCA, which makes attempting to use UCA tailorings not quite so useful -- even if the other reasons did not exist (and they do). Actual dictionaries and/or government standards information is required....

SQL Server uses their own snapshots of the Windows collations at various points; they do not directly call the Win32 NLS API.

Hi Michael,

I'm truly dismayed to find out that the upcoming Windows 7 still has not got the Vietnamese collation right! Same error. Did the Microsoft Internationalization Team take a look at this bug at all after all these years???

With the final release still at least half a year away, can Microsoft correct this serious bug -- I'm speaking from the perspective of Vietnamese developers and users -- and get the fix into the final baseline?

Look forward to your response.

Quan

Quan,

I am no longer on the team responsible for this data, so I cannot speak for them. I have forwarded the information on to them.

But every single time I have discussed this bug here I have pointed out flaws in the report that keep it from being terribly actionable, at the same time that I discussed possible reasons for the differences.

I suppose I could be dismayed that a person who reports a problem is given feedback about issues with the report that can hinder and/or block its resolution, and that the feedback is never answered, but I am not. Because thus far there is only one person making the complaint about the problem, and that is the person who has not answered the queries given. Thus, lacking other complaints about the problem, the problem for the present still seems to be unactionable. :-(

I did cite three sources that contain complete information about primary, secondary, and tertiary weights of Vietnamese characters. Info about the quaternary differences can be found at http://vietunicode.sourceforge.net/charset/quytacABC_en.html.

However, in case those sources are still deemed insufficient evidence and an actual dictionary is required, then perhaps your colleagues in Vietnam could provide you with a copy; or a visit to a bookstore in a Vietnamese neighborhood near you, if there is a sizable population of Vietnamese expatriates there, could present an excellent opportunity to acquire one; or, if time is a critical factor, I would be glad to donate my extra copy of the Vietnamese Dictionary published in 2003 by the Vietnam Lexicography Centre to serve as a standard reference and design specification for Vietnamese collation.

As users in Vietnam are becoming more aware of the problem with Windows’ Vietnamese collation data, they surely will make their case known to Windows Beta Team in the days ahead.

Thanks.

Hi Michael,

It's a very old post but still important. I was wondering whether there is some kind of workaround for WideCharToMultiByte in Vietnamese. Everything is OK except for the letters with double diacritics, like ệ or ữ.

What I don't understand is how it is different from Thai. Thai worked without any tweaking!

Is it possible to change the input and then recreate the characters that the conversion could not handle? I mean they exist in both cp1258 and Unicode, so this should be possible.

Best regards,

Vadim
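One possible shape for the workaround asked about above, sketched in Python rather than Win32. It assumes that Python's "cp1258" codec matches the Windows code page closely enough to show the idea, and it says nothing about what WideCharToMultiByte itself does with these letters. The idea is to respell each letter that has no slot of its own as a letter the code page does have, followed by the combining marks it has, and to normalize again after decoding. The names encodable, fit_char, to_cp1258 and from_cp1258 are hypothetical.

```python
import unicodedata

def encodable(text):
    try:
        text.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False

def fit_char(ch):
    """Return a canonically equivalent spelling of ch that cp1258 can hold."""
    if encodable(ch):
        return ch
    nfd = unicodedata.normalize("NFD", ch)
    base, marks = nfd[0], nfd[1:]
    # Try the full decomposition, then base + one mark recomposed into a
    # single letter with the remaining marks left as combining characters.
    candidates = [nfd]
    for i, mark in enumerate(marks):
        letter = unicodedata.normalize("NFC", base + mark)
        candidates.append(letter + marks[:i] + marks[i + 1:])
    target = unicodedata.normalize("NFC", ch)
    for candidate in candidates:
        # Accept only spellings that still mean the same character.
        if encodable(candidate) and unicodedata.normalize("NFC", candidate) == target:
            return candidate
    raise ValueError(f"no cp1258 spelling found for U+{ord(ch):04X}")

def to_cp1258(text):
    composed = unicodedata.normalize("NFC", text)
    return "".join(fit_char(ch) for ch in composed).encode("cp1258")

def from_cp1258(data):
    # Recompose so callers get the precomposed letters back.
    return unicodedata.normalize("NFC", data.decode("cp1258"))

if __name__ == "__main__":
    sample = "Ti\u1ebfng Vi\u1ec7t"
    encoded = to_cp1258(sample)
    print(encoded, from_cp1258(encoded) == sample)
```

The equivalence check matters because picking the wrong mark to fold into the base letter could otherwise produce a spelling that encodes fine but is a different character.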