Vietnamese still ain't quite right

by Michael S. Kaplan, published on 2008/03/26 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/03/26/8337198.aspx

Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)!

You may have read Vietnamese is a complex language on Windows, which discusses some fixes that were put in for Vietnamese in Vista.

The nature of the bug and the fixes that were put in were discussed there as well -- it amounted to some characters used in the language but not included in the Vietnamese exception table (characters also missing from the keyboard and the code page), plus a few inconsistencies found in the weights.

Anyway, recently the person who originally reported the bug commented on the Vista behavior:

Hi Michael,

Not until today, I got a chance to test the new collation on Vista. It seems that those bugs have been fixed for the Unicode composite format but not for the precomposed. Moreover, the fixes seem to have introduced new bugs. The list below includes the Vietnamese characters in question.

Reference : aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ

Composite : aAàÀảẢãÃáÁạẠâÂầẦẩẨẫăĂẪằấẤẰẳậẲẬẵẴắẮặẶiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ

Precomposed: aAàÀảẢáÁạẠãÃăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIỉỈĩĨíÍịỊìÌoOỏỎóÓọỌòÒõÕyYỳỲỷỶỹỸỵỴýÝ

Given the latest environment (in terms of OS, .NET framework), how can I get the correct sort?

Quan

Let's take a closer look at the values he gave, marking the ones that do not match the reference in red (blowing them up and putting the reference in the middle to make visual comparisons easier):

Composite : aAàÀảẢãÃáÁạẠâÂầẦẩẨẫăĂẪằấẤẰẳậẲẬẵẴắẮặẶiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ

Reference : aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ

Precomposed: aAàÀảẢáÁạẠãÃăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIỉỈĩĨíÍịỊìÌoOỏỎóÓọỌòÒõÕyYỳỲỷỶỹỸỵỴýÝ
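The same position-by-position comparison can be done in code. This is a minimal Python sketch using only the standard unicodedata module; the mismatches helper is a hypothetical name, and the three rows are copied from the report above. It only flags differences in the absolute sense (same index, different character), so it says nothing about whether neighboring characters are still in the right relative order.

```python
import unicodedata

# Rows copied from the report above.
REFERENCE = "aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ"
COMPOSITE = "aAàÀảẢãÃáÁạẠâÂầẦẩẨẫăĂẪằấẤẰẳậẲẬẵẴắẮặẶiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ"
PRECOMPOSED = "aAàÀảẢáÁạẠãÃăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIỉỈĩĨíÍịỊìÌoOỏỎóÓọỌòÒõÕyYỳỲỷỶỹỸỵỴýÝ"


def mismatches(reference, row):
    """Return (index, expected, actual) for each position where row differs."""
    # Normalize both sides to NFC so that each letter is one code point.
    ref = unicodedata.normalize("NFC", reference)
    got = unicodedata.normalize("NFC", row)
    return [(i, r, g) for i, (r, g) in enumerate(zip(ref, got)) if r != g]


for name, row in (("Composite", COMPOSITE), ("Precomposed", PRECOMPOSED)):
    diffs = mismatches(REFERENCE, row)
    print(f"{name}: {len(diffs)} positions differ from the reference")
    for index, expected, actual in diffs[:3]:
        print(f"  index {index}: expected {expected}, got {actual}")
```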

The information here is useful in the sense of reporting that there are problems, but ultimately not in resolving the problems.

Perhaps I should explain what I mean by that. :-)

Some of the ones marked "incorrect" are marked so in the absolute sense, but relatively speaking some are correct -- and none of this is marked (in the end only a few of the last row entries are wrong in the relative sense);
None of the ones on the composite row use the composite forms, meaning you cannot reproduce the same reported issue with these strings for the composite characters;
None of the intermediate forms that sit somewhere between normalization forms C and D are represented in the composite row, either;
These rows of characters are not entirely useful since they do not distinguish between primary, secondary, tertiary, and quaternary differences;
None of the E or U characters are included in there, which suggests a minimum of 46 missing vowels -- and given the problem that happened by not looking at the full set, anything less than the full set is ultimately insufficient;
Other letters like đĐ are also not in there, which leads to the same kind of potential problem;
No references to dictionaries or government standards are given, which means any actual contemplated fix would require that kind of supporting information.
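The points about composite and intermediate forms can be checked mechanically. The following is a minimal Python sketch using the standard unicodedata module; the describe helper is a hypothetical name, and ẫ is used only as a convenient example letter.

```python
import unicodedata


def describe(text):
    """Return (code points, is_nfc, is_nfd) for a string."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    is_nfc = unicodedata.normalize("NFC", text) == text
    is_nfd = unicodedata.normalize("NFD", text) == text
    return code_points, is_nfc, is_nfd


precomposed = "\u1EAB"         # ẫ as a single code point (form C)
decomposed = "a\u0302\u0303"   # a + circumflex + tilde (form D)
intermediate = "\u00E2\u0303"  # â + tilde, neither form C nor form D

for label, text in (("precomposed", precomposed),
                    ("decomposed", decomposed),
                    ("intermediate", intermediate)):
    print(label, describe(text))
```

All three strings are canonically equivalent, so a complete report would need to say which of them was actually fed to the sort.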

Now ignoring all that for a moment and explaining a bit about the actual differences noted....

The nature of the problem here is twofold:

There are still some letters that are being identified as being Vietnamese that are not in the Vietnamese exception/compression tables -- they stay where the default table puts them, which in a few cases is incorrect.
The identified diacritic marks were also "moved" in accordance with the expected weight results, even though some of the letters they combine with may not have been, given the above point (this causes the composite case to be better for the targeted letters though worse for some of the others).

As a rule, characters that are not used in a language do not tend to get moved along with the ones that are, but this particular discontinuity means that when such letters are sorted, their results may not be a 100% match for the precomposed/composite forms, which is why there is a difference between the two forms (the default table does match them, as do the identified Vietnamese letters; but when the moved diacritics combine with letters not in the known set, they will be moved out of matching their analogous precomposed forms).
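A toy model may make the discontinuity easier to see. Everything in this Python sketch is an assumption made for illustration: the weights are invented, the "exception table" is just a set of three letters, and none of it reflects the actual Windows tables. It only shows the mechanism described above, where moved diacritics still match precomposed letters that are in the table but stop matching the ones that were left out.

```python
import unicodedata

GRAVE, ACUTE, TILDE = "\u0300", "\u0301", "\u0303"

# Invented weights: the "default" order of the marks and the "moved" order.
DEFAULT_MARKS = {ACUTE: 1, GRAVE: 2, TILDE: 3}
MOVED_MARKS = {GRAVE: 1, TILDE: 2, ACUTE: 3}

# Pretend the exception table covers à, ã, á but misses the accented i letters.
EXCEPTION_LETTERS = {"\u00E0", "\u00E3", "\u00E1"}


def toy_key(text):
    """Toy sort key for one letter: precomposed, or base plus one mark."""
    if len(text) == 1:
        parts = unicodedata.normalize("NFD", text)
        if len(parts) == 1:
            return (parts, 0)  # plain base letter, no mark
        base, mark = parts
        # Precomposed letters only get the moved weight if they are in the table.
        table = MOVED_MARKS if text in EXCEPTION_LETTERS else DEFAULT_MARKS
        return (base, table[mark])
    base, mark = text
    return (base, MOVED_MARKS[mark])  # combining marks were always moved


# In the table: both spellings of á get the same key.
print(toy_key("\u00E1") == toy_key("a" + ACUTE))
# Not in the table: the two spellings of í get different keys.
print(toy_key("\u00ED") == toy_key("i" + ACUTE))
```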

This is a problem that cannot be fixed for Vietnamese without a major version change, which would mean a new version of Windows -- as I pointed out in 2001, a Correctness Odyssey (aka What's the matter with Ü?), it has been decided that this kind of change cannot be done in a service pack....

So in any case, someone over in NLS has some investigation to do, for both repertoire and order, for a future version of Windows.

This blog brought to you by Đ (U+0110, aka LATIN CAPITAL LETTER D WITH STROKE)

# Quan Nguyen on 29 Mar 2008 4:01 PM:

The "Composite" line is the precomposed (NFC) result of sorting the decomposed (NFD) characters. I should have left it in its original format when reporting.

Out of curiosity, does .NET API support conversion from NFC or NFD to the intermediate forms?

After gazing at the discrepancies for a while under your recast light, I discern that the composite got one half of the sort correct while the precomposed got the other half correct. One may start thinking that somehow by combining the sort results of the two forms, one could get the sort wholly correct, but it could get complicated, if feasible at all.

And even if that task could be accomplished, the contractions that are still in the table would throw another monkey wrench in the process -- causing, for instance, "co" to be listed before "cho", which is correct for traditional Vietnamese but not for the current modern collation.

Come to think of it, shouldn't the .NET API include support for a rule-based collation that can be defined by developers? That way if there are issues with the Windows collation tables (like Vietnamese and the German phonebook sorting order) or needs for different sort orders for the same language, people do not have to wait for many years for the next releases of Windows (and hope that things will be fixed correctly then). That would be a much more flexible and viable solution.
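As a rough illustration of what a developer-defined rule could look like outside the OS tables, here is a minimal Python sketch. It assumes the reference row from the original report is the desired order and ranks each character by its position in that row; rule_key is a hypothetical helper, not a .NET or Windows API. It is a single-level, character-by-character comparison, so it does not model primary/secondary/tertiary weights or contractions, and it only knows the a, i, o, and y letters that appear in the row.

```python
import unicodedata

# Desired order, copied from the reference row in the original report.
REFERENCE = "aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ"

# Rank of each precomposed character within the reference row.
RANK = {ch: i for i, ch in enumerate(unicodedata.normalize("NFC", REFERENCE))}


def rule_key(word):
    """Sort key: listed characters by rank, everything else after, by code point."""
    word = unicodedata.normalize("NFC", word)  # accept decomposed input too
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]


letters = ["\u00E1", "\u00E3", "\u00E0", "a"]  # á, ã, à, a
print(sorted(letters, key=rule_key))
```

Because the input is normalized first, precomposed and decomposed spellings of the same letter get the same key, which is the property the Vista results above lack for some letters.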

# Michael S. Kaplan on 29 Mar 2008 4:11 PM:

Out of curiosity, does .NET API support conversion from NFC or NFD to the intermediate forms?

No, sorry -- that was the point of the other posts on the subject, to show how there is no direct way to get to the intermediate forms, or the forms that would be removed based on not being in the proper canonical form.
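For testing purposes, one such intermediate form can be built by hand. This is a minimal Python sketch, not a .NET API: partial_compose is a hypothetical helper that decomposes a single letter fully and then recomposes only the base letter with its first combining mark. It produces just one of the possible intermediate spellings and assumes the input is a single letter.

```python
import unicodedata


def partial_compose(letter):
    """Compose the base letter with its first mark only; leave the rest decomposed."""
    nfd = unicodedata.normalize("NFD", letter)
    head = unicodedata.normalize("NFC", nfd[:2])  # base + first mark
    return head + nfd[2:]                         # remaining marks stay separate


for letter in ("\u1EAB", "\u1EAD"):  # ẫ and ậ
    form = partial_compose(letter)
    print(" ".join(f"U+{ord(ch):04X}" for ch in form))
```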

Come to think of it, shouldn't the .NET API include support for a rule-based collation that can be defined by developers?

Developers, by and large, are not linguists, and the road toward custom collations is one that is not likely to start in .NET (which is clearly moving, if not out of the business of providing support outside the OS, at least not deeper in).


referenced by

2010/03/09 Coloring outside the lines in the a-ness of the Hungarian Technical Sort
