Coloring outside the lines in the a-ness of the Hungarian Technical Sort

by Michael S. Kaplan, published on 2010/03/09 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/03/09/9975184.aspx

There is almost a Beavis/Butthead joke in the title of this blog, isn't there? :-)

After, in the course of blogging, I blogged that blog on this Blog entitled Burn Windows Burn (aka If we want to unsay *this* one, we cannot say "Mu"), regular reader and colleague and friend Mihai commented:

There are other funny things about Hungarian technical sort when mixing the wide versions of the characters.

Below I will use the wide versions (U+FF21 and U+FF41) at left, and regular versions (U+0041 & U+0061) at right.

hu-HU_technl:

ａ < a
ａ > A
Ａ > A
Ａ > a

With NORM_IGNOREWIDTH:

ａ < a
ａ == A
Ａ == a
Ａ > A

hu-HU seems to behave as expected.

(All on Win 7)

Now this is all essentially by design, sort of. Because the Hungarian Technical Sort only defines what is done with the specific characters that are generally used in Hungarian. So if you move some characters around and not others, then you will get such anomalies any time you compare characters from both sets (the modified ones and the unmodified ones).

This has even caused specific problems before, any time the defined subset is incomplete, as discussed in blogs like Vietnamese is a complex language on Windows and Vietnamese still ain't quite right. Those show how the same underlying cause (an incomplete subset) simply makes the results look wrong, but wrong in a way that does not help someone outside of the process easily discern the nature of the problem (or even describe it meaningfully enough to report it, sometimes!).

Now this problem has come up kind of on the sidelines, as it were, both when it isn't a bug, as in blogs such as See that version there? It is going down, man! #2 (aka Everybody WYNNs), and when it is a bug, as in blogs such as If this post really describes a bug, would I actually put it in the WYNN column?.

Getting back to Hungarian and the Hungarian Technical Sort for a moment, my blog about it (Technically it *is* a hungarian sort) didn't tell the whole story.

In it, I gave three characteristics of the sort:

None of the compressions that I have talked about previously
None of those Hungarian double compressions, either
The uppercase letters come before the lowercase ones, unlike most other language collations on Microsoft products

But there is a fourth characteristic that I did not mention:

All of the letters with diacritics in Hungarian that are given secondary distinctions from the same letters without are, in this alternate sort, given primary distinctions

Because of this, letters like Á (U+00c1, aka LATIN CAPITAL LETTER A WITH ACUTE), which would otherwise be sorted as a variation of A with a secondary distinction, are treated like they are unique letters with the same kind of standing as B or any other letter.

Taking a quick look at the "A-type letters" or alternately the "letters with A-ness" in the Hungarian Technical Sort (via those same tables I mentioned here):

0x0041 14 2 2 2 ;LATIN CAPITAL LETTER A
0x0061 14 2 2 18 ;LATIN SMALL LETTER A
0x00c1 14 3 2 2 ;LATIN CAPITAL LETTER A WITH ACUTE
0x00e1 14 3 2 18 ;LATIN SMALL LETTER A WITH ACUTE
0x00c2 14 4 2 2 ;LATIN CAPITAL LETTER A WITH CIRCUMFLEX
0x00e2 14 4 2 18 ;LATIN SMALL LETTER A WITH CIRCUMFLEX
0x00c4 14 5 2 2 ;LATIN CAPITAL LETTER A WITH DIAERESIS
0x00e4 14 5 2 18 ;LATIN SMALL LETTER A WITH DIAERESIS
0x0102 14 6 2 2 ;LATIN CAPITAL LETTER A WITH BREVE
0x0103 14 6 2 18 ;LATIN SMALL LETTER A WITH BREVE
0x0104 14 7 2 2 ;LATIN CAPITAL LETTER A WITH OGONEK
0x0105 14 7 2 18 ;LATIN SMALL LETTER A WITH OGONEK

Now these are the only letters with any kind of a-ness that this sort will flip the capitalization of. Note how each upper/lower pair has its own unique alphabetic weight (AW), the second number in each row.
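To make the listing a bit more concrete, here is a minimal Python sketch (not anything Windows actually runs) that reads the rows above as four weights per character and sorts the letters by them. Only AW is named in the text; the other column names (SM, DW, CW), the `parse_rows` helper, and the idea of comparing one character's weights as a plain tuple are assumptions made for illustration.

```python
# Rows copied from the Hungarian Technical Sort listing above.
# ASSUMPTION: the four numbers are read as (SM, AW, DW, CW); only AW is
# named in the text, the other column names are a guess for illustration.
TECHNICAL_ROWS = """
0x0041 14 2 2 2
0x0061 14 2 2 18
0x00c1 14 3 2 2
0x00e1 14 3 2 18
0x00c2 14 4 2 2
0x00e2 14 4 2 18
0x00c4 14 5 2 2
0x00e4 14 5 2 18
0x0102 14 6 2 2
0x0103 14 6 2 18
0x0104 14 7 2 2
0x0105 14 7 2 18
"""


def parse_rows(text):
    """Map each character to its tuple of four weights (hypothetical helper)."""
    weights = {}
    for line in text.split("\n"):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        code_point = int(fields[0], 16)
        weights[chr(code_point)] = tuple(int(f) for f in fields[1:5])
    return weights


TECHNICAL = parse_rows(TECHNICAL_ROWS)

# SIMPLIFICATION: for single characters, comparing the weight tuples
# left to right is enough to see the ordering the table implies.
ordered = sorted(TECHNICAL, key=TECHNICAL.get)
distinct_aw = sorted({w[1] for w in TECHNICAL.values()})

# Print code points rather than glyphs so the output is console-safe.
print(" ".join(f"U+{ord(ch):04X}" for ch in ordered))
print(distinct_aw)
```

Sorted this way, each uppercase letter lands directly ahead of its lowercase partner, and the six pairs sit on six different AW values.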

Now note that this will not break for Normalization Form D since each of these letters had their compression versions added too:

0x0041 0x0301 14 3 2 2 ;A With Acute
0x0061 0x0301 14 3 2 18 ;a With Acute
0x0041 0x0302 14 4 2 2 ;A With Circumflex
0x0061 0x0302 14 4 2 18 ;a With Circumflex
0x0041 0x0308 14 5 2 2 ;A With Diaeresis
0x0061 0x0308 14 5 2 18 ;a With Diaeresis
0x0041 0x0306 14 6 2 2 ;A With Breve
0x0061 0x0306 14 6 2 18 ;a With Breve
0x0041 0x0328 14 7 2 2 ;A With Ogonek
0x0061 0x0328 14 7 2 18 ;a With Ogonek
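The claim that Form D is covered can be checked against Unicode itself. The sketch below uses Python's standard `unicodedata.normalize` to decompose each precomposed letter from the first listing and confirm that the resulting pair has a compression entry in the listing just above. The names `PRECOMPOSED`, `COMPRESSIONS` and `nfd_code_points` are made up for this example, and U+01CE (a with caron) is just one arbitrary letter from outside the list.

```python
import unicodedata

# Precomposed letters with diacritics from the first technical listing.
PRECOMPOSED = [0x00C1, 0x00E1, 0x00C2, 0x00E2, 0x00C4, 0x00E4,
               0x0102, 0x0103, 0x0104, 0x0105]

# Two-code-point entries copied from the listing above.
COMPRESSIONS = {
    (0x0041, 0x0301), (0x0061, 0x0301),  # acute
    (0x0041, 0x0302), (0x0061, 0x0302),  # circumflex
    (0x0041, 0x0308), (0x0061, 0x0308),  # diaeresis
    (0x0041, 0x0306), (0x0061, 0x0306),  # breve
    (0x0041, 0x0328), (0x0061, 0x0328),  # ogonek
}


def nfd_code_points(code_point):
    """Return the NFD form of one character as a tuple of code points."""
    decomposed = unicodedata.normalize("NFD", chr(code_point))
    return tuple(ord(ch) for ch in decomposed)


# Every precomposed letter in the table decomposes to a listed pair.
covered = all(nfd_code_points(cp) in COMPRESSIONS for cp in PRECOMPOSED)

# A letter outside the list decomposes to a pair with no entry.
caron = nfd_code_points(0x01CE)

print(covered)
print([f"U+{cp:04X}" for cp in caron], caron in COMPRESSIONS)
```

So the ten letters in the table are consistent between the two forms, while a letter that is not in the table has nothing special waiting for its decomposed form.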

So everything is good, right?

But wait.

What about the letter A combined with other diacritics not on this small list (there are dozens commonly in use)?

Here is the table data of a bunch of those same characters and a few others nearby, including a symbol or two:

0x0061 14 2 2 2 ;LATIN SMALL LETTER A
0xff41 14 2 2 3 ;FULLWIDTH LATIN SMALL LETTER A
0x0041 14 2 2 18 ;LATIN CAPITAL LETTER A
0xff21 14 2 2 19 ;FULLWIDTH LATIN CAPITAL LETTER A
0x00e1 14 2 14 2 ;LATIN SMALL LETTER A WITH ACUTE
0x00c1 14 2 14 18 ;LATIN CAPITAL LETTER A WITH ACUTE
0x00e0 14 2 15 2 ;LATIN SMALL LETTER A WITH GRAVE
0x00c0 14 2 15 18 ;LATIN CAPITAL LETTER A WITH GRAVE
0x0227 14 2 16 2 ;LATIN SMALL LETTER A WITH DOT ABOVE
0x0226 14 2 16 18 ;LATIN CAPITAL LETTER A WITH DOT ABOVE
0x00e2 14 2 18 2 ;LATIN SMALL LETTER A WITH CIRCUMFLEX
0x00c2 14 2 18 18 ;LATIN CAPITAL LETTER A WITH CIRCUMFLEX
0x00e4 14 2 19 2 ;LATIN SMALL LETTER A WITH DIAERESIS
0x00c4 14 2 19 18 ;LATIN CAPITAL LETTER A WITH DIAERESIS
0x01ce 14 2 20 2 ;LATIN SMALL LETTER A WITH CARON
0x01cd 14 2 20 18 ;LATIN CAPITAL LETTER A WITH CARON
0x0103 14 2 21 2 ;LATIN SMALL LETTER A WITH BREVE
0x0102 14 2 21 18 ;LATIN CAPITAL LETTER A WITH BREVE
0x0101 14 2 23 2 ;LATIN SMALL LETTER A WITH MACRON
0x0100 14 2 23 18 ;LATIN CAPITAL LETTER A WITH MACRON
0x00e3 14 2 25 2 ;LATIN SMALL LETTER A WITH TILDE
0x00c3 14 2 25 18 ;LATIN CAPITAL LETTER A WITH TILDE
0x00e5 14 2 26 2 ;LATIN SMALL LETTER A WITH RING ABOVE
0x00c5 14 2 26 18 ;LATIN CAPITAL LETTER A WITH RING ABOVE
0x212b 14 2 26 26 ;ANGSTROM SIGN
0x0105 14 2 27 2 ;LATIN SMALL LETTER A WITH OGONEK
0x0104 14 2 27 18 ;LATIN CAPITAL LETTER A WITH OGONEK

With Normalization Form D they will have their uppercase forms precede their lowercase ones (since the sort fixes the base characters), while in Form C (and the fullwidth case Mihai mentioned) they won't.
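Here is a rough Python model of why mixing the two sets gives Mihai's results. It is a sketch under stated assumptions, not the Windows implementation: characters the technical sort redefines take their weights from the technical listing, and everything else falls through to the rows in the listing above. The assumption that NORM_IGNOREWIDTH amounts to clearing the low bit of the last weight is a guess based only on the table values (3 vs. 2, 19 vs. 18). The `compare` and `weights` functions are hypothetical names.

```python
# Weights are four numbers per character, copied from the listings above.
DEFAULT = {
    "\u0061": (14, 2, 2, 2),    # a
    "\uff41": (14, 2, 2, 3),    # fullwidth a
    "\u0041": (14, 2, 2, 18),   # A
    "\uff21": (14, 2, 2, 19),   # fullwidth A
    "\u01ce": (14, 2, 20, 2),   # a with caron (not in the technical table)
    "\u01cd": (14, 2, 20, 18),  # A with caron (not in the technical table)
}
TECHNICAL_OVERRIDES = {
    "\u0041": (14, 2, 2, 2),    # A now carries the "first" case weight
    "\u0061": (14, 2, 2, 18),   # a now carries the "second" case weight
}

# ASSUMPTION: width is the low bit of the last weight. This is inferred
# from the table values only, not from any documented format.
WIDTH_BIT = 1


def weights(ch, technical):
    """Technical rows win; anything else falls back to the other listing."""
    if technical and ch in TECHNICAL_OVERRIDES:
        return TECHNICAL_OVERRIDES[ch]
    return DEFAULT[ch]


def compare(left, right, technical=True, ignore_width=False):
    """Return '<', '==' or '>' for two single characters (simplified)."""
    lw, rw = weights(left, technical), weights(right, technical)
    if ignore_width:
        lw = lw[:3] + (lw[3] & ~WIDTH_BIT,)
        rw = rw[:3] + (rw[3] & ~WIDTH_BIT,)
    if lw < rw:
        return "<"
    if lw > rw:
        return ">"
    return "=="


WIDE_A, WIDE_CAP_A = "\uff41", "\uff21"
pairs = [(WIDE_A, "a"), (WIDE_A, "A"), (WIDE_CAP_A, "A"), (WIDE_CAP_A, "a")]
for left, right in pairs:
    print(f"U+{ord(left):04X}", compare(left, right),
          f"U+{ord(right):04X}", "| ignoring width:",
          compare(left, right, ignore_width=True))

# Precomposed letters outside the technical table keep lowercase first,
# even though plain A now sorts ahead of plain a.
print(compare("A", "a"), compare("\u01cd", "\u01ce"))
```

Under these assumptions the model gives the same eight results Mihai reported, and it also shows the precomposed caron letters keeping lowercase first while plain A and a are flipped.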

The Microsoft collation story fails pretty firmly in such cases, where you are not in the default table and you color outside the lines drawn for the particular scenario of the language.

Is this a bug?

I mean, the design goals of universal consistency pretty much stop with the default table - beyond that, such anomalies seem kind of by design.

But should that be the case? Fixing this would require either a design change or a huge explosion of the table size and a ton of testing either way. And all to support a bunch of scenarios which are undeniably not required by normal users.

That's a tough sell just to fix up the a-ness of Windows collation....

Mihai on 10 Mar 2010 10:38 AM:

Bug or not, I don't know.

One man's bug is another man's feature :-)

It probably doesn't affect anyone.

But somehow I would expect that when one uses NORM_IGNOREWIDTH then ａ == a and Ａ == A no matter the language.

Worth fixing for the 3 Hungarians that might be bothered by this once in 2 years? Probably not :-)

Michael S. Kaplan on 10 Mar 2010 6:02 PM:

Of course this bug is one example; there are many others. It is an interesting question, one I'd be tempted to take up if I still worked in the area, as I have some thoughts on how it could be approached....

