All right, mistakes were made #2 (What the %#$* is wrong with German Phonebook sorting?)

by Michael S. Kaplan, published on 2007/05/05 15:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/05/05/2430935.aspx


(Apologies once again for the Dogma/Carlin allusion in the title)

I'll start by posting the moral of the story first: Always comment your code so that others know what you were thinking!

Continuing on from the prior post -- All right, mistakes were made #1 (a.k.a. Expanding the EXPANSION table), and of course alluding to What the %#$* is wrong with German sorting? which itself alludes to the South Park movie....

I had just talked about how wrong I was (don't let the prior talk about other people confuse anyone -- I blame myself for the regression here).

Back in the beginning of April, Sven Harazim asked in the microsoft.public.win32.programmer.international newsgroup in a post entitled CompareString / String.Compare works different on XP - Vista with German "Umlauts" öäüÖÄÜ:

Hello!

From Windows9x up do XP the following code returns 1 (strings are
different)

Win32
CompareString(LOCALE_USER_DEFAULT, 0, 'HÜBNER',
Length('HÜBNER'),'HUEBNER', Length('HUEBNER'))

.NET
String.Compare("HÜBNER", "HUEBNER")

Under Vista it returns CSTR_EQUAL

Why?

What's different to Vista.

Sven Harazim

Now the German phonebook sort has its own special rules that make it different from all of the other sorts, in a way that I honestly never knew about until Björn Rettig explained to me why his name was being spelled Bjoern Rettig in the address book that was at the time keeping itself limited to ASCII (and first talked about here in Dere are qvestions? In zat case...). Basically, it sets up the following six equivalences:

Ä ≈ AE

ä ≈ ae

Ö ≈ OE

ö ≈ oe

Ü ≈ UE

ü ≈ ue
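As a rough illustration of what those six equivalences do to a comparison, here is a minimal Python sketch. It is not how Windows implements sorting: the `PHONEBOOK_EXPANSIONS` table and the `expand` and `compare` helpers are hypothetical names made up for this example, and the sketch ignores the weight levels a real collation uses, falling back to plain code point order.

```python
# Hypothetical illustration only; not the actual Windows sorting tables.
PHONEBOOK_EXPANSIONS = {
    "Ä": "AE", "ä": "ae",
    "Ö": "OE", "ö": "oe",
    "Ü": "UE", "ü": "ue",
}

def expand(text, expansions):
    """Replace each character that has an expansion with its two-letter form."""
    return "".join(expansions.get(ch, ch) for ch in text)

def compare(a, b, expansions):
    """Return -1, 0, or 1 after expanding both strings (code point order otherwise)."""
    ka, kb = expand(a, expansions), expand(b, expansions)
    return (ka > kb) - (ka < kb)

print(expand("Björn", PHONEBOOK_EXPANSIONS))                # Bjoern
print(compare("HÜBNER", "HUEBNER", PHONEBOOK_EXPANSIONS))   # 0 (treated as equal)
print(compare("HÜBNER", "HUEBNER", {}))                     # 1 (different)
```

With the expansions applied the two spellings compare as equal, which is what the phonebook sort wants; without them they are simply different strings.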

You may see where I am going with this now.

All of those cases where individual locales were overriding pointers to the EXPANSION table, and all that thinking about how in the future we might want to support adding EXPANSION entries in specific locales, it never occurred to any of us (and those who did it originally never remembered!) that one could simply add a pointer in a specific locale's EXCEPTION table just as easily as a different locale might remove one....

Now for the first four entries of the list above for the German phonebook sort, there were already EXPANSION entries for Æ/æ/Œ/œ to become AE/ae/OE/oe, so it was easy to point Ä/ä/Ö/ö to the same entries in the phonebook sort's EXCEPTION table.

The last two entries were slightly harder since there is no UE LIGATURE in Unicode....

The solution? The one that was used on every prior version of Windows and every version of the .NET Framework?

Simple -- add the two EXPANSION table entries to turn Ü/ü into UE/ue, but then don't point to them in the default table. Only point to them in the EXCEPTION table for the German phonebook sort.

Oh, wait a minute -- those pointers were being generated at build time now, and thus they were being added to the DEFAULT table. And causing Ü to look like UE and ü to look like ue, worldwide. And hitting that problem that Sven reported.
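To make the mechanism concrete, here is a small Python model of the idea: shared EXPANSION entries, a default table that points at only some of them, and a per-locale EXCEPTION table that can point at the rest. The data layout and every name in it (`INTENDED_DEFAULT`, `generated_default`, `sort_form`, `equal`) are assumptions made for illustration and do not reflect the real table format; `de-DE_phoneb` is used only as a label for the German phonebook sort.

```python
# Hypothetical model of the tables described above; layout and names are assumptions.
EXPANSION = {
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe",
    "Ü": "UE", "ü": "ue",  # entries that exist only for the phonebook sort
}

# Intended design: the default table points at the ligature entries only.
INTENDED_DEFAULT = {ch: ch for ch in "ÆæŒœ"}

# The phonebook sort's exception table reuses the ligature entries and
# is the only place that points at the Ü/ü entries.
EXCEPTIONS = {
    "de-DE_phoneb": {"Ä": "Æ", "ä": "æ", "Ö": "Œ", "ö": "œ", "Ü": "Ü", "ü": "ü"},
}

def generated_default():
    """Model of the regression: every EXPANSION entry gets a default pointer."""
    return {ch: ch for ch in EXPANSION}

def sort_form(text, default, locale=None):
    """Expand each character, letting the locale's exceptions override the default."""
    overrides = EXCEPTIONS.get(locale, {})
    out = []
    for ch in text:
        entry = overrides.get(ch, default.get(ch))
        out.append(EXPANSION[entry] if entry is not None else ch)
    return "".join(out)

def equal(a, b, default, locale=None):
    return sort_form(a, default, locale) == sort_form(b, default, locale)

print(equal("HÜBNER", "HUEBNER", INTENDED_DEFAULT))                  # False
print(equal("HÜBNER", "HUEBNER", INTENDED_DEFAULT, "de-DE_phoneb"))  # True
print(equal("HÜBNER", "HUEBNER", generated_default()))               # True (the bug)
```

In this model the only difference between the intended behavior and the regression is which table holds the pointers for Ü and ü, which is why nothing looked wrong until someone compared strings outside the phonebook sort.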

Turns out that two of those cases that were "missed" were actually not missed at all -- they were left out intentionally, but everyone forgot that they were, or why. There were no comments, no spec, and no document to remind people about the clever little solution that we had just unintentionally undone. :-(

Oops.

We are all really sorry about that.

Though the story will have a happy ending -- in the form of a fix for both Longhorn Server and Vista SP1. And special thanks to Sven Harazim (with additional thanks to others who participated in the newsgroup thread), who helped us notice and subsequently work to fix the problem!

I learned a lot through the whole exercise, as did others in development, program management, and testing. And I learned as much from my mistakes here as I did from the mistakes of others (though I admit my own mistakes are more embarrassing to me personally!).

Though now at least comments were added to make sure that this particular "feature" is never lost again! :-)

 

This post brought to you by Œ(U+0152, a.k.a. LATIN CAPITAL LIGATURE OE)


# Rolf Frei on 24 Dec 2007 10:23 AM:

I have installed SP1 RC1 on my Vista machine and it looks as if this is still not fixed in SP1.

CompareString(LOCALE_USER_DEFAULT, 0, 'HÜBNER',

   Length('HÜBNER'),'HUEBNER', Length('HUEBNER'))

still returns CSTR_EQUAL, which is wrong.

# Michael S. Kaplan on 24 Dec 2007 11:45 AM:

Look at the link in the second comment here -- it is a trackback to the post that explains what is going on here..... or alternately just click here....



referenced by

2008/02/19 Insanity defined: In the real world -0 == 0, in Vista -0 < 0, and in Windows Server 2008 -0 ≮ 0

2007/09/15 A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)

2007/09/08 2001, a Correctness Odyssey (aka What's the matter with Ü?)

2007/05/17 If a bunch of specific Unicode characters can no longer live in the same apartment together, can they really claim that they needed their space?
