The road to hell is paved with attempts at being compatible

by Michael S. Kaplan, published on 2009/02/04 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/02/04/9394864.aspx

In one of the very first blogs I wrote, I pointed out that Microsoft does not use the Unicode Collation Algorithm.

Believe it or not, at the time some people actually asked me whether I thought I might get in trouble for that blog. Looking at it now I can't even imagine why they would have thought that -- there are so many other blogs that are much more effective at getting me into trouble, after all. I can inspire an almost Pavlovian response with certain topics, which inspire a "this is what I'm talking about" mail to some people.

Anyway....

Microsoft does not use the UCA. In fact, it still does not use the UCA.

There are consequences to this fact - that the collation model whose full time job is to attempt to implement the principles in the Unicode Standard as it sorts is not the one that Microsoft uses. Consequences that pop up at the most unlikely and unexpected times and can knee a guy right in the groin.

Like the other day, when I received from a guy named Ron a mail that was not as shiny and happy as REM imagined in that song of theirs that Michael Stipe hates so much:

This isn't a request for support, and I don't expect a response.

I just wanted to let you know that the hot fix in http://support.microsoft.com/kb/955612 leaves some sort keys broken. I've applied that hot fix and still get broken results for the Unicode code points FE71, FE77, FD79, FE7D, and FE7F.

For example the sort key for FE71 is

00 00 01 00 01 00 dc 01 01 01

I would not have sent this to you except that there's no way to give any feedback on the hot fix page other than to pay $99.00 to speak to a support person.

It really looks like MS doesn't want to find out when its code is broken. Given that I used to work for MS, I find that depressing.

It would be nice to have this fixed in some future release.

Hmmm. I count seven issues that kind of screamed for a response of some sort. I'm gonna try to cover them all.

I'm going to take them out of order, though.

FOURTH OF ALL, the bug. If you compare the sort keys of some of these characters across versions (the first sort key is from XP, the second is from Vista, the third is from Server 2008):

U+fe71 (ARABIC TATWEEL WITH FATHATAN ABOVE)
01 01 01 01 80 07 06 a0 00
40 03 40 fa 01 01 02 12 01 01 00
00 00 01 00 01 00 dc 01 01 01 00

U+fe77 (ARABIC FATHA MEDIAL FORM)
01 01 01 01 80 07 06 a3 00
40 f8 01 00 40 dc 01 02 0d 01 02 00 12 01 01 00
00 00 01 00 01 00 df 01 01 01 00

U+fe7d (ARABIC SHADDAH ON TATWEEL)
ff ff 01 01 01 01 00
40 ea 40 fc 01 01 01 01 00
00 00 01 00 01 00 e3 01 01 01 00

U+fe7f (ARABIC SUKUN MEDIAL FORM)
01 01 01 01 80 07 06 a6 00
40 f2 40 fc 01 01 01 01 00
00 00 01 00 01 00 e2 01 01 01 00

The explanation of each is simple enough -- the first was from that point where many of the characters had weird weights just to try to fit them somewhere since they did not exactly fit in with the one weight per character model.

The second was an attempt to at least put them with the other Arabic characters.

The third was an attempt to be more compatible with Unicode, kind of like the UCA tries to do.

Oops.

It took the documented decompositions from the Unicode Character Database and treated them like Expansions, those things I mentioned in A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons). And it took these kind of complicated compatibility characters, the ones I have been railing against in prior blogs, and tried to kind of rehabilitate them using these documented equivalencies, such as:

U+fe71 --> U+0640 U+064b (ARABIC TATWEEL + ARABIC FATHATAN)

U+fe77 --> U+0640 U+064e (ARABIC TATWEEL + ARABIC FATHAH)

U+fe7d --> U+0640 U+0651 (ARABIC TATWEEL + ARABIC SHADDAH)

U+fe7f --> U+0640 U+0652 (ARABIC TATWEEL + ARABIC SUKUN)
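The equivalencies listed above can be read straight out of the Unicode Character Database. Here is a minimal sketch that does so with Python's standard `unicodedata` module. The helper name `compat_decomposition` is made up for this illustration, and the data comes from whatever Unicode version ships with your Python, not from the Windows tables.

```python
import unicodedata

# Code points taken from the list above.
CODE_POINTS = [0xFE71, 0xFE77, 0xFE7D, 0xFE7F]


def compat_decomposition(cp):
    """Return (tag, [target code points]) from the UCD decomposition field.

    Hypothetical helper for illustration only. The raw field looks like
    '<medial> 0640 064B': an optional compatibility tag in angle brackets
    followed by hex code points.
    """
    parts = unicodedata.decomposition(chr(cp)).split()
    tag = parts[0] if parts and parts[0].startswith("<") else ""
    targets = [int(p, 16) for p in parts if not p.startswith("<")]
    return tag, targets


if __name__ == "__main__":
    for cp in CODE_POINTS:
        tag, targets = compat_decomposition(cp)
        hexes = " ".join("U+%04X" % t for t in targets)
        names = " + ".join(unicodedata.name(chr(t)) for t in targets)
        print("U+%04X %s --> %s (%s)" % (cp, tag, hexes, names))
```

Note that these are compatibility decompositions (they carry a tag such as `<medial>`), not canonical ones, which is part of why treating them as plain expansions is a judgment call rather than a given.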

Now in Microsoft's tables, the TATWEEL is given no weight (ref: You've got to be kashidding me), and the other characters are treated as diacritics. This makes the XP weights just behind the times and the Vista weights really weird, with them being treated as full letters even though they are nominally compatible with things that are either weightless or diacritics.

Thus the first two attempts here sucked (the worst examples of How does Microsoft assign new collation weights?), and the third was a genuine attempt to do the right thing.

Unfortunately, there are at least three problems/limitations with our expansions, and this bug is due to two of them.

You see, in expansions all the usual code that does not fill in values for the weightless characters? Doesn't happen. Plus it does not properly handle combining characters (just like it does not handle compression, as I pointed out in A&P of Sort Keys, part 5 (aka EXPANSIONing your horizons)). And thus between these two problems you have all these NULLs Ron is pointing out.

Oops.
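To see why those NULLs matter, here is a small sketch that scans the three U+FE71 sort keys shown earlier for 0x00 bytes before the final terminator. It assumes that consumers treat a sort key as a NUL-terminated byte string and compare keys with something like `strcmp`. The helper names `embedded_nulls` and `as_c_string` are invented for this illustration and are not part of any Windows API.

```python
# Sort keys for U+FE71, copied from the dumps earlier in this post.
KEYS = {
    "XP": "01 01 01 01 80 07 06 a0 00",
    "Vista": "40 03 40 fa 01 01 02 12 01 01 00",
    "Server 2008": "00 00 01 00 01 00 dc 01 01 01 00",
}


def embedded_nulls(hex_dump):
    """Return the offsets of 0x00 bytes that appear before the last byte."""
    key = bytes.fromhex(hex_dump)
    return [i for i, b in enumerate(key[:-1]) if b == 0]


def as_c_string(hex_dump):
    """Return what a NUL-terminated reader would see (assumed strcmp-style)."""
    return bytes.fromhex(hex_dump).split(b"\x00", 1)[0]


if __name__ == "__main__":
    for version, dump in KEYS.items():
        print(version, embedded_nulls(dump), as_c_string(dump).hex(" "))
```

Under that assumption, the Server 2008 key is effectively empty to any code that stops at the first NUL, so the character would compare equal to nothing at all.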

This bug repros on Windows 7 by the way. Someone should get on that ASAP. Any NLS testers around? :-)

Getting back to the remaining six points in the question, now:

FIRST OF ALL, this blog is not really intended to be a support venue, so SECOND OF ALL just like no one ever expects the Spanish Inquisition, one should never expect a response.

And THIRD OF ALL, the hotfix mentioned in that KB article was for a specific targeted bug. Being unhappy at a heretofore unreported bug not being fixed in it is like being mad that Apple did not provide a patch.

FIFTH OF ALL, since this bug has nothing to do with the hotfix, that would be the wrong place to leave the feedback anyway.

And SIXTH OF ALL, when I consider the notion of a former employee who has no idea where to report a bug and finds that to be depressing, I myself get depressed.

I know that for the next ten years after I leave this company I would know exactly where to send bugs, even if I decided not to send them. :-)

Finally, SEVENTH OF ALL, given that this bug exists in Windows 7, I too think it would be nice if it were fixed in a future version. Hint, hint!

This blog brought to you by the many fine characters mentioned above that have been so consistently mistreated by Windows despite their long-standing existence in Unicode

Iain Clarke on 9 Feb 2009 11:08 AM:

OK, I'll bite - what was SECOND OF ALL?

Iain.

Michael S. Kaplan on 20 Jul 2010 8:22 AM:

On the same line as the FIRST OF ALL. :-)
