The whole truth about MB_PRECOMPOSED and MB_COMPOSITE

by Michael S. Kaplan, published on 2009/05/27 10:01 -04:00, original URI:

As a by the way, this blog does NOT represent anything beyond my own personal thoughts. You could even blame it on my Tegretol dosage, to be perfectly honest (if the pain were not so intense I'd have skipped this med for sure). I am not even on the team that owns this code any more, and I didn't own it back when I was on that team. Just so you know....

Recently when Shawn posted "Don't use MB_COMPOSITE, MB_PRECOMPOSED or WC_COMPOSITECHECK", there were a few things he didn't mention.

For example, there is the fact that in builds of Windows 7 prior to the final release, the behavior is designed to use normalization.

Now of course there are a bunch of cases not in the tables in this Microsoft technology that pre-dates Unicode Normalization by more than half a decade, but that is small change.

And yes, it means that MB_PRECOMPOSED will be a no-op for most code pages, but for Vietnamese it will destroy that special form "V" described in blogs like Getting intermediate forms and Harder intermediate forms of characters and Frost's The Form Not Taken. That destroys round-tripping for Vietnamese, at least.
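To make the Vietnamese case concrete, here is a minimal Python sketch using the standard `unicodedata` module rather than the Windows flags themselves. It assumes that the intermediate form in question keeps the circumflex precomposed on the vowel and carries the tone mark as a separate combining character; the helper `code_points` is hypothetical and only there for display. Under that assumption the input is neither Form C nor Form D, so either normalization changes it.

```python
import unicodedata

def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# Assumption: the intermediate form is e-circumflex (U+00EA) followed by
# COMBINING ACUTE ACCENT (U+0301), i.e. partly precomposed, partly composite.
intermediate = "\u00ea\u0301"

nfc = unicodedata.normalize("NFC", intermediate)  # fully precomposed
nfd = unicodedata.normalize("NFD", intermediate)  # fully decomposed

print("intermediate:", code_points(intermediate))
print("Form C:      ", code_points(nfc))
print("Form D:      ", code_points(nfd))
```

Neither output equals the input, which is the sense in which normalizing in either direction loses the original form.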

But this too is chump change. Someone will care but it might have shipped before anyone noticed.

You know what did get noticed?

I'll give you a hint: MB_COMPOSITE.

And another hint: my own blog "Stripping diacritics....", which uses Unicode Normalization Form D to do its work, just like MB_COMPOSITE would nominally be expected to convert the text to Form D.

You give up? Well, look at the fixed version of that code, found in "Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)".

Converting fully formed modern precomposed Hangul syllables to composite Jamo might be a fine idea if Microsoft had not been de facto enjoined nearly a decade ago from providing the fonts that would allow those conjoining Jamo to be composed (of course buffer sizes would be almost tripled but nothing is perfect!).
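The buffer growth is easy to see with a small Python sketch. It uses the standard `unicodedata` module to apply Form D, as a stand-in for what MB_COMPOSITE would nominally produce, and is not a reproduction of the Windows code path; the helper `utf16_units` is hypothetical and only counts UTF-16 code units.

```python
import unicodedata

def utf16_units(s):
    """Count the UTF-16 code units needed to store the string."""
    return len(s.encode("utf-16-le")) // 2

text = "\ud55c\uae00"  # two precomposed Hangul syllables (U+D55C, U+AE00)
decomposed = unicodedata.normalize("NFD", text)

print(utf16_units(text), "->", utf16_units(decomposed))
print(" ".join(f"U+{ord(ch):04X}" for ch in decomposed))

# Form C puts the syllables back together, but only if something
# downstream actually performs that recomposition.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Each of these two syllables has a leading consonant, a vowel, and a trailing consonant, so each becomes three conjoining Jamo. A syllable with no trailing consonant, such as U+AC00, becomes only two, which is why the overall growth is "almost" rather than exactly triple.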

Given the current situation, however, this led to a disastrous situation.

Now with all that said, nearly every word that Shawn said in "Don't use MB_COMPOSITE, MB_PRECOMPOSED or WC_COMPOSITECHECK" is true. I just felt like the proper context was needed.

Especially in the context of the one part of his blog I really do disagree with:

"Hopefully I've terrified you and you'll stop using these flags, perhaps using NormalizeString() if you really need similar behavior."

Since Unicode Normalization will cause many of the very same problems that these flags cause (e.g. poor roundtripping and especially the problems I note above with both Form C and Form D), scaring people away from the flags and suggesting the use of a function that in many contexts will produce results just as scary is probably not the best idea.
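One well-known instance of the round-tripping problem can be shown with Python's standard `unicodedata` module, used here as a stand-in for NormalizeString() rather than a call to it. U+212B ANGSTROM SIGN has a singleton canonical decomposition, so neither Form C nor Form D preserves it, and no later normalization brings it back.

```python
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN

nfc = unicodedata.normalize("NFC", angstrom)
nfd = unicodedata.normalize("NFD", angstrom)

# Going to Form D and back to Form C does not restore the original either.
round_trip = unicodedata.normalize("NFC", nfd)

for label, value in (("original", angstrom), ("Form C", nfc),
                     ("Form D", nfd), ("D then C", round_trip)):
    print(f"{label:9}", " ".join(f"U+{ord(ch):04X}" for ch in value))
```

If the original text has to survive a trip through another encoding or system unchanged, this kind of silent substitution is exactly what gets in the way.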

It was written at what I have learned is probably not the best point to write a blog: when one is in the process of losing and/or has just lost the argument against a change, since it can color one's argument in a specific direction that under ordinary circumstances one might not choose to do....



This post brought to you by ピ (U+30d4, a.k.a. KATAKANA LETTER PI)

Ted on 15 Apr 2010 11:54 AM:

I never did get around to asking you a question about this blog entry that I had been wondering about for quite some time: since you stated that NormalizeString is not the answer either, what is the answer? Or is the answer to not do anything at all?

Michael S. Kaplan on 15 Apr 2010 12:15 PM:

Usually my recommendation is not to do anything; the few people who are cross-platform have specific requirements and those requirements should guide the actions (e.g. don't go to form D if you need to support Korean)....
