A few of the gotchas of WideCharToMultiByte

by Michael S. Kaplan, published on 2005/04/18 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/18/409095.aspx

I have talked a bunch of times about how strings that are canonically equivalent according to Unicode, and that actually look identical visually, can exist in the world in more than one form.

My usual example is comparing U+00e5 to U+0061 U+030a (å to å, or LATIN SMALL LETTER A WITH RING ABOVE to LATIN SMALL LETTER A + COMBINING RING ABOVE). They usually look the same, Unicode says they are the same, Unicode Normalization will convert between them, and even good old FoldString will handle a lot of the cases.
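To make the equivalence concrete, here is a minimal sketch in Python, using the standard `unicodedata` module rather than any Windows API. The helper name `canonically_equivalent` is made up for illustration.

```python
import unicodedata

precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"   # LATIN SMALL LETTER A + COMBINING RING ABOVE

def canonically_equivalent(s, t):
    """Hypothetical helper: True when both strings share the same NFD form."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# A plain code-point comparison sees two different strings...
print(precomposed == decomposed)                         # False
# ...but after normalization they compare equal.
print(canonically_equivalent(precomposed, decomposed))   # True
# NFC turns the two-code-point form into the single precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```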

So what happens when you use WideCharToMultiByte to convert a string containing one of these sets of characters from Unicode to a code page that contains the letter in question (like 0xe5 on code page 1252 for our friend A-Ring)?

Well, U+00e5 will always convert to 0xe5 with no problems. That one is easy.

However, U+0061 U+030a will not convert by default in the way you might want.
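The same asymmetry can be seen outside Windows. The sketch below uses Python's built-in `cp1252` codec, which is not WideCharToMultiByte: it has no best fit mappings and no composite check, so the exact bytes differ from what the API gives you. The helper name `to_cp1252` is an assumption for illustration.

```python
precomposed = "\u00e5"   # single code point, present on code page 1252
decomposed = "a\u030a"   # base letter plus COMBINING RING ABOVE

def to_cp1252(text, errors="strict"):
    """Hypothetical helper: encode text with Python's cp1252 codec."""
    return text.encode("cp1252", errors=errors)

# The precomposed form maps straight to the single byte 0xe5.
print(to_cp1252(precomposed))                    # b'\xe5'

# The decomposed form fails: U+030a has no slot on the code page.
try:
    to_cp1252(decomposed)
except UnicodeEncodeError as exc:
    print("cannot encode:", hex(ord(exc.object[exc.start])))  # 0x30a

# Asking for a replacement character keeps the 'a' and loses the ring.
print(to_cp1252(decomposed, errors="replace"))   # b'a?'
```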

If you pass the WC_COMPOSITECHECK flag, then the good news is that it will try to map stuff like U+0061 U+030a to U+00e5. Of course the bad news is that it will use those same old tables that FoldString's MAP_PRECOMPOSED has been using for years, which are missing some of what Unicode normalization defines.

If you pass the WC_DISCARDNS flag along with WC_COMPOSITECHECK, then non-spacing characters that do not find composite characters will be discarded. Since that table does not have all of the mappings in it, you may be losing some information when you pass this flag.

If you pass the WC_DEFAULTCHAR flag along with WC_COMPOSITECHECK, then anything not on the code page (even as a best fit mapping) will be replaced with the default character (usually the question mark).

If you pass the WC_NO_BEST_FIT_CHARS flag, then no best fit mappings (described further in "If the shoe [best-]fits...." and "BestBetter than nothing fit mappings, unleashed, #1") will happen. It is particularly relevant to our U+0061 U+030a case, since U+030a has a best fit mapping to U+00b0 (a.k.a. DEGREE SIGN). While one could argue endlessly about whether a° is better than a?, no one tends to disagree that just passing WC_COMPOSITECHECK and getting your å is better than both of those options. Even if it is a little slower.

So you probably always want the WC_COMPOSITECHECK flag, just to have the best chance of getting the right mapping. In theory the incompleteness of the mappings it will support is a problem, but in practice most of them will not be on the code page anyway; you may be more likely to get a mapping (even if it is just a "best fit" one) with those incomplete tables. Since the code pages will never again change from version to version, and those composite mappings have seen improvement from time to time....
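For comparison, here is the "compose first, then convert" idea sketched with full Unicode normalization in place of the WC_COMPOSITECHECK tables. This is a portable Python analog, not the API's own behavior, and the helper name `encode_composed` is an assumption for illustration.

```python
import unicodedata

def encode_composed(text, codepage="cp1252"):
    """Hypothetical helper: normalize to NFC, then encode to the code page.

    Anything still unmappable after composition becomes '?'.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed.encode(codepage, errors="replace")

# a + COMBINING RING ABOVE composes to U+00e5, which is on the code page.
print(encode_composed("a\u030a"))   # b'\xe5'

# x + COMBINING RING ABOVE has no precomposed form, so the ring is lost.
print(encode_composed("x\u030a"))   # b'x?'
```

The second call shows the limit of any compose-first approach: when no precomposed character exists, or it exists but is not on the target code page, information is still lost in the conversion.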

Makes a person just wish they had stayed in Unicode all along, don't it? :-)


This post brought to you by "°" (U+00b0, a.k.a. DEGREE SIGN)

# Ben Bryant on 18 Apr 2005 11:10 AM:

You have an incredible blog!
Do you run across programs or APIs that generate composite characters? For example, an edit control producing a string containing composite characters? I would hope that most stuff avoids them whenever there is a single character equivalent, even in Japanese characters. Obviously every program trying to display and process the Unicode string does not want to worry about this, right?

# Michael S. Kaplan on 18 Apr 2005 11:23 AM:

Thanks, Ben!

Most of the data on a Windows box will be in Unicode Normalization Form C (basically precomposed form). But anyone could use MSKLC to create a custom keyboard that does not do this, and obviously data coming from other platforms will follow its own rules.

If you are processing data and are unsure of where it is coming from and what form it might be in, passing the right flag can be very helpful. :-)

# Ben Karas on 18 Apr 2005 1:35 PM:

This is wonderful information. I wish MSDN would provide "You almost always want these flags..." in their documentation. Maybe you could poke the editors?

referenced by

2008/09/25 When to make a change, when to stay the same

2005/04/20 Encoding APIs and Security Concerns, APIs and Security Decisions

2005/04/19 A few of the gotchas of MultiByteToWideChar
