by Michael S. Kaplan, published on 2005/04/18 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/18/409095.aspx
I have talked a bunch of times about how strings that are canonically equivalent according to Unicode, and that actually look identical, can exist in the world in different forms.
My uber typical example is comparing U+00e5 to U+0061 U+030a (å to å, or LATIN SMALL LETTER A WITH RING ABOVE to LATIN SMALL LETTER A + COMBINING RING ABOVE). They usually look the same, Unicode says they are the same, Unicode Normalization will convert between them, and even good old FoldString will do a job converting a lot of the cases.
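To make the equivalence concrete, here is a minimal sketch that uses Python's standard unicodedata module rather than any Win32 API. The helper name canonically_equal is made up for illustration; the point is only that the two spellings differ as raw code points but match once both are normalized.

```python
import unicodedata

precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"   # LATIN SMALL LETTER A + COMBINING RING ABOVE

def canonically_equal(left, right):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

print(precomposed == decomposed)                                # False
print(canonically_equal(precomposed, decomposed))               # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```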
So what happens when you use WideCharToMultiByte to convert a string containing one of these sets of characters from Unicode to a code page that contains the letter in question (like 0xe5 on code page 1252 for our friend A-Ring)?
Well, U+00e5 will always convert to 0xe5 with no problems. That one is easy.
However, U+0061 U+030a will not convert by default in the way you might want.
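The same asymmetry is easy to see outside of Windows. The sketch below uses Python's own cp1252 codec as a stand-in; that is an assumption made for illustration, since Python's codec is not WideCharToMultiByte and has no best fit mappings. The helper try_encode is hypothetical.

```python
def try_encode(text, codec="cp1252"):
    """Return the encoded bytes, or None if a code point has no mapping."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

print(try_encode("\u00e5"))   # b'\xe5'
print(try_encode("a\u030a"))  # None, since U+030A is not on code page 1252
```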
If you pass the WC_COMPOSITECHECK flag, then the good news is that it will try to map stuff like U+0061 U+030a to U+00e5. Of course the bad news is that it will use those same old tables that FoldString's MAP_PRECOMPOSED has been using for years, which are missing some of what Unicode normalization defines.
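The idea of composing first and converting second can be sketched like this. Note the swap: this uses full Unicode normalization (NFC) from Python's unicodedata module, not the older Windows composite tables, so it only approximates what the flag is aiming for. The function name compose_then_encode is invented for the example.

```python
import unicodedata

def compose_then_encode(text, codec="cp1252"):
    """Compose with NFC, then convert; unmappable code points become '?'."""
    composed = unicodedata.normalize("NFC", text)
    return composed.encode(codec, errors="replace")

print(compose_then_encode("a\u030a"))  # b'\xe5'
```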
If you pass the WC_DISCARDNS flag along with WC_COMPOSITECHECK, then non-spacing characters that do not find composite characters will be discarded. Since that table does not have all of the mappings in it, you may be losing some information when you pass this flag.
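A rough model of the information loss, again with Python's unicodedata standing in for the Windows tables (an assumption, and one that will compose more cases than the real flag does): compose what can be composed, then drop any non-spacing mark that is left over. The function name is hypothetical.

```python
import unicodedata

def compose_and_discard_nonspacing(text):
    """Compose with NFC, then drop remaining non-spacing marks (category Mn)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if unicodedata.category(ch) != "Mn")

# a + ring composes to U+00E5, so nothing is lost.
print(compose_and_discard_nonspacing("a\u030a").encode("cp1252"))  # b'\xe5'
# x + ring has no precomposed form, so the ring is silently thrown away.
print(compose_and_discard_nonspacing("x\u030a").encode("cp1252"))  # b'x'
```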
If you pass the WC_DEFAULTCHAR flag along with WC_COMPOSITECHECK, then anything not on the code page (even as a best fit mapping) will be replaced with the default character (usually the question mark).
If you pass the WC_NO_BEST_FIT_CHARS flag, then no best fit mappings (described further in "If the shoe [best-]fits...." and "BestBetter than nothing fit mappings, unleashed, #1") will happen. It is particularly relevant to our U+0061 U+030a case, since U+030a has a best fit mapping to U+00b0 (a.k.a. DEGREE SIGN). While one could argue endlessly about whether a° is better than a?, no one tends to disagree that just passing WC_COMPOSITECHECK and getting your å is better than both of those options. Even if it is a little slower.
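The a° versus a? choice can be modeled with a toy table. Everything here is illustrative: BEST_FIT is a hypothetical one-entry table holding only the U+030A to U+00B0 mapping mentioned above, and Python's cp1252 codec with errors="replace" plays the part of the default character.

```python
# Hypothetical one-entry best fit table (the real tables are much larger).
BEST_FIT = {"\u030a": "\u00b0"}  # COMBINING RING ABOVE -> DEGREE SIGN

def encode_uncomposed(text, best_fit=True, codec="cp1252"):
    """Convert without composing, with or without the toy best fit table."""
    if best_fit:
        text = "".join(BEST_FIT.get(ch, ch) for ch in text)
    # Anything still unmappable becomes '?', standing in for the default character.
    return text.encode(codec, errors="replace")

print(encode_uncomposed("a\u030a", best_fit=True))   # b'a\xb0'
print(encode_uncomposed("a\u030a", best_fit=False))  # b'a?'
```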
So you probably always want the WC_COMPOSITECHECK flag, just to have the best chance of getting the right mapping. In theory the incompleteness of the mappings it will support is a problem, but in practice most of them will not be on the code page anyway; you may be more likely to get a mapping (even if it is just a "best fit" one) with those incomplete tables. Since the code pages will never again change from version to version, and those composite mappings have seen improvement from time to time....
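For anyone who wants to try the flags on a Windows machine, here is an untested sketch that calls WideCharToMultiByte through Python's ctypes. The flag values are the ones I recall from winnls.h and should be checked against the SDK headers; the wrapper name wide_to_multibyte is made up, and on any other platform it simply returns None.

```python
import ctypes
import sys

# Flag values as recalled from winnls.h (assumption: verify against your SDK).
WC_DISCARDNS = 0x00000010
WC_DEFAULTCHAR = 0x00000040
WC_COMPOSITECHECK = 0x00000200
WC_NO_BEST_FIT_CHARS = 0x00000400

def wide_to_multibyte(text, code_page=1252, flags=WC_COMPOSITECHECK):
    """Hypothetical wrapper: convert text to code_page, or None off Windows."""
    if sys.platform != "win32":
        return None
    convert = ctypes.windll.kernel32.WideCharToMultiByte
    convert.argtypes = [
        ctypes.c_uint, ctypes.c_uint32, ctypes.c_wchar_p, ctypes.c_int,
        ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p,
        ctypes.POINTER(ctypes.c_int),
    ]
    convert.restype = ctypes.c_int
    # Length in UTF-16 code units, which is what the API counts.
    units = len(text.encode("utf-16-le")) // 2
    if units == 0:
        return b""
    # First call asks how many bytes the result needs.
    needed = convert(code_page, flags, text, units, None, 0, None, None)
    if needed == 0:
        raise ctypes.WinError()
    buffer = ctypes.create_string_buffer(needed)
    written = convert(code_page, flags, text, units, buffer, needed, None, None)
    if written == 0:
        raise ctypes.WinError()
    return buffer.raw[:written]

print(wide_to_multibyte("a\u030a"))                           # with composition
print(wide_to_multibyte("a\u030a", flags=0))                  # default behavior
print(wide_to_multibyte("a\u030a", flags=WC_NO_BEST_FIT_CHARS))
```

Comparing the three printed results on a real system is the quickest way to see which of å, a° or a? each flag combination gives you.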
Makes a person just wish they had stayed in Unicode all along, don't it? :-)
This post brought to you by "°" (U+00b0, a.k.a. DEGREE SIGN)