by Michael S. Kaplan, published on 2005/02/13 10:33 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/13/371895.aspx
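When you call the WideCharToMultiByte API with almost any code page [1], the set of characters that the target code page can represent is always going to be smaller than what Unicode can represent. When a character has no exact equivalent on the target code page, one of two things happens: it is replaced with a default character (usually a question mark), or it is converted using a "best fit" mapping.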
But what is a best fit mapping?
Well, there is really little more than a warning in the Platform SDK:
For strings that require validation, such as file, resource and user names, always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme e.g., symbol for ‘∞’ (infinity) maps to 8 (eight) in some code pages.
This hints at the extremes to which these best fit mappings can take us (unless you are one of those who feel that the infinity sign is just a hungover digit eight that has fallen and cannot get up -- in which case the mapping only goes in one direction).
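To make the difference concrete, here is a toy model in Python. It is only a sketch and it does not call the Win32 API: Python's own codecs never apply best fit mappings, so the BEST_FIT table is hand-built from the examples mentioned in this post, and the convert function, its no_best_fit parameter, and its "lossy" result are all made-up names for illustration.

```python
# Toy model only. BEST_FIT is a hypothetical table assembled from examples
# in this post; it is not the real Windows best fit data.
BEST_FIT = {
    "\u221e": "8",  # INFINITY -> DIGIT EIGHT ("in some code pages", per the SDK)
    "\u00c5": "A",  # LATIN CAPITAL LETTER A WITH RING ABOVE (code page 1250)
    "\u00c6": "A",  # LATIN CAPITAL LETTER AE (code page 1250)
}


def convert(text, codec, no_best_fit=False, default_char="?"):
    """Return (bytes, lossy). lossy is True if any character lacked an
    exact mapping, whether it was best fit or replaced by default_char."""
    out = bytearray()
    lossy = False
    for ch in text:
        try:
            out += ch.encode(codec)  # exact mapping exists on the code page
            continue
        except UnicodeEncodeError:
            lossy = True
        if not no_best_fit and ch in BEST_FIT:
            out += BEST_FIT[ch].encode(codec)  # similar looking substitute
        else:
            out += default_char.encode(codec)  # the question mark route
    return bytes(out), lossy


print(convert("\u221e", "cp1250"))                    # (b'8', True)
print(convert("\u221e", "cp1250", no_best_fit=True))  # (b'?', True)
print(convert("8", "cp1250"))                         # (b'8', False)
```

For a string that requires validation, the point is that b'8' on its own cannot tell you whether the input was a real eight. In this sketch the lossy result carries that information, so a caller can reject the input instead of trusting the converted bytes.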
What do these two behaviors have in common? Well, in both cases information has been lost -- whether you replace with the wrong character or a question mark, you are always losing a little bit of data. The best fit mappings are also pretty uneven. Here are some quick approximate counts:
The above just notes that for single-byte code pages the MBTABLE always has 256 entries and the WCTABLE has more. The DBCS code pages are a bit tougher to do since they are designed differently. But I think the above shows that the actual number of best-fit entries varies from code page to code page.
Some of the entries even make sense (e.g. 1256 does not have Arabic digits in it, so those digits are best-fit mapped to ASCII 0 to 9 -- beats question marks any day!).
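The digit case is easy to reproduce outside of Windows. The sketch below assumes Python's cp1256 codec, which has no best fit behavior of its own, and uses a made-up helper named fold_digits that leans on the Unicode decimal digit property to do by hand what the best fit mapping does for these characters.

```python
import unicodedata


def fold_digits(text):
    """Replace every Unicode decimal digit with its ASCII counterpart and
    leave all other characters untouched."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        out.append(ch if value is None else str(value))
    return "".join(out)


arabic_digits = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

try:
    arabic_digits.encode("cp1256")
except UnicodeEncodeError:
    print("no exact mapping on cp1256")

print(fold_digits(arabic_digits).encode("cp1256"))  # b'123'
```

Unlike the infinity case, nothing about the value of the number changes here, which is why this particular mapping is easy to defend.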
Other entries are just funny (like the infinity turns to eight thing -- the next time someone tells you it's just an eight-hour work day you will know why it seems to take forever!).
But most fall in between -- arguably better than nothing.
Perhaps that is what they should have been called (rather than "best fit" mappings) -- "better than nothing fit" mappings. Seems like the real mappings are the ones that are the best fit. :-)
1 - Pretty much all of them other than UTF-7, UTF-8, and GB18030, in fact.
This post sponsored by "Å" and "Æ" (U+00c5 and U+00c6, a.k.a. LATIN CAPITAL LETTER A WITH RING ABOVE and LATIN CAPITAL LETTER AE)
Both of which "better than nothing fit" map to U+0041 (LATIN CAPITAL LETTER A) on code page 1250!
# Avi D on 28 Feb 2009 7:44 PM:
I guess that explains this: http://www.comsecglobal.com/FrameWork/Upload/SQL_Smuggling.pdf (in short, non-quote characters can be best-fit into a quote character, post-validation, leading to SQL Injection in some situations).
Apparently SQL Server calls WideCharToMultiByte with the WC_NO_BEST_FIT_CHARS flag set, when it shouldn't be. I was curious as to how this occurs, now I know :-). (If you can convince MSRC that this should be changed, I'd be thrilled)