If the shoe [best-]fits....

by Michael S. Kaplan, published on 2005/02/13 10:33 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/13/371895.aspx

When you call the WideCharToMultiByte API with almost any code page¹, the set of characters that can be represented in the target code page is always going to be smaller than what Unicode can represent. When a character has no exact mapping in the target, one of two things happens:

If you did not pass the WC_NO_BEST_FIT_CHARS flag and there is a "best fit" mapping, then the best fit mapping will happen.
If you did pass the WC_NO_BEST_FIT_CHARS flag or if there is no "best fit" mapping, then the default character will be placed in the target.
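
The two cases above can be sketched as a toy model. This is not the Win32 API: `to_code_page`, `REPRESENTABLE` and `BEST_FIT` are hypothetical names, the target is assumed to hold only ASCII, and the best-fit table contains just the examples mentioned in this post.

```python
# Toy model of the decision described above; NOT the real Win32 API.
# REPRESENTABLE and BEST_FIT are tiny hypothetical tables for illustration.
REPRESENTABLE = {chr(c) for c in range(0x80)}  # pretend the target holds only ASCII
BEST_FIT = {"\u221e": "8", "\u00c5": "A", "\u00c6": "A"}  # examples from this post


def to_code_page(text, no_best_fit=False, default_char="?"):
    """Convert text using the two rules listed above."""
    out = []
    for ch in text:
        if ch in REPRESENTABLE:
            out.append(ch)               # an exact mapping exists
        elif not no_best_fit and ch in BEST_FIT:
            out.append(BEST_FIT[ch])     # first case: the best fit happens
        else:
            out.append(default_char)     # second case: the default character
    return "".join(out)


sample = "\u221e \u00c5\u4e2d"           # infinity, space, A-ring, a CJK ideograph
print(to_code_page(sample))                    # 8 A?
print(to_code_page(sample, no_best_fit=True))  # ? ??
```

The `no_best_fit` parameter stands in for WC_NO_BEST_FIT_CHARS, and `default_char` for the default character that lands in the target.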

But what is a best fit mapping?

Well, there is really little more than a warning in the Platform SDK:

For strings that require validation, such as file, resource and user names, always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme e.g., symbol for ‘∞’ (infinity) maps to 8 (eight) in some code pages.

This hints at the extremes to which these best fit mappings can take us (unless you are one of those who feel that the infinity sign is just a hungover digit eight that has fallen and cannot get up -- in which case the mapping only goes in one direction).

What do these two behaviors have in common? Well, in both cases information has been lost -- whether you replace with the wrong character or a question mark, you are always losing a little bit of data. The best fit mappings are also pretty uneven. Here are some quick approximate counts:

874 -- 138 characters
1250 -- 437 characters
1251 -- 384 characters
1252 -- 442 characters
1253 -- 366 characters
1254 -- 438 characters
1255 -- 96 characters
1256 -- 288 characters
1257 -- 94 characters
1258 -- 94 characters

The above just notes that for single-byte code pages the MBTABLE always has 256 entries and the WCTABLE has more. The DBCS code pages are a bit tougher to do since they are designed differently. But I think the above shows that the actual number of best-fit entries varies from code page to code page.
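
One way such counts could be derived is sketched below. The tables are hypothetical miniatures, not real code page data. The sketch also assumes that a best-fit entry is a WCTABLE entry whose byte does not decode back to the same character through MBTABLE.

```python
# Hypothetical miniature tables in the spirit of MBTABLE / WCTABLE.
# MBTABLE: byte -> Unicode character (a real single-byte table has 256 entries).
MBTABLE = {0x38: "8", 0x3F: "?", 0x41: "A", 0x42: "B"}
# WCTABLE: Unicode character -> byte; the reverse of MBTABLE plus extra entries.
WCTABLE = {
    "8": 0x38, "?": 0x3F, "A": 0x41, "B": 0x42,
    "\u221e": 0x38,   # infinity -> '8'
    "\u00c5": 0x41,   # A with ring above -> 'A'
    "\u00c6": 0x41,   # AE -> 'A'
}


def best_fit_entries(mbtable, wctable):
    """Return WCTABLE entries whose byte does not decode back to the same character."""
    return {ch: b for ch, b in wctable.items() if mbtable.get(b) != ch}


extras = best_fit_entries(MBTABLE, WCTABLE)
print(len(extras))                  # 3
print(len(WCTABLE) - len(MBTABLE))  # 3, the same count by simple subtraction
```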

Some of the entries even make sense (e.g. 1256 does not have Arabic digits in it, so those digits are best-fit mapped to ASCII 0 to 9 -- beats question marks any day!).

Other entries are just funny (like the infinity turns to eight thing -- the next time someone tells you it's just an eight-hour work day you will know why it seems to take forever!).

But most fall in between -- arguably better than nothing.
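
To see what "nothing" looks like, and why either outcome loses information, here is a small sketch using Python's built-in cp1250 codec. That codec applies no best-fit mappings, so with `errors="replace"` it behaves like the default character case. The best-fit result is written out by hand from what this post reports for code page 1250, since Python does not reproduce it.

```python
originals = ["\u00c5", "\u00c6"]   # A with ring above and AE; neither is in cp1250

# Default-character behavior: anything unmappable becomes b'?'.
replaced = [s.encode("cp1250", errors="replace") for s in originals]
print(replaced)                    # [b'?', b'?']

# Best-fit behavior as reported in this post (both map to "A"), hard-coded here.
best_fit = [b"A", b"A"]

# Either way, decoding the bytes does not give back what we started with,
# and the two distinct inputs have collapsed into one output.
for encoded in (replaced, best_fit):
    round_trip = [b.decode("cp1250") for b in encoded]
    print(round_trip, round_trip == originals)
```

Both loops end in `False`: a question mark and a plausible-looking "A" are equally unable to recover the original characters.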

Perhaps that is what they should have been called (rather than "best fit" mappings) -- "better than nothing fit" mappings. Seems like the real mappings are the ones that are the best fit. :-)

1 - Pretty much all of them other than UTF-7, UTF-8, and GB18030, in fact.

This post sponsored by "Å" and "Æ" (U+00c5 and U+00c6, a.k.a. LATIN CAPITAL LETTER A WITH RING ABOVE and LATIN CAPITAL LETTER AE)
Both of which "better than nothing fit" map to U+0041 (LATIN CAPITAL LETTER A) on code page 1250!

# Avi D on 28 Feb 2009 7:44 PM:

I guess that explains this: http://www.comsecglobal.com/FrameWork/Upload/SQL_Smuggling.pdf (in short, non-quote characters can be best-fit into a quote character, post-validation, leading to SQL Injection in some situations).

Apparently SQL Server calls WideCharToMultiByte without the WC_NO_BEST_FIT_CHARS flag set, when it should be set. I was curious as to how this occurs, now I know :-). (If you can convince MSRC that this should be changed, I'd be thrilled)
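
The validate-then-convert problem described in this comment can be sketched with a toy model. Everything here is hypothetical: `convert` is a stand-in for a narrowing conversion rather than the Win32 API, `is_safe` is a deliberately naive validator, and the table assumes that U+02BC (MODIFIER LETTER APOSTROPHE) best-fits to an ASCII apostrophe on the target code page.

```python
# Assumed best-fit entry: U+02BC (MODIFIER LETTER APOSTROPHE) -> ASCII apostrophe.
BEST_FIT = {"\u02bc": "'"}


def convert(text, no_best_fit=False, default_char="?"):
    """Toy narrowing conversion to ASCII; a stand-in, not the Win32 API."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        elif not no_best_fit and ch in BEST_FIT:
            out.append(BEST_FIT[ch])
        else:
            out.append(default_char)
    return "".join(out)


def is_safe(text):
    """Naive validator: reject anything containing an ASCII quote."""
    return "'" not in text


user_input = "x\u02bc OR 1=1 --"
print(is_safe(user_input))                             # True: no ASCII quote yet
print(is_safe(convert(user_input)))                    # False: a quote appears after validation
print(is_safe(convert(user_input, no_best_fit=True)))  # True: '?' instead of a quote
```

The string passes validation as Unicode and only acquires its quote during conversion, which is why the Platform SDK warning asks for WC_NO_BEST_FIT_CHARS on strings that require validation.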


