by Michael S. Kaplan, published on 2005/01/08 14:34 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/08/349230.aspx
A few days ago I mentioned the new compiler error C4819 for C/C++. When I did so, I quoted the meaning of the error:
C4819 occurs when an ANSI source file is compiled on a system with a codepage that cannot represent all characters in the file.
A few people asked me how this was being detected.
There are many ways to do it, but the easiest is to call the MultiByteToWideChar API, using CP_ACP as the CodePage parameter and the MB_ERR_INVALID_CHARS flag.
Any time a byte value that is not part of the legal mapping in the codepage is found, the API will fail with a GetLastError return value of ERROR_NO_UNICODE_TRANSLATION.
It is important to note that this functionality is much more akin to that of the spellchecker in Microsoft Word than the thesaurus, in that it has no chance of detecting byte values that are valid for the code page but that make no sense.
Therefore, compiling a source file that was saved in code page 1253 and contains the strings L"Ελλάδα" and L"ελληνικά" on a machine with code page 1252 as its default will simply cause the compiler to assume you meant L"ÅëëÜäá" and L"åëëçíéêÜ".
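The spellchecker point can be seen in a few lines. This is a minimal sketch in Python rather than C, and an analogy rather than what the compiler actually does: a strict `bytes.decode` stands in for MultiByteToWideChar with MB_ERR_INVALID_CHARS, Python's cp1252 table is not guaranteed to match the Windows NLS table byte for byte, and the helper name `check_ansi_bytes` is made up for illustration.

```python
def check_ansi_bytes(data, codepage="cp1252"):
    """Strictly decode data in the given code page.

    Returns (text, None) on success, or (None, offset) where offset is the
    index of the first byte that has no mapping in the code page.
    """
    try:
        return data.decode(codepage, errors="strict"), None
    except UnicodeDecodeError as exc:
        return None, exc.start


# Greek text saved in code page 1253: every byte is also a legal cp1252
# byte, so the strict decode succeeds and quietly yields the wrong string.
greek_bytes = "Ελλάδα".encode("cp1253")
print(check_ansi_bytes(greek_bytes))

# 0x8d has no mapping in Python's cp1252 table, so this input is rejected.
print(check_ansi_bytes(b"abc\x8d"))
```

The first call reports no problem at all, which is exactly the trouble: the check can only flag bytes that are illegal, not bytes that are legal but meaningless.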
The only time you will see the error is when a byte value is not a valid one, as per the tables listed at the Code Pages Supported by Windows site. The invalid values are the shaded cells in those tables, such as 0x8d and 0x8f in code page 1252, 0x8e or 0x90 in code page 1255, or 0x80 in code page 932.
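To get a feel for how few of these holes there are, one can simply try every byte value. The sketch below is an assumption-laden stand-in: it walks Python's own codec tables, which may differ from the Windows tables in places, and the function name `unmapped_bytes` is hypothetical. It only makes sense for single-byte code pages, because in a multibyte code page such as 932 a lone lead byte fails to decode for a different reason.

```python
def unmapped_bytes(codepage):
    """Return the byte values (0-255) that fail a strict decode on their own."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(codepage, errors="strict")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Print the holes in two single-byte code pages mentioned above.
for codepage in ("cp1252", "cp1255"):
    print(codepage, [hex(value) for value in unmapped_bytes(codepage)])
```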
As most of these code page tables are nearly full, it is quite possible to fool the compiler (or more accurately to fool the NLS API; I hate to blame an innocent compiler for an error it could not detect!) into thinking the string is perfectly valid even if it is essentially crap like L"éùøàì" / L"òáøéú" for L"ישראל" / L"עברית" (cp1255). Or L"ÇáããáßÉ ÇáÚÑÈíÉ ÇáÓÚæÏíÉ" / L"ÇáÚÑÈíÉ" for L"المملكة العربية السعودية" / L"العربية" (cp1256).
And all of the examples I gave assumed you had a machine with a CP_ACP value of 1252. The same problems can be seen in any cross-codepage situation, such as returning L"πσρρκθι" when what was meant was L"русский" (cp1253 rather than cp1251). Or L"¤¤¤å(ÁcÅé)" rather than L"中文(繁體)" (cp1254 rather than cp950).
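These substitutions are easy to reproduce by encoding the intended text in the code page the file was saved in and decoding the resulting bytes in the code page of the build machine. The sketch below assumes Python's codecs as a stand-in for the Windows code page tables; `reinterpret` is a made-up helper name.

```python
def reinterpret(text, saved_as, read_as):
    """Encode text in one code page, then decode the same bytes in another."""
    return text.encode(saved_as).decode(read_as)


# (intended text, code page of the source file, CP_ACP of the build machine)
cases = [
    ("русский", "cp1251", "cp1253"),
    ("ישראל", "cp1255", "cp1252"),
    ("中文(繁體)", "cp950", "cp1254"),
]

for intended, saved_as, read_as in cases:
    print(intended, "->", reinterpret(intended, saved_as, read_as))
```

None of these conversions raises an error, since every byte involved happens to be valid in the second code page.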
I could go on but you probably get the point; detection really has to rely on the invalid byte values, and some code pages (like 1252) do not have very many. Saving the file as a Unicode (meaning UTF-16LE) file might be the best way to avoid the potential bugs that come up later with these nonsense strings being propagated to your application.
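For anyone who wants to script that re-save rather than use an editor, here is one possible sketch. It assumes you already know which code page the file was really written in (guess wrong and the nonsense gets baked in permanently), and the function name `resave_as_utf16le` and its parameters are hypothetical.

```python
import codecs
from pathlib import Path


def resave_as_utf16le(src, dst, source_codepage):
    """Rewrite src (encoded in source_codepage) as UTF-16LE with a BOM."""
    # Strict decoding raises UnicodeDecodeError if a byte has no mapping
    # in source_codepage, rather than silently substituting something.
    text = Path(src).read_bytes().decode(source_codepage, errors="strict")
    # Working with bytes on both sides leaves line endings untouched.
    Path(dst).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
```

A call such as `resave_as_utf16le("greek.cpp", "greek_utf16.cpp", "cp1253")` (file names invented for the example) would produce a file whose contents no longer depend on the default code page of whatever machine reads it.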
This post brought to you by "€" (U+20ac, a.k.a. EURO SIGN)
Note that U+20ac does not exist on code page 932!
# handan on 15 Jul 2008 11:25 PM:
IE6 and IE7 display a JavaScript error: invalid characters. Why? Can you tell me? Thanks!