How does it detect invalid characters?

by Michael S. Kaplan, published on 2005/01/08 14:34 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/08/349230.aspx


A few days ago I mentioned the new compiler error C4819 for C/C++. When I did so, I quoted the meaning of the error:

C4819 occurs when an ANSI source file is compiled on a system with a codepage that cannot represent all characters in the file.

A few people asked me how this was being detected.

There are many ways to do it, but the easiest is to call the MultiByteToWideChar API, using CP_ACP as the CodePage parameter and the MB_ERR_INVALID_CHARS flag.

Any time a byte value that is not part of the legal mapping in the codepage is found, the API will fail with a GetLastError return value of ERROR_NO_UNICODE_TRANSLATION.
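
For readers who want to try the idea without a Windows build, here is a minimal sketch in Python. It is not the compiler's code: Python's strict `bytes.decode` stands in for `MultiByteToWideChar` called with `MB_ERR_INVALID_CHARS`, the helper name `first_invalid_byte` is made up for illustration, and Python's code page tables may not match the Windows NLS tables byte for byte.

```python
def first_invalid_byte(data: bytes, codepage: str = "cp1252"):
    """Return the offset of the first byte with no mapping in the code page,
    or None if every byte decodes.

    Assumption: Python's strict decoder is used as a stand-in for
    MultiByteToWideChar with MB_ERR_INVALID_CHARS; the tables may differ.
    """
    try:
        data.decode(codepage, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start  # offset at which decoding failed
    return None


# 0x8d is unassigned in Python's cp1252 table, so its offset is reported.
print(first_invalid_byte(b'char c = "\x8d";'))

# Greek text saved as cp1253 consists only of bytes that cp1252 also
# assigns, so nothing is reported even though the text will be misread.
print(first_invalid_byte("Ελλάδα".encode("cp1253")))
```

The second call is the interesting one: a clean result only says that every byte is legal, not that the text means what its author intended.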

It is important to note that this functionality is much more akin to that of the spellchecker in Microsoft Word than the thesaurus, in that it has no chance of detecting byte values that are valid for the code page but that make no sense.

Therefore, attempting to use the strings L"Ελλάδα" and L"ελληνικά" on a machine with code page 1252 as its default will simply cause the compiler to assume you meant L"ÅëëÜäá" and L"åëëçíéêÜ".

The only time you will see the error is when a byte value is not a valid one, as per the tables listed at the Code Pages Supported by Windows site. Examples are the shaded cells in those tables, such as 0x8d and 0x8f in code page 1252, 0x8e or 0x90 in code page 1255, or 0x80 in code page 932.
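
To get a feel for how few invalid values a code page has, one can simply try all 256 byte values. The sketch below does that with Python's codec tables, which is an assumption: they are not the Windows tables and can disagree with them in places. The function name `undefined_bytes` is hypothetical, and the approach only makes sense for single-byte code pages, since in a double-byte code page such as 932 a lone lead byte would be counted as an error too.

```python
def undefined_bytes(codepage: str):
    """List the byte values a single-byte code page leaves unassigned,
    according to Python's codec tables (assumption: the Windows NLS
    tables may differ)."""
    missing = []
    for value in range(0x100):
        try:
            bytes([value]).decode(codepage)  # strict decoding by default
        except UnicodeDecodeError:
            missing.append(value)
    return missing


for cp in ("cp1252", "cp1255"):
    print(cp, [hex(b) for b in undefined_bytes(cp)])
```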

As most of these code page tables are full, it is easily possible to fool the compiler (or more accurately to fool the NLS API; I hate to blame an innocent compiler for an error it could not detect!) into thinking the string is perfectly valid even if it is essentially crap like L"éùøàì" / L"òáøéú" for L"ישראל" / L"עברית" (cp1255). Or L"ÇáããáßÉ ÇáÚÑÈíÉ ÇáÓÚæÏíÉ" / L"ÇáÚÑÈíÉ" for L"المملكة العربية السعودية" / L"العربية" (cp1256).

And all of the examples I gave assumed you had a machine with a CP_ACP value of 1252. The same problems can be seen in any cross-codepage situation, such as returning L"πσρρκθι" when what was meant was L"русский" (cp1253 rather than cp1251). Or L"¤¤¤å(ÁcÅé)" rather than L"中文(繁體)" (cp1254 rather than cp950).
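
The nonsense strings quoted in this post are easy to reproduce: encode the text in the code page the file was saved in, then decode the same bytes with the code page the machine assumes. A small sketch in Python follows; the helper name `misread` is made up for illustration, and Python's codecs are used here rather than the NLS API.

```python
def misread(text: str, saved_as: str, read_as: str) -> str:
    """Encode text in the code page the file was saved in, then decode
    the same bytes with the code page the reader assumes."""
    return text.encode(saved_as).decode(read_as)


# Greek saved as cp1253, read on a cp1252 machine.
print(misread("Ελλάδα", "cp1253", "cp1252"))

# Russian saved as cp1251, read on a cp1253 machine.
print(misread("русский", "cp1251", "cp1253"))
```

Neither call raises an exception, which is exactly the problem: every byte involved is valid in both code pages.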

I could go on but you probably get the point; one would really have to rely on the invalid sequences, and some code pages (like 1252) do not have very many. Saving the file as a Unicode (meaning UTF-16LE) file might be the best way to avoid the potential bugs that come up later with these nonsense strings being propagated to your application.
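
As a rough illustration of that last suggestion, here is one way to re-save a source file as UTF-16LE with a byte order mark. It is a sketch under assumptions: the function name `save_as_utf16le` and the file names are hypothetical, and you have to know the code page the file was really written in, because nothing in the file will tell you.

```python
import codecs
import tempfile
from pathlib import Path


def save_as_utf16le(src: Path, dst: Path, source_codepage: str) -> None:
    """Re-encode a file from a known code page to UTF-16LE with a BOM."""
    # Strict decoding: raises UnicodeDecodeError on bytes the code page
    # does not assign, instead of silently writing garbage.
    text = src.read_bytes().decode(source_codepage)
    dst.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))


# Demo on a throwaway file that was "saved" on a Greek (cp1253) system.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "greek_ansi.cpp"      # hypothetical file names
    dst = Path(tmp) / "greek_utf16.cpp"
    src.write_bytes('const wchar_t* s = L"Ελλάδα";\r\n'.encode("cp1253"))
    save_as_utf16le(src, dst, "cp1253")
    print(dst.read_bytes()[:2])                 # the BOM bytes
    print(dst.read_bytes().decode("utf-16"))    # same text, any code page
```

Working on raw bytes rather than text-mode file objects keeps the original line endings untouched.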

 

This post brought to you by "€" (U+20ac, a.k.a. EURO SIGN)
Note that U+20ac does not exist on code page 932!


# Dean Harding on 9 Jan 2005 2:54 PM:

I'd say you probably shouldn't be putting these characters in there in the first place. If they're going to be displayed to a user, they should be in a resource file (I can forgive English-speaking people for doing it, since we seem to think the whole world should just speak English and be done with all these problems ;). Of course, they could very well be in comments or the names of identifiers, but then it wouldn't really matter what the compiler saw the text as, as long as it sees the same thing each time for identifiers (i.e. it'd be a problem if source file a.cpp was in the local code page and source file b.cpp was in Unicode, and both referenced the same non-ASCII identifier.)

# Michael Kaplan on 9 Jan 2005 3:23 PM:

It's definitely not only a problem for English speakers -- I have seldom seen code coming out of Japan or Korea or PRC or Taiwan that did not have some ideographs in the comments -- and non-Unicode files cause them to not behave well on non-CJK systems....

But I agree with you that hardcoded strings are probably a no-no, even though people do them all the time.

# Anonymous on 11 Sep 2005 4:19 AM:

Yesterday, Buck Hodges was talking about how TFS Version Control determines a file's encoding: ...

# handan on 15 Jul 2008 11:25 PM:

ie6,7 display javascript error

invalid characters.

why?

can you tell me ?

tks!


