A few of the gotchas of MultiByteToWideChar

by Michael S. Kaplan, published on 2005/04/19 04:30 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/19/409566.aspx


As I mentioned yesterday, I have talked a bunch of times about how strings can exist in the world in different forms that are canonically equivalent according to Unicode and that actually look identical visually.

Yesterday, I mentioned it while I was talking about a few of the gotchas of WideCharToMultiByte. Today I thought I would talk about the other direction, the MultiByteToWideChar API.

First of all, almost all code pages are in Normalization Form C (a.k.a. precomposed) at all times (I will talk about the exceptions in a second). Of course Unicode (by which I mean UTF-16 Little Endian, which Microsoft always calls Unicode) can be either Form C (a.k.a. precomposed) or Form D (a.k.a. composite).

If you would like to choose, you get that option: you can pass either the MB_PRECOMPOSED or the MB_COMPOSITE flag. To keep your data consistent with the rest of the platform, I would recommend MB_PRECOMPOSED, but either one is legal (just not both).

There is also an MB_USEGLYPHCHARS flag. Now I already beat that particular horse to death when I answered the question "What the &%#$ does MB_USEGLYPHCHARS do?" So if you want to know more you can look there. You probably do not, at least I hope you do not....

Finally, there is the MB_ERR_INVALID_CHARS flag. The documentation says it all on this flag:

If the function encounters an invalid input character, it fails and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.

Now after the MultiByteToWideChar topic covers these four flags, it gets confusing. It says:

For the code pages in the following table, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS.

50220
50221
50222
50225
50227
50229
52936
54936
57002 through 57011
65000 (UTF7)
65001 (UTF8)
42 (Symbol)

Windows XP and later: MB_ERR_INVALID_CHARS is the only dwFlags value supported by Code page 65001 (UTF-8).

Call me crazy, but there probably was not a need to have the sentence before the table and the table conflict with the sentence after the table. It is kind of understandable, but as topics go it has the flavor of a WTF sentence, if you ask me!
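To make the quoted rules a little more concrete, here is a minimal, portable sketch that models only what the documentation text above says. It is not how the API is implemented and it does not call the API. The flag values match winnls.h; the helper names (isZeroFlagsCodePage, flagsAllowedByDocs) and the windowsXpOrLater parameter are assumptions made up for illustration.

```cpp
#include <cstdint>

// Flag values as defined in winnls.h, repeated here so the sketch
// compiles without the Windows headers.
constexpr std::uint32_t kMbPrecomposed     = 0x00000001; // MB_PRECOMPOSED
constexpr std::uint32_t kMbComposite       = 0x00000002; // MB_COMPOSITE
constexpr std::uint32_t kMbErrInvalidChars = 0x00000008; // MB_ERR_INVALID_CHARS

// True if the code page is listed in the "dwFlags must be zero" table.
bool isZeroFlagsCodePage(unsigned cp)
{
    switch (cp) {
    case 50220: case 50221: case 50222: case 50225: case 50227: case 50229:
    case 52936: case 54936: case 65000: case 65001: case 42:
        return true;
    default:
        return cp >= 57002 && cp <= 57011; // "57002 through 57011"
    }
}

// Hypothetical helper: answers "would the documented rules accept these flags?"
bool flagsAllowedByDocs(unsigned cp, std::uint32_t flags, bool windowsXpOrLater)
{
    // Either MB_PRECOMPOSED or MB_COMPOSITE is legal, just not both.
    if ((flags & kMbPrecomposed) && (flags & kMbComposite)) return false;

    // Code pages outside the table accept the flags.
    if (!isZeroFlagsCodePage(cp)) return true;

    // Code pages in the table: zero is always fine...
    if (flags == 0) return true;

    // ...and the sentence after the table carves out one exception.
    return cp == 65001 && windowsXpOrLater && flags == kMbErrInvalidChars;
}
```

For example, flagsAllowedByDocs(65001, kMbErrInvalidChars, true) is true, while the same flag with code page 65000 is rejected.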

It does end on a better note by defining what an invalid character is:

The function fails if MB_ERR_INVALID_CHARS is set and encounters an invalid character in the source string. An invalid character is either, a) a character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set, or b) for DBCS strings, a character which has a lead byte but no valid trailing byte. When an invalid character is found, and MB_ERR_INVALID_CHARS is set, the function returns 0 and sets GetLastError with the error ERROR_NO_UNICODE_TRANSLATION.
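Case (b) in that definition is easy to picture with a small, portable sketch. It assumes code page 932, whose lead bytes are 0x81-0x9F and 0xE0-0xFC and whose trail bytes are 0x40-0x7E and 0x80-0xFC. The helper names are hypothetical, and the check is purely structural: it says nothing about case (a), where a well-formed sequence simply has no mapping.

```cpp
#include <cstddef>
#include <string>

// Lead-byte ranges for code page 932.
bool isCp932LeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Trail-byte ranges for code page 932.
bool isCp932TrailByte(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Hypothetical helper: returns the offset of the first lead byte that is not
// followed by a valid trail byte, or -1 if every lead byte is paired.
long findBadLeadByte(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (!isCp932LeadByte(b))
            continue; // single-byte character, nothing to pair

        // A lead byte at the very end of the buffer has no trail byte at all.
        if (i + 1 >= bytes.size())
            return static_cast<long>(i);

        const unsigned char t = static_cast<unsigned char>(bytes[i + 1]);
        if (!isCp932TrailByte(t))
            return static_cast<long>(i);

        ++i; // skip the trail byte we just validated
    }
    return -1;
}
```

A buffer that was cut off in the middle of a double-byte character is the typical way to hit this: the last byte is a lead byte with nothing after it.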

Oh, and before that it talks about some security considerations (more on these another day).

I am forgetting something now. What was it?

Oh yeah, I was going to talk about the code pages that are not Normalization Form C.

Obviously there is UTF-7 (65000), UTF-8 (65001), and GB-18030 (54936). Since each of these code pages covers the entire Unicode repertoire, each can have characters in Unicode Normalization Form C, Form D, or any combination thereof. Some of the other code pages in the table above also fall into this category, but in the case of these three and all the rest, the MB_PRECOMPOSED and MB_COMPOSITE flags are both at best ignored and at worst will cause an ERROR_INVALID_FLAGS to be returned. So you will want to not pass either flag with any of them.
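As a small illustration of why the normalization form is a property of the data and not of these code pages, here is a portable sketch that encodes both forms of the same letter as UTF-8. The function name utf8FromBmp is made up for this example, and it handles only code points below U+10000 (no surrogates, no supplementary planes).

```cpp
#include <string>

// Encodes a string of BMP code points as UTF-8. Deliberately minimal:
// code points at or above U+10000 and surrogates are not handled.
std::string utf8FromBmp(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, the Form C string U+00C0 comes out as the bytes C3 80, and the Form D string U+0041 U+0300 comes out as 41 CC 80. Both are perfectly good UTF-8, so the code page alone cannot tell you which form you have.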

But there is one code page that can have data in either composite or precomposed form -- it is the Vietnamese ACP, code page 1258. It has all of the following entries:

CC = U+0300 : COMBINING GRAVE ACCENT
D2 = U+0309 : COMBINING HOOK ABOVE
DE = U+0303 : COMBINING TILDE
EC = U+0301 : COMBINING ACUTE ACCENT
F2 = U+0323 : COMBINING DOT BELOW

The reason for doing this is that there was really not enough room in the code page, otherwise. Unfortunately, there are also some precomposed characters with these accents:

C0 = U+00C0 : LATIN CAPITAL LETTER A WITH GRAVE
C1 = U+00C1 : LATIN CAPITAL LETTER A WITH ACUTE
C8 = U+00C8 : LATIN CAPITAL LETTER E WITH GRAVE
C9 = U+00C9 : LATIN CAPITAL LETTER E WITH ACUTE
CD = U+00CD : LATIN CAPITAL LETTER I WITH ACUTE
D1 = U+00D1 : LATIN CAPITAL LETTER N WITH TILDE
D3 = U+00D3 : LATIN CAPITAL LETTER O WITH ACUTE
D9 = U+00D9 : LATIN CAPITAL LETTER U WITH GRAVE
DA = U+00DA : LATIN CAPITAL LETTER U WITH ACUTE
E0 = U+00E0 : LATIN SMALL LETTER A WITH GRAVE
E1 = U+00E1 : LATIN SMALL LETTER A WITH ACUTE
E8 = U+00E8 : LATIN SMALL LETTER E WITH GRAVE
E9 = U+00E9 : LATIN SMALL LETTER E WITH ACUTE
ED = U+00ED : LATIN SMALL LETTER I WITH ACUTE
F1 = U+00F1 : LATIN SMALL LETTER N WITH TILDE
F3 = U+00F3 : LATIN SMALL LETTER O WITH ACUTE
F9 = U+00F9 : LATIN SMALL LETTER U WITH GRAVE
FA = U+00FA : LATIN SMALL LETTER U WITH ACUTE

So it looks like maybe you could have mixed "Form C" and "Form D" code page 1258 text, doesn't it?

Unfortunately, it's not that perfect. There are two error patterns in the results below, one in each direction:

0xc0 with MultiByteToWideChar/MB_PRECOMPOSED --> U+00c0
0xc0 with MultiByteToWideChar/MB_COMPOSITE --> U+0041 U+0300
0x41 0xcc with MultiByteToWideChar/MB_PRECOMPOSED --> U+0041 U+0300
0x41 0xcc with MultiByteToWideChar/MB_COMPOSITE --> U+0041 U+0300

and going the other way:

U+00c0 with WideCharToMultiByte/WC_COMPOSITECHECK --> 0xc0
U+00c0 with WideCharToMultiByte --> 0x41 0xcc
U+0041 U+0300 with WideCharToMultiByte/WC_COMPOSITECHECK --> 0xc0
U+0041 U+0300 with WideCharToMultiByte --> 0xc0

The pattern is clear, right? MultiByteToWideChar is not quite smart enough to precompose in Unicode what is composite in cp1258, and WideCharToMultiByte is not quite smart enough to keep composite what is composite in Unicode.
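Here is a minimal, portable sketch of the MultiByteToWideChar half of that pattern. It uses only ASCII plus a handful of the cp1258 entries listed above, and it does not call the real API. The function names are hypothetical: decodeByteByByte mimics the byte-at-a-time result shown above for MB_PRECOMPOSED, and composeGrave is the extra step a "smarter" conversion would need (limited here to the grave accent).

```cpp
#include <cstddef>
#include <string>

// Maps one cp1258 byte to a code point, using ASCII plus a subset of the
// entries quoted in the post. Anything else becomes U+FFFD in this sketch.
char32_t decodeCp1258Byte(unsigned char b)
{
    if (b < 0x80) return b;
    switch (b) {
    case 0xC0: return 0x00C0; // LATIN CAPITAL LETTER A WITH GRAVE
    case 0xC8: return 0x00C8; // LATIN CAPITAL LETTER E WITH GRAVE
    case 0xD9: return 0x00D9; // LATIN CAPITAL LETTER U WITH GRAVE
    case 0xE0: return 0x00E0; // LATIN SMALL LETTER A WITH GRAVE
    case 0xE8: return 0x00E8; // LATIN SMALL LETTER E WITH GRAVE
    case 0xF9: return 0x00F9; // LATIN SMALL LETTER U WITH GRAVE
    case 0xCC: return 0x0300; // COMBINING GRAVE ACCENT
    case 0xD2: return 0x0309; // COMBINING HOOK ABOVE
    case 0xDE: return 0x0303; // COMBINING TILDE
    case 0xEC: return 0x0301; // COMBINING ACUTE ACCENT
    case 0xF2: return 0x0323; // COMBINING DOT BELOW
    default:   return 0xFFFD; // not covered by this sketch
    }
}

// One byte in, one code point out: composite input stays composite.
std::u32string decodeByteByByte(const std::string& bytes)
{
    std::u32string out;
    for (char c : bytes)
        out += decodeCp1258Byte(static_cast<unsigned char>(c));
    return out;
}

// Folds base letter + U+0300 into the precomposed letter when one of the
// entries above provides it; everything else passes through untouched.
std::u32string composeGrave(const std::u32string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t pre = 0;
        if (i + 1 < in.size() && in[i + 1] == 0x0300) {
            switch (in[i]) {
            case U'A': pre = 0x00C0; break;
            case U'E': pre = 0x00C8; break;
            case U'U': pre = 0x00D9; break;
            case U'a': pre = 0x00E0; break;
            case U'e': pre = 0x00E8; break;
            case U'u': pre = 0x00F9; break;
            default: break;
            }
        }
        if (pre != 0) {
            out += pre;
            ++i; // consume the combining accent too
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Decoding 0x41 0xcc byte by byte gives U+0041 U+0300, which is the not-quite-precomposed result described above; running composeGrave over that output gives U+00C0.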

Ah well, nothing is perfect -- the Vietnamese code page is missing some characters used in Vietnamese, anyway.

But the real reason for these combining characters is to handle the many letters used in Vietnamese that have double diacritics on them -- the cases of dual representations are somewhat accidental, all things considered, in the face of the need to support letters like "ẳằẵắặầẩẫấậ" and so forth....

 

This post brought to you by "À" (U+00c0, a.k.a. LATIN CAPITAL LETTER A WITH GRAVE)


# Lionel Fourquaux on 19 Apr 2005 7:54 AM:

"Some of the other code pages in the table above also fall into this category, but in the case of these three and all the rest, the MB_PRECOMPOSED and MB_COMPOSITE flags are both at best ignored and at worst will cause an ERROR_INVALID_FLAGS to be returned. So you will want to not pass either flag with any of them."

I've been wondering since I read the documentation for MultiByteToWideChar, whether there is some special reason for this limitation. I think it would be useful to be able to convert from an arbitrary code page to a given unicode normalization form using one API call. Am I missing some problem?

Another drawback of MultiByteToWideChar and WideCharToMultiByte you don't speak about is that they are not designed for streaming conversions (for huge text documents).

# Jochen Kalmbach on 1 Jul 2005 3:48 AM:

Just a small addition:
The "MB_ERR_INVALID_CHARS" flag for UTF-8 is valid for Windows 2000 SP4 and later!

In Windows XP there is a bug in MultiByteToWideChar.
Normally I would assume that if I call this function with cchWideChar=0 it will return the number of wchars required. And a second call (with the same parameters and the requested buffer-size) should succeed.
But the following example does fail:
DWORD gle;
const char sz[] = "A\xC2"; // the '\xC2' is a UTF-8 lead byte
// First call: ask for the required size in WCHARs.
int iL1 = MultiByteToWideChar(CP_UTF8, 0, sz, (int)strlen(sz), NULL, 0);
WCHAR* wsz = new WCHAR[iL1];
assert(iL1 == (int)strlen(sz) - 1);
// Second call: same parameters, buffer of the size just reported.
int iL2 = MultiByteToWideChar(CP_UTF8, 0, sz, (int)strlen(sz), wsz, iL1);
if (iL2 == 0)
    gle = GetLastError();
assert(iL1 == iL2);
delete[] wsz;


It works correctly on Windows 2000 and 2003.

# kurakuraninja on 27 Aug 2005 10:57 PM:

Back in May of 2004, Quan Nguyen sent a message to Dr. International about Vietnamese collation...

# Bilal on 20 Jun 2008 3:18 AM:

my multi-byte string contains NULL characters 0x00 within it.

It appears that MultiByteToWideChar does not work on the FULL string, it stops as soon as it encounters the first NULL byte in the multi-byte string, although i've passed the full length of the multi-byte buffer to MultiByteToWideChar in the fourth parameter.

Any clues???

# Michael S. Kaplan on 20 Jun 2008 3:32 AM:

What code page are you using? And what string, exactly?

# Bilal on 20 Jun 2008 4:07 AM:

code page 932.

the multi-byte string is basically from a file, that i've read in a char buffer.

actually i'm detecting the code page for the data from that file, by implementing the technique "Detecting a String's Character Set" given at the following URL:

http://www.microsoft.com/globaldev/DrIntl/columns/019/default.mspx

# Bilal on 20 Jun 2008 5:50 AM:

There is a file on the disk, that I read in a char buffer using MFC's CFile class. Now this file has some NULL bytes in it.

I then pass this buffer to MultiByteToWideChar.

I'm actually trying to detect the code page for the text in the file using the technique described under heading "Detecting a String's Character Set" at the following URL:

http://www.microsoft.com/globaldev/DrIntl/columns/019/default.mspx

# Bilal on 23 Jun 2008 12:48 AM:

Any clues what I should do?

# Michael S. Kaplan on 23 Jun 2008 1:29 AM:

Strictly speaking, if there are embedded NULLs then it is not a cpg932 file. :-)

But I have it on my list of things to try out when I have a moment (since you didn't provide specific information to build the file for the repro like I asked or specific code you used for the MultiByteToWideChar call, this might take longer since it will have to wait until I have some real time to try it out, just in case).

In the meantime, best thing might be to stick to valid files...

# Bilal on 23 Jun 2008 4:32 AM:

I'm thankful that you are taking out time to help me out.

I can send you the file; if you open the file in MS Word, it identifies its encoding as SHIFT JIS. If you open the file in some Hex Editor, it will show you the NULLs in it. For that i need some email address of yours.

I'm pasting the code for file reading and the round trip conversion.

// **** Conversion from multi-byte to Unicode and back **** //

char *pBuffer;
ULONG ulStreamLength;

ReadFileInBuffer( pBuffer, ulStreamLength );  //Beware! Template fn

WCHAR *szUnicodeString = new WCHAR[ ulStreamLength + 1 ];
char* szANSI2 = new char[ ulStreamLength + 1 ];

for( iIndex = 0; iIndex < gICodePagesArray.GetCount(); iIndex++ )
{
    // next code page from the array (indexing by iIndex is assumed)
    m_iFileEncodingFormat = gICodePagesArray[iIndex];

    wmemset( szUnicodeString, 0, ulStreamLength + 1 );
    memset( szANSI2, 0, ulStreamLength + 1 );

    MultiByteToWideChar( m_iFileEncodingFormat, 0, pBuffer, ulStreamLength + 1, szUnicodeString, ulStreamLength + 1 );
    WideCharToMultiByte( m_iFileEncodingFormat, 0, szUnicodeString, ulStreamLength + 1, szANSI2, ulStreamLength + 1, NULL, NULL );

    // round trip reproduced the original bytes
    if( memcmp( pBuffer, szANSI2, ulStreamLength ) == 0 )
    {
        bDefaultFound = TRUE;
        break;
    }
}

// **** File Reading **** //

template <class T> void
ECR::ReadFileInBuffer( T* &pBuffer, ULONG& ulStreamLength )
{
    CFile tempFile;
    INT nBytesRead = 0;

    if( !tempFile.Open( m_strFilePath, CFile::modeRead | CFile::shareDenyWrite ) )
    {
        // throw exception
    }

    ulStreamLength = tempFile.GetLength();
    pBuffer = new T[ ulStreamLength + 1 ];
    memset( pBuffer, 0, ulStreamLength + 1 );

    nBytesRead = tempFile.Read( pBuffer, (UINT)ulStreamLength );
    tempFile.Close();
    //end reading file closed

    pBuffer[ulStreamLength] = 0;//insert NULL at last position
}

# Bilal on 23 Jun 2008 4:59 AM:

my bad, the MultiByteToWideChar and WideCharToMultiByte work correctly, there is a bug in my code.

please ignore my last post!

sorry for the inconvenience

