Getting the characters in a code page

by Michael S. Kaplan, published on 2006/01/07 10:20 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/07/510411.aspx

Indeed, over in Raymond Chen's blog, there is an off-topic conversation going on in the comments about how to get the characters in a code page.

You can look at the conversation if you want; you will generally see people taking the wrong approach to the problem (in my opinion). Let's look at three possible ways to do the job:

#1 -- Take all of the lead bytes for the code page via the GetCPInfo function and try each possible trail byte, converting each sequence via MultiByteToWideChar. As the conversation sort of implied, there are many complications with this approach, although it is possible if you carefully check the return values to make sure only one character comes back, etc. Also, since I said there were three possible ways, what were the odds that the first one would be my recommendation? :-)

#2 -- Take everything in the Unicode BMP (0x0000 to 0xFFFF) and try to round trip it through WideCharToMultiByte and MultiByteToWideChar; if it round trips and it is a single character, then clearly it is on the code page. This approach is more feasible although it is more work than you really need to do, so I would not recommend this one either.

#3 -- Once again take everything in the Unicode BMP (0x0000 to 0xFFFF), and again use WideCharToMultiByte, but this time make use of the WC_NO_BEST_FIT_CHARS and WC_DEFAULTCHAR flags to make sure that no best fit mappings take place and that you replace anything not in the code page with the default character. Then, by using the lpUsedDefaultChar parameter, you will know whether the character was not in the code page.
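To make the byte-enumeration idea in #1 concrete, here is a rough sketch. It is written in Python against Python's own cp932 codec rather than GetCPInfo and MultiByteToWideChar, so that it runs anywhere. The function names are made up for illustration, and treating "a byte that fails to decode on its own" as a lead byte is an assumption standing in for the lead byte ranges that GetCPInfo would report. Python's tables may also differ from the Windows ones for a few code points.

```python
def decode_strict(raw, encoding):
    """Return the decoded text, or None if the bytes are not valid in the codec."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


def chars_by_byte_enumeration(encoding="cp932"):
    """Map each valid 1- or 2-byte sequence to the single character it decodes to."""
    table = {}
    for lead in range(0x100):
        single = decode_strict(bytes([lead]), encoding)
        if single is not None:
            # The byte is a complete character by itself.
            if len(single) == 1:
                table[bytes([lead])] = single
            continue
        # Assumption: a byte that fails alone is a candidate lead byte,
        # so try it with every possible trail byte.
        for trail in range(0x100):
            raw = bytes([lead, trail])
            text = decode_strict(raw, encoding)
            # Keep only sequences that come back as exactly one character.
            if text is not None and len(text) == 1:
                table[raw] = text
    return table


if __name__ == "__main__":
    table = chars_by_byte_enumeration("cp932")
    print(len(table), "byte sequences decoded to a single character")
```

Even in this small form, the extra bookkeeping (candidate lead bytes, invalid trail bytes, checking that exactly one character came back) is visible.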

The advantages of #3 over the other two methods are obvious -- you will get every character in the code page, and any time a character is not valid you will know by directly checking a Boolean flag. For cp932 (the one in the example) and all of the "ANSI" and "OEM" code pages on Windows, there would never be more than two bytes per character, so a two-byte lpMultiByteStr target buffer would be enough (for some of the others the job is a bit harder, but it is unclear whether that is being asked). You could even try the same run a second time without the WC_NO_BEST_FIT_CHARS flag and then compare the two to obtain all of the best fit mappings in the code page. And in short order you would have every character in the code page mapping.
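As a portable illustration of the shape of #3, the sketch below walks the BMP and keeps whatever encodes cleanly. It is not the Win32 call sequence: Python's strict error handler is used here in a role loosely similar to WC_NO_BEST_FIT_CHARS plus the lpUsedDefaultChar check, the function name is hypothetical, and Python's cp932 tables are not guaranteed to match the Windows ones exactly.

```python
def chars_in_code_page(encoding="cp932", limit=0x10000):
    """Return {character: encoded bytes} for each BMP character the codec accepts."""
    table = {}
    for code_point in range(limit):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # lone surrogates are not characters
        ch = chr(code_point)
        try:
            # Strict mode raises instead of substituting a default character.
            table[ch] = ch.encode(encoding, errors="strict")
        except UnicodeEncodeError:
            pass  # not in the code page, move on to the next one
    return table


if __name__ == "__main__":
    table = chars_in_code_page("cp932")
    print(len(table), "BMP characters are in the code page")
    print("U+3042 encodes as", table["\u3042"].hex())
```

The loop is one pass with one yes/no answer per character, which is the whole appeal of the approach.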

Easy! And perhaps even a good interview question, ignoring the fact that the candidate would have to come in knowing all about the NLS API functions!

As a variation on #1, how about passing each of the possible lead+trail byte combos to _ismbclegal?

The only problem (with all of the solutions) is that not all code pages can be covered like this. The main exception is GB-18030, which has characters outside BMP.

Sure, one might extend the range beyond BMP, but the performance goes down and the consumed memory goes up.
In this case a variant of #1 might give better results (with the warning that the complications and the care needed will be even bigger) :-)

For both GB18030 and UTF-8, *all* of Unicode is covered, so if it is a character, it's in the code page. And that's that. Easy!

#1 is never easier, though -- it is always more work....

Mike, that is still more work by the time you are done, certainly more complicated to implement....

"For both GB18030 and UTF-8, *all* of Unicode is covered"
True. But the proposed solutions #2 and #3 take BMP only.

"#1 is never easier, though -- it is always more work...."
True again :-)
But I was talking about better results in performance and consumed memory.
Sometimes you have to work more for these two.

Ah, you missed my point -- you do not need to do anything for UTF-8 and GB18030 -- no conversion needed at all!

It is all in there.

"Ah, you missed my point"
True :-) But now I get it.

But do you mean GB18030 contains everything that is in Unicode? I thought it is a (big) subset. So it is a bit like 932, only much bigger, but still not covering all of Unicode. Or is it?

GB18030 is completely tied to Unicode as it is defined, and thus everything in Unicode is in GB18030.
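A quick spot check of that claim, again sketched with Python's own gb18030 codec rather than the Windows one (the helper name and the sample code points are just illustrative choices): both BMP and supplementary characters go out to bytes and come back unchanged.

```python
def gb18030_round_trips(code_point):
    """True if the character survives encode then decode through GB18030."""
    ch = chr(code_point)
    return ch.encode("gb18030").decode("gb18030") == ch


# A few BMP characters plus the first, a middle, and the last supplementary code point.
SAMPLES = [0x41, 0x3042, 0x4E2D, 0x20AC, 0x10000, 0x1F600, 0x10FFFF]

if __name__ == "__main__":
    for cp in SAMPLES:
        encoded = chr(cp).encode("gb18030")
        print(f"U+{cp:04X} -> {encoded.hex()} round trips: {gb18030_round_trips(cp)}")
```

This only samples a handful of code points; it illustrates the point rather than proving it.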

Michael,

Thanks for the explanation. If I use recommendation #3, how would I distinguish which characters belong to which code page? Would I use the hex range to determine the code page?

Thanks
Rob

Hi Rob,

For #3 -- check the return value of the WideCharToMultiByte call -- if the conversion succeeds, it is a valid part of the code page you used to convert, and the byte(s) are in the multibyte param. If it fails, then skip to the next character....

This works... until a code page comes out with a supplementary code point in it...

Michael,

Thanks for the info. I'll try #3 and store all the valid code points in a vector of CString, then print them to a file and open it in IE. Viewing it in IE will allow me to see all the valid characters.

Rob

Hi Michael,

Any code sample(s) for recommendation #3? Thanks in advance if you have any.

Thanks
CK
