Getting the characters in a code page

by Michael S. Kaplan, published on 2006/01/07 10:20 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/07/510411.aspx


In the Suggestion Box, rob asked the following question:

Michael,

As posted over at Raymond Chen's blog. What is the best way to display all the characters in i.e. codepage 932 (Japanese) and other codepage that is supported on Windows (post win2k era). The characters doesn't have to display in any fancy format. I just want the result (characters) store in a vector of string or a string table. Any recommendation? Or resources that you can point me to.

Thanks
Rob

Indeed, over in Raymond Chen's blog, there is an off-topic conversation going on in the comments about how to get the characters in a code page.

You can look at the conversation if you want; you will generally see people taking the wrong approach to the problem (in my opinion). Let's look at three possible ways to do the job:

#1 -- Take all of the lead bytes for the code page via the GetCPInfo function and try each possible trail byte, converting each sequence via MultiByteToWideChar. As the conversation sort of implied, there are many complications with this approach, although it is possible if you carefully check the return values to make sure only one character comes back, etc. Also, since I said there were three possible ways, what were the odds that the first one would be my recommendation? :-)

#2 -- Take everything in the Unicode BMP (0x0000 to 0xFFFF) and try to round trip it through WideCharToMultiByte and MultiByteToWideChar; if it round trips and it is a single character, then clearly it is on the code page. This approach is more feasible although it is more work than you really need to do, so I would not recommend this one either.

#3 -- Once again take everything in the Unicode BMP (0x0000 to 0xFFFF), and again use WideCharToMultiByte, but this time make use of the WC_NO_BEST_FIT_CHARS and WC_DEFAULTCHAR flags to make sure that no best fit mappings take place and that you replace anything not in the code page with the default character. Then, by using the lpUsedDefaultChar parameter, you will know whether the character was not in the code page.
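To make the byte-walking idea in #1 concrete, here is a minimal sketch. It is not the Win32 code: it swaps in Python's built-in "cp932" codec for GetCPInfo and MultiByteToWideChar, so the table it walks may differ in details from the Windows NLS one. The function name chars_by_decoding is hypothetical, and the "lead byte" test is simply "this byte does not decode on its own".

```python
# Portable analogue of approach #1: enumerate byte sequences and decode them.
# Assumption: Python's "cp932" codec stands in for the Windows code page.

def chars_by_decoding(codec="cp932"):
    """Map each valid one- or two-byte sequence to the character it decodes to."""
    found = {}
    for lead in range(0x100):
        single = bytes([lead])
        try:
            text = single.decode(codec)  # strict mode raises on an incomplete sequence
        except UnicodeDecodeError:
            text = None
        if text is not None:
            if len(text) == 1:           # keep only "exactly one character came back"
                found[single] = text
            continue
        # The byte did not decode alone, so try it as a lead byte with every trail byte.
        for trail in range(0x100):
            pair = bytes([lead, trail])
            try:
                text = pair.decode(codec)
            except UnicodeDecodeError:
                continue
            if len(text) == 1:
                found[pair] = text
    return found


by_bytes = chars_by_decoding()
```

Even in this small form there are two nested loops, two failure paths, and a length check, which is the sort of bookkeeping that #1 requires.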

The advantages of #3 over the other two methods are obvious -- you will get every character in the code page, and any time a character is not valid you will know by directly checking a Boolean flag. For cp932 (the one in the example) and all of the "ANSI" and "OEM" code pages on Windows, there would never be more than two bytes per character, so a two-byte buffer would cover the lpMultiByteStr target (for some of the others the job is a bit harder, but it is unclear whether that is being asked). You could even try the same run a second time without the WC_NO_BEST_FIT_CHARS flag and then compare the two to obtain all of the best fit mappings in the code page. And in short order you would have every character in the code page mapping.
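The shape of #3 can be sketched in a few lines. This is a portable analogue rather than the Win32 call: it assumes Python's built-in "cp932" codec in place of WideCharToMultiByte, and its table may not match the Windows one exactly. A strict encode never does best-fit mapping, which plays the role of WC_NO_BEST_FIT_CHARS here, and the UnicodeEncodeError plays the role of the lpUsedDefaultChar flag. The name chars_by_encoding is hypothetical.

```python
# Portable analogue of approach #3: walk the BMP and try to encode each character.
# Assumption: Python's "cp932" codec stands in for the Windows code page.

def chars_by_encoding(codec="cp932"):
    """Map each BMP character the codec can encode to its byte sequence."""
    found = {}
    for code_point in range(0x10000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue                      # lone surrogates are not characters
        ch = chr(code_point)
        try:
            found[ch] = ch.encode(codec)  # strict: no best-fit substitution
        except UnicodeEncodeError:
            pass                          # analogue of "the default character was used"
    return found


table = chars_by_encoding()
longest = max(len(encoded) for encoded in table.values())
```

One loop and one yes/no signal per character is all there is. The longest value can be checked against the two-byte claim above. The second pass for best-fit mappings has no equivalent in this sketch, since Python's codecs do not do best-fit mapping.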

Easy! and perhaps even a good interview question, ignoring the fact that the candidate would have to come in knowing all about the NLS API functions!

Now note that this approach will not get you all of the characters in a language, not only because you can't get the letters in a language easily but also because code pages are really not enough to cover a language.

 

This post brought to you by "ﾙ" (U+ff99, HALFWIDTH KATAKANA LETTER RU)


# Michael Dunn_ on 8 Jan 2006 2:52 AM:

As a variation on #1, how about passing each of the possible lead+trail byte combos to _ismbclegal?

# Mihai on 8 Jan 2006 4:08 AM:

The only problem (with all of the solutions) is that not all code pages can be covered like this. The main exception is GB-18030, which has characters outside BMP.

Sure, one might extend the range beyond BMP, but the performance goes down and the consumed memory goes up.
In this case a variant of #1 might give better results (with the warning that the complications and care will be even bigger :-)

# Michael S. Kaplan on 8 Jan 2006 9:53 AM:

For both GB18030 and UTF-8, *all* of Unicode is covered, so if it is a character, it's in the code page. And that's that. Easy!

#1 is never easier, though -- it is always more work....

# Michael S. Kaplan on 8 Jan 2006 9:54 AM:

Mike, that is still more work by the time you are done, certainly more complicated to implement....

# Mihai on 8 Jan 2006 3:24 PM:

"For both GB18030 and UTF-8, *all* of Unicode is covered"
True. But the proposed solutions #2 and #3 take BMP only.

"#1 is never easier, though -- it is always more work...."
True again :-)
But I was talking about better results in performance and consumed memory.
Sometimes you have to work more for these two.

# Michael S. Kaplan on 8 Jan 2006 3:52 PM:

Ah, you missed my point -- you do not need to do anything for UTF-8 and GB18030 -- no conversion needed at all!

It is all in there.

# Mihai on 9 Jan 2006 3:11 AM:

"Ah, you missed my point"
True :-) But now I get it.

But do you mean GB18030 contains everything that is in Unicode? I thought it is a (big) subset. So it is a bit like 932, only much bigger, but still not covering all of Unicode. Or is it?

# Michael S. Kaplan on 9 Jan 2006 6:56 AM:

GB18030 is completely tied to Unicode as it is defined, and thus everything in Unicode is in GB18030.
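
For readers who want to see this for themselves, here is a small spot check. It is a sketch that assumes Python's built-in "gb18030" and "utf-8" codecs rather than the Windows conversion tables, and it tests only a handful of sample code points (chosen arbitrarily for illustration), so it illustrates the claim rather than proving it.

```python
# Spot check: a few BMP and supplementary code points round-trip through
# GB18030 and UTF-8, while cp932 rejects a supplementary one.
# Assumption: Python's codecs stand in for the Windows code pages.

SAMPLES = [0x0041, 0x3042, 0xFF99, 0x20000, 0x2A6D6, 0x1F600]

def round_trips(code_point, codec):
    """True if the character survives encode followed by decode unchanged."""
    ch = chr(code_point)
    return ch.encode(codec).decode(codec) == ch

coverage = {
    codec: all(round_trips(cp, codec) for cp in SAMPLES)
    for codec in ("utf-8", "gb18030")
}

try:
    chr(0x20000).encode("cp932")
    cp932_has_supplementary = True
except UnicodeEncodeError:
    cp932_has_supplementary = False
```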

# Rob on 9 Jan 2006 3:14 PM:

Michael,

Thanks for the explanation. If I use recommendation #3 how would I distinguish what characters belong to what code page? Would I use the Hex range for determine the code page?

Thanks
Rob

# Michael S. Kaplan on 9 Jan 2006 3:40 PM:

Hi Rob,

For #3 -- check the return value of the WideCharToMultiByte call -- if the conversion succeeds, it is a valid part of the code page you used to convert, and the byte(s) are in the multibyte param. If it fails, then skip to the next character....

# Maurits [MSFT] on 9 Jan 2006 4:29 PM:

This works... until a code page comes out with a supplementary code point in it...

# Michael S. Kaplan on 9 Jan 2006 4:33 PM:

There are no new code pages coming in Windows, sorry! :-)

# Rob on 10 Jan 2006 12:32 PM:

Michael,

Thanks for the info. I'll try #3 and store all the valid code point into a vector of CString and then print them into a file and open them up in IE. Viewing in IE will allow me to see all the valid characters.

Rob

# CK on 19 Jan 2006 2:56 PM:

Hi Michael,

Any code sample(s) for recommendation #3? Thanks in advance if you have any.

Thanks
CK

referenced by

2006/01/20 Getting the characters in a code page (the code)
