by Michael S. Kaplan, published on 2006/01/07 10:20 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/07/510411.aspx
In the Suggestion Box, rob asked the following question:
Michael,
As posted over at Raymond Chen's blog. What is the best way to display all the characters in i.e. codepage 932 (Japanese) and other codepage that is supported on Windows (post win2k era). The characters doesn't have to display in any fancy format. I just want the result (characters) store in a vector of string or a string table. Any recommendation? Or resources that you can point me to.
Thanks
Rob
Indeed, over in Raymond Chen's blog, there is an off-topic conversation going on in the comments about how to get the characters in a code page.
You can look at the conversation if you want; you will generally see people taking the wrong approach to the problem (in my opinion). Let's look at three possible ways to do the job:
#1 -- Take all of the lead bytes for the code page via the GetCPInfo function and try each possible trail byte, converting each sequence via MultiByteToWideChar. As the conversation sort of implied, there are many complications with this approach, although it is possible if you carefully check the return values to make sure only one character comes back, etc. Also, since I said there were three possible ways, what were the odds that the first one would be my recommendation? :-)
#2 -- Take everything in the Unicode BMP (0x0000 to 0xFFFF) and try to round trip it through WideCharToMultiByte and MultiByteToWideChar; if it round trips and it is a single character, then clearly it is on the code page. This approach is more feasible although it is more work than you really need to do, so I would not recommend this one either.
#3 -- Once again take everything in the Unicode BMP (0x0000 to 0xFFFF), and again use WideCharToMultiByte, but this time make use of the WC_NO_BEST_FIT_CHARS and WC_DEFAULTCHAR flags to make sure that no best fit mappings take place and that you replace anything not in the code page with the default character. Then, by using the lpUsedDefaultChar parameter, you will know whether the character was not in the code page.
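To make the complications in #1 a little more concrete, here is a rough sketch of the byte-driven enumeration. It is not the Win32 code: it is a portable Python analogue in which the built-in cp932 codec stands in for MultiByteToWideChar, and the lead-byte ranges are hard-coded as an assumption rather than fetched from GetCPInfo. The helper names (is_lead_byte, decode_single, chars_by_byte_enumeration) are made up for illustration, and Python's table may not match the Windows one in every detail.

```python
# Lead-byte ranges for code page 932, hard-coded as an assumption for this
# sketch; on Windows they would come from the LeadByte array of GetCPInfo.
LEAD_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]


def is_lead_byte(value):
    return any(low <= value <= high for low, high in LEAD_BYTE_RANGES)


def decode_single(raw, codec="cp932"):
    """Return the one character that raw decodes to, or None."""
    try:
        text = raw.decode(codec)  # strict mode: invalid sequences raise
    except UnicodeDecodeError:
        return None
    # Mirrors "make sure only one character comes back".
    return text if len(text) == 1 else None


def chars_by_byte_enumeration(codec="cp932"):
    """Map each decodable character to the byte sequences that produce it."""
    found = {}
    for first in range(0x100):
        if is_lead_byte(first):
            candidates = [bytes([first, trail]) for trail in range(0x100)]
        else:
            candidates = [bytes([first])]
        for raw in candidates:
            ch = decode_single(raw, codec)
            if ch is not None:
                found.setdefault(ch, []).append(raw)
    return found


if __name__ == "__main__":
    found = chars_by_byte_enumeration()
    duplicates = sum(1 for seqs in found.values() if len(seqs) > 1)
    print(len(found), "characters,", duplicates, "with more than one byte sequence")
```

Even in this small form you have to reject invalid trail bytes, check that exactly one character came back, and decide what to do when more than one byte sequence decodes to the same character.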
The advantages of #3 over the other two methods are obvious -- you will get every character in the code page, and any time a character is not valid you will know by directly checking a Boolean flag. For cp932 (the one in the example) and all of the "ANSI" and "OEM" code pages on Windows, there would never be more than two bytes per character, so a two-byte buffer would cover the lpMultiByteStr target (for some of the others the job is a bit harder, but it is unclear whether that is being asked). You could even try the same run a second time without the WC_NO_BEST_FIT_CHARS flag and then compare the two to obtain all of the best fit mappings in the code page. And in short order you would have every character in the code page mapping.
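The shape of #3 can be sketched without Windows at all. What follows is a Python analogue, not the WideCharToMultiByte call itself: the built-in cp932 codec in strict mode plays the role of WC_NO_BEST_FIT_CHARS plus the lpUsedDefaultChar check, since an unmappable character raises an error instead of being quietly substituted. The function names are hypothetical, and I am assuming Python's cp932 table tracks the Windows one closely enough for illustration; it has no best fit mappings, so the "second run" comparison is not covered here.

```python
def encode_strict(ch, codec="cp932"):
    """Return the bytes for ch, or None if the codec has no exact mapping."""
    try:
        return ch.encode(codec, errors="strict")
    except UnicodeEncodeError:
        # Analogue of lpUsedDefaultChar coming back TRUE.
        return None


def chars_in_code_page(codec="cp932"):
    """Walk the BMP and keep every character that encodes exactly."""
    table = {}
    for code_point in range(0x10000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # lone surrogates are not characters; skip them
        ch = chr(code_point)
        raw = encode_strict(ch, codec)
        if raw is not None:
            table[ch] = raw
    return table


if __name__ == "__main__":
    table = chars_in_code_page()
    print(len(table), "BMP characters have an exact mapping")
    print("longest encoding:", max(len(raw) for raw in table.values()), "bytes")
```

The result is the "vector of string or a string table" from the original question: one pass over the BMP, one yes/no answer per character, and no need to reason about lead and trail bytes.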
Easy! And perhaps even a good interview question, ignoring the fact that the candidate would have to come in knowing all about the NLS API functions!
Now note that this approach will not get you all of the characters in a language, not only because you can't get the letters in a language easily but also because code pages are really not enough to cover a language.
This post brought to you by "ﾙ" (U+FF99, HALFWIDTH KATAKANA LETTER RU)