Getting the characters in a code page (the code)

by Michael S. Kaplan, published on 2006/01/20 00:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2006/01/20/515238.aspx


Just recently I posted about Getting the Characters in a Code Page, and described what I thought was the best solution:

#3 -- Once again take everything in the Unicode BMP (0x0000 to 0xFFFF), and again use WideCharToMultiByte, but this time make use of the WC_NO_BEST_FIT_CHARS and WC_DEFAULTCHAR flags to make sure that no best fit mappings take place and that you replace anything not in the code page with the default character. Then, by using the lpUsedDefaultChar parameter, you will know whether the character was not in the code page.

But someone named CK was looking for more information on how to implement this, asking for a code sample in a comment (and then in the Suggestion Box 34 minutes later!). :-)

Here is the sort of thing I had in mind:

#define _UNICODE
#define UNICODE
#include <stdio.h>
#include <windows.h>

int main(void)
{
    BOOL fDefaultChar = FALSE;
    WCHAR ucp;
    char ch[2];
    int cch = 0;

    // You could use "for(ucp = 0x0000; ucp < 0xffff; ++ucp)"
    // to include the NULL if you would like to....
    for(ucp = 0xffff; ucp > 0x0000; ucp--)
    {
        int cb;

        ch[0] = 0;
        ch[1] = 0;

        cb = WideCharToMultiByte(
            932,
            WC_NO_BEST_FIT_CHARS | WC_COMPOSITECHECK | WC_DEFAULTCHAR,
            &ucp,
            1,
            ch,
            2,
            NULL,
            &fDefaultChar);

        if(cb > 0 && !fDefaultChar)
        {
            // This code point is on the code page, so do something with it.
            cch++;
            wprintf(L"U+%04x\n", ucp);
        }
    }

    wprintf(L"\nA total of %d code points in the code page.", cch);
}

Obviously you can replace those wprintf calls with whatever you want to do with the characters on the code page, and replace the code page value with whatever ANSI or OEM code page you want to use....

 

This post brought to you by "U" (U+0055, a.k.a. LATIN CAPITAL LETTER U)


# Nick Lamb on Friday, January 20, 2006 6:34 AM:

Try inverting the loop? U+FFFF is not a character, but U+0000 NULL probably makes an appearance in a lot of character sets.

for(ucp = 0x0000; ucp < 0xffff; ++ucp)

# Michael S. Kaplan on Friday, January 20, 2006 10:17 AM:

Yes, that would work too (though NULL is always NULL on all of the ACPs and OEM CPs on Windows, so you can scratch the "a lot of" and replace with "all of"). :-)

But since the main goal was a sample showing how you would use WideCharToMultiByte to find the valid characters (a piece which is important even in non-contrived scenarios!) it is probably good enough for a sample.

My main interest in this case (since many computer languages do not support both prefix and postfix operators) was to avoid adding that particular confusion here. :-)

# Michael S. Kaplan on Friday, January 20, 2006 10:24 AM:

Ok, taken in as a friendly amendment in a comment, Nick. :-)

# Maurits on Friday, January 20, 2006 11:18 AM:

OK, so I can do the loop and get U+FFFF but not U+0000...
Or I can do the loop and get U+0000 but not U+FFFF...

Any way to get both? How about

ucp = 0;
do {
...
} while (++ucp);
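
A minimal sketch of that do/while idea in portable C, assuming a 16-bit unsigned counter (`uint16_t` stands in for `WCHAR` here, and the helper name `visit_all_u16` is made up for illustration). Because unsigned arithmetic wraps, the counter comes back to zero after 0xFFFF and the loop stops after visiting both endpoints:

```c
#include <stdint.h>

/* Hypothetical helper: walks every 16-bit value exactly once, reports the
   first and last values seen, and returns how many values were visited. */
unsigned long visit_all_u16(uint16_t *first, uint16_t *last)
{
    unsigned long visited = 0;
    uint16_t ucp = 0;

    *first = ucp;
    do {
        /* ... do the per-code-point work with ucp here ... */
        *last = ucp;
        ++visited;
        ucp = (uint16_t)(ucp + 1);  /* 0xFFFF + 1 wraps around to 0x0000 */
    } while (ucp != 0);

    return visited;
}
```

The test sits at the bottom of the loop, so the body runs for 0x0000 before the wrap check can end it.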

# Michael S. Kaplan on Friday, January 20, 2006 11:20 AM:

Well, since 0xFFFF is not a code point ever, and since 0x0000 is always one but never has anything to do with the language/script of the code page, it is more of an academic exercise in both cases, right? :-)

# Maurits on Friday, January 20, 2006 12:19 PM:

Yes, it's academic in this particular case.

But as a general question, it has practical value:
"How do you iterate over the entire range of values of an integer type?"

# Michael S. Kaplan on Friday, January 20, 2006 1:02 PM:

Actually, integer types are easier than the uint ones. :-)

# CK on Friday, January 20, 2006 1:04 PM:

Michael,

Thanks for the example. Your blogs are useful.

CK

# Maurits on Friday, January 20, 2006 2:11 PM:

Sure, you could do something like

for(i = 0; i >= 0; i++)

to get all the /non-negative/ values of a signed integer type. (Or start at 1 for just the positive)

But it's not so easy to get ALL the values (positive, zero, and negative) of a signed integer type... you have to do something crazy like

int i = INT_MIN;
do { /* stuff */ } while (++i != INT_MIN);

# Maurits on Friday, January 20, 2006 2:18 PM:

Or perhaps this is more readable:

int i = INT_MIN;
do { /* stuff */ } while (i++ != INT_MAX);

But still not perfect for systems that error on overflow... :(

# David on Friday, January 20, 2006 7:20 PM:

I believe you forgot about surrogate pairs, an unfortunate dark chapter in the UTF-16LE saga.

Surrogate pairs are the dirty hack in the spec to deal with the fact that 2 bytes just doesn't cut it. Surrogate pairs start with one byte-pair 0xd800-0xdbff followed by another from 0xdc00-0xdfff. I'm not sure what happens if you pass HALF of a surrogate pair to WideCharToMultiByte but my guess is you'll always get a bad result for values 0xd800-0xdfff.
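
For reference, the two ranges above can be sketched in portable C like this. The function names are made up for illustration, and returning 0xFFFFFFFF for an invalid pair is just a convention chosen for this sketch:

```c
#include <stdint.h>

/* High (leading) surrogates occupy 0xD800-0xDBFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trailing) surrogates occupy 0xDC00-0xDFFF. */
int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combines a valid high/low pair into a supplementary code point.
   Returns 0xFFFFFFFF (not a code point) when the pair is not valid. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    if (!is_high_surrogate(high) || !is_low_surrogate(low))
        return 0xFFFFFFFFu;

    /* Each half carries 10 bits; the result is offset by 0x10000. */
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}
```

Half of a pair on its own carries only 10 of the 20 bits, which is why a lone value in 0xd800-0xdfff cannot map to any character.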

# Michael S. Kaplan on Friday, January 20, 2006 7:29 PM:

Hi David,

I did not forget about them. But they do not apply to any Windows OEM or ANSI code page. Nothing bad happens other than the fact that you get no mapping on the code page.

Though you could optimize the code by skipping that range (since it would be skipped anyway) it is probably not worth it.... :-)

# bmm6o on Friday, January 20, 2006 7:31 PM:

Maurits:

To write that loop, I would probably index with a larger data type (e.g. int) and cast to the smaller one where necessary. The loop would then look more like every other loop.
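
A rough sketch of that approach, assuming the narrow type is `int16_t` and the wider index is `int32_t` (the helper name `sum_all_i16` is made up for illustration). The index can step one past INT16_MAX without overflowing, so the loop keeps the usual shape:

```c
#include <stdint.h>

/* Hypothetical helper: sums every int16_t value from INT16_MIN through
   INT16_MAX inclusive and stores how many values were visited. */
long sum_all_i16(unsigned long *visited)
{
    long sum = 0;
    int32_t i;  /* wider than int16_t, so reaching INT16_MAX + 1 is safe */

    *visited = 0;
    for (i = INT16_MIN; i <= INT16_MAX; ++i) {
        int16_t value = (int16_t)i;  /* always in range, so the cast is exact */
        sum += value;
        ++*visited;
    }
    return sum;
}
```

This only works while a wider type exists; for the widest integer type on a platform, a test at the bottom of the loop is still needed.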

# Nick Lamb on Saturday, January 21, 2006 10:03 AM:

"I'm not sure what happens if you pass HALF of a surrogate pair to WideCharToMultiByte but my guess is you'll always get a bad result for values 0xd800-0xdfff."

For e.g. UTF-8 you get illegal results from all shipping versions of Windows. As you'll see in the documentation Microsoft has tried (again) to fix this in Vista. I no longer have a Vista test system so I can't testify as to its correctness.

Michael rightly points out that for this particular application (getting the list of characters in a specific legacy code page) that particular bug doesn't do much harm. Of course plenty of other bugs have survived in this family of APIs for many years, so I also wouldn't altogether trust it...

Does Windows provide iconv() or an analogous streaming conversion API?

# Michael S. Kaplan on Saturday, January 21, 2006 1:23 PM:

Actually, the UTF-8 definition on Windows has been tightening up on every successive version of Windows, as the Unicode definition has tightened up. Illegal sequences will actually be dropped completely (and silently) by default and will error out if you ask for errors to stop the conversion via flag....

# Nick Lamb on Saturday, January 21, 2006 2:15 PM:

Michael, we're talking about WideCharToMultiByte not MultiByteToWideChar in this thread. The conversion from UTF-8 has been improving (though still not compliant so far as I can see) but in XP the conversion from UTF-16 to UTF-8 still accepts lone surrogates and outputs illegal UTF-8 sequences.

The reason for mentioning iconv() is that WideCharToMultiByte doesn't have any state. A naive programmer might easily take a UTF-16 disk file, and attempt to read it in a block at a time, passing the block to WideCharToMultiByte and sending the UTF-8 to a new disk file, over a network socket or whatever. The WCTMB API makes it extremely difficult to get this right but MSDN neither recommends an alternative nor warns of the danger.
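
To illustrate the kind of bookkeeping the caller ends up doing, here is a minimal sketch of one piece of it: deciding how much of a block is safe to convert now. It assumes the block already holds whole 16-bit code units, and the helper name `utf16_safe_prefix` is made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns how many code units at the start of
   buf[0..n) can be converted now without splitting a surrogate pair.
   If the block ends in a high surrogate, that unit is held back so it
   can be prepended to the next block. */
size_t utf16_safe_prefix(const uint16_t *buf, size_t n)
{
    if (n > 0 && buf[n - 1] >= 0xD800 && buf[n - 1] <= 0xDBFF)
        return n - 1;
    return n;
}
```

A caller would convert only the returned prefix and carry the held-back unit into the next read. That carried unit is exactly the state the API itself does not keep.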

# Michael S. Kaplan on Saturday, January 21, 2006 3:05 PM:

Yes, Nick -- but the fact is that both are improved from that in Server 2003 and even further in Vista. We improve, and defend users against bad data such as lone surrogates (which are actually a bug from whoever inserted the bad data, not us).

If a user is chunking text to WideCharToMultiByte they will have problems, though that is likely why both MLang and .NET encoding methods do have more stateful mechanisms than the low-level NLS API.

In the meantime, avoiding bad data is the best way at all times, so stay away from questionable data sources and you won't have any problems. :-)

# Michael S. Kaplan on Saturday, January 21, 2006 3:07 PM:

Also, note that Vista has added a WC_ERR_INVALID_CHARS flag for WideCharToMultiByte. We do keep getting better....

# Mihai on Monday, January 23, 2006 4:59 AM:

"But they do not apply to any Windows OEM or ANSI code page."

But the title is "Getting the characters in a code page," not "Getting the characters in an ANSI code page."

To really get a complete code page, you can try iterating from 0 to 0x10FFFF, taking care to use surrogates for everything above BMP.
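
A minimal sketch of the surrogate step, assuming a made-up helper `encode_utf16` that turns one code point into one or two UTF-16 code units. The resulting count could then be passed as the cchWideChar argument of WideCharToMultiByte in place of the fixed 1 used in the sample:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 form of cp into out and returns
   the number of code units used (1 or 2). Returns 0 when cp is a
   surrogate value or lies above 0x10FFFF, since neither can be encoded. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;

    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;  /* BMP: a single code unit */
        return 1;
    }

    cp -= 0x10000u;                             /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));  /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* low surrogate */
    return 2;
}
```

A loop from 0 to 0x10FFFF would simply skip any value for which this returns 0.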

# Michael S. Kaplan on Monday, January 23, 2006 10:05 AM:

To be honest, this is only needed in two cases (GB18030 and UTF-8), both of which cover the whole range and are thus not really needed. No other code page on Windows supports supplementary characters....

# asdf on Friday, March 17, 2006 3:24 AM:

It's really easy to do loops over an inclusive range without overflow:

for (bool go = true; go; (go = (f != l)) && (++f, true))
 stuff;

or

do {
  stuff;
} while ((f != l) && (++f, true));

or wrap it up in a macro:

#define inclusive_for(init, cond, inc) \
  if (bool ar3_d0n3_ = false) {} \
  else for (init; !ar3_d0n3_; (ar3_d0n3_ = !(bool)(cond)) || ((inc), false))

inclusive_for (int i = INT_MIN, i != INT_MAX, ++i)
  cout << i;

And yes the macro has to be written that way to protect against:

if (0) {
} else if (0)
  inclusive_for (uintmax_t i = 0, i != UINTMAX_MAX, ++i)
     cout << i;
else {
   // not what you expect, without the if else thing
}

referenced by

2006/04/22 Dial 911, code page 864 isn't breathing
