Is CharNextExA broken?

by Michael S. Kaplan, published on 2007/04/19 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/19/2190207.aspx


Jochen Kalmbach asks over in the Suggestion Box:

Hi Michael!

Short question: Is "CharNextExA" broken in XP (or generally broken)?

It does not recognize UTF8...

Here is a small example:

#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <string.h>

#pragma comment(lib, "User32.lib")

size_t StrLenCP(WORD codepage, LPCSTR szString) {
    if (szString == NULL) return 0;

    size_t res = 0;
    LPCSTR p = szString;

    // Stop at the terminating null so it is not counted as a character.
    while (*p != '\0') {
        p = CharNextExA(codepage, p, 0);
        res++;
    }

    return res;
}

int _tmain() {
    // "I{heart}NY"
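    // Cast the bytes above 0x7F so the initializer does not narrow.
    char str[] = {0x49, (char)0xE2, (char)0x99, (char)0xA5, 0x4E, 0x59, 0x00};
    wchar_t szUnicode[20];

    MultiByteToWideChar(CP_UTF8, 0, str, -1, szUnicode, 20);
    // wcslen and StrLenCP return size_t, so cast to match the format.
    printf("Characters (UTF16): %u\n", (unsigned)wcslen(szUnicode));
    printf("Characters (UTF8) : %u\n", (unsigned)StrLenCP(CP_UTF8, str));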
}

Greetings
Jochen

I could have answered this one without the code sample. :-)

Neither CharNextExA nor CharPrevExA is broken in any version of Windows, but neither one was designed with UTF-8 in mind.

Remember how I talked, in an earlier post, about the way that even though NLS did not own some of these USER functions, we pretty much "owned" them since we control their behavior?

Well, this is one of those functions.

It is completely dependent on the behavior of IsDBCSLeadByteEx, which is an NLS function that (for obvious reasons) deals only with East Asian DBCS code pages.

There is code in IsDBCSLeadByteEx related to UTF-8 -- but that bit of code simply returns FALSE, always.

So the function is behaving as it was designed back when it was ported from Windows 95, and it was only ever designed to handle a specific set of code pages that pre-date support of UTF-8 in Windows.
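To see why the sample above ends up counting bytes rather than characters, here is a minimal sketch of DBCS-style stepping. It is a model only, not the actual USER or NLS source: the names IsLeadByteUtf8Model, CharNextModel and CountWithModel are hypothetical, and the only behavior taken from this post is that the lead-byte check always says FALSE for UTF-8.

```cpp
#include <cstddef>

// Hypothetical stand-in for IsDBCSLeadByteEx on the UTF-8 code page:
// per the post, that path always answers "not a lead byte".
constexpr bool IsLeadByteUtf8Model(unsigned char) { return false; }

// Hypothetical model of DBCS-style stepping: a lead byte with a byte
// after it advances two bytes; anything else advances one.
constexpr const char* CharNextModel(const char* p) {
    if (*p == '\0') return p;  // stay put at the terminator
    if (IsLeadByteUtf8Model(static_cast<unsigned char>(p[0])) && p[1] != '\0')
        return p + 2;
    return p + 1;
}

// Counts the steps needed to reach the terminating null.
constexpr std::size_t CountWithModel(const char* s) {
    std::size_t n = 0;
    for (const char* p = s; *p != '\0'; p = CharNextModel(p)) ++n;
    return n;
}

// "I{heart}NY" in UTF-8 is six bytes, and the model takes six steps.
static_assert(CountWithModel("I\xE2\x99\xA5NY") == 6, "one step per byte");
```

With a predicate that never fires, every byte is its own "character", which matches what Jochen is seeing.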

Now the big question -- would it make sense to add this support? Just like we added C3_HIGHSURROGATE and C3_LOWSURROGATE to the Vista GetStringTypeW function?

And the answer is simple -- the NLS API could. But what this function should mean when a character can contain up to four bytes is unclear -- it almost begs for a new function to properly support the notion of these four-byte characters.

Kind of a "lunch interview" question -- what would be required of NLS to extend support of CharNextExA/CharPrevExA in the next version of Windows?

 

This post brought to you by ꆷ (U+a1b7, a.k.a. YI SYLLABLE LIT)


# Michiel on 20 Apr 2007 7:13 AM:

First rough solution:

1. IsDBCSLeadByteEx is documented as returning non-zero if it's a lead byte. Make it return the number of trail bytes that follow a lead byte. (By definition 1 for DBCS.)

2. Update its documentation to say "A lead byte is the first byte of a character sequence in a double-byte character set (DBCS) or multibyte character set (MBCS) for the code page."

3. CharNextExA uses the number of trail bytes returned by IsDBCSLeadByteEx.

This approach takes advantage of the fact that UTF-8 was designed explicitly to determine the number of bytes in a character sequence from its lead byte.
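The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not a proposal for the real API: Utf8TrailCount, Utf8Next and Utf8Count are hypothetical names, and the sketch does not reject overlong or out-of-range sequences.

```cpp
#include <cstddef>

// Hypothetical helper: trail bytes announced by a UTF-8 lead byte.
// Returns 0 for ASCII, for continuation bytes (10xxxxxx), and for
// bytes that cannot start a sequence. No further validation is done.
constexpr int Utf8TrailCount(unsigned char b) {
    if ((b & 0xE0) == 0xC0) return 1;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 2;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 3;  // 11110xxx
    return 0;
}

// Hypothetical CharNext-style step driven by the trail count. It stops
// early at the terminator or at any non-continuation byte, so a
// truncated sequence cannot push the pointer past the end.
constexpr const char* Utf8Next(const char* p) {
    if (*p == '\0') return p;
    int trail = Utf8TrailCount(static_cast<unsigned char>(*p));
    ++p;
    while (trail-- > 0 && (static_cast<unsigned char>(*p) & 0xC0) == 0x80) ++p;
    return p;
}

// Counts the steps needed to reach the terminating null.
constexpr std::size_t Utf8Count(const char* s) {
    std::size_t n = 0;
    for (const char* p = s; *p != '\0'; p = Utf8Next(p)) ++n;
    return n;
}

static_assert(Utf8TrailCount(0xE2) == 2, "lead byte of U+2665");
static_assert(Utf8Count("I\xE2\x99\xA5NY") == 4, "I, heart, N, Y");
```

The continuation-byte check inside the loop is a design choice of this sketch: trusting the lead byte alone would step over the terminator on malformed input.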

CharPrevExA is harder, as IsDBCSLeadByteEx has to return 0 for both single-byte characters and non-lead bytes. And you can't decrement twice without risking a buffer underrun (the caller must obviously make sure there is one preceding character, but not two). Hence, only one byte is available. If that's a single-byte character, we're done; otherwise we search back until we find the first lead byte. The remaining problem here is determining single-byte characters.
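For UTF-8 specifically, the backward search is easier than for DBCS, because continuation bytes always have the form 10xxxxxx and single-byte characters always have the form 0xxxxxxx. A minimal sketch, assuming the caller passes the start of the buffer the way CharPrevExA takes its lpStart parameter (Utf8Prev and kSample are hypothetical names, and nothing here validates the sequence it steps over):

```cpp
// Hypothetical CharPrev-style step for UTF-8. `start` is the beginning
// of the buffer and `current` points at or after it.
constexpr const char* Utf8Prev(const char* start, const char* current) {
    if (current <= start) return start;  // nothing precedes the start
    const char* p = current - 1;
    // Skip back over continuation bytes (10xxxxxx), at most three of
    // them, and never past the start of the buffer.
    int limit = 3;
    while (p > start && limit-- > 0 &&
           (static_cast<unsigned char>(*p) & 0xC0) == 0x80)
        --p;
    return p;
}

// "I{heart}NY" in UTF-8: 'I' at 0, the heart at 1..3, 'N' at 4, 'Y' at 5.
constexpr char kSample[] = "I\xE2\x99\xA5NY";

static_assert(Utf8Prev(kSample, kSample + 4) == kSample + 1, "N back to heart");
static_assert(Utf8Prev(kSample, kSample + 1) == kSample, "heart back to I");
```

This does not carry over to DBCS code pages, where a trail byte can look exactly like a single-byte character, which is the problem the comment above ends on.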

