Character info from beyond the Grave

by Michael S. Kaplan, published on 2007/10/16 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/16/5467395.aspx


Today I am going to try to clear out some Contact link questions.... 

The Contact link question was:

Heck, the subject could even be "isalnum non-ascii characters"

I was looking at this page:

http://msdn2.microsoft.com/en-us/library/t9zea13t(VS.71).aspx

which tells me that isalnum should be faster than my macro that tests character ranges (c >= 'a' && c <= 'z').

However, I need to allow non-ascii characters, such as Western European, and when I did this:

foo = isalnum('À');

I got a run-time violation while debugging my app.

So, why does isalnum expect an int, rather than unsigned int? If I want to use this function, do I need to go change all my char variables to unsigned char?

And: is there any hope for Unicode? I'm trying to add an auto-completion: type a few letters of a name and the program looks through your address book for a match. I assume this would be useful for Unicode users, also.

Thanks.

If you want to get answers outside of that lowest character range, then the answer is the Unicode version of the function, iswalnum.

This function (and the rest of the Unicode "isw" functions) takes a wint_t value, which per that topic:

The wint_t data type is defined in WCHAR.H as an unsigned short; it can hold any wide character or the wide-character end-of-file (WEOF) value.
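A minimal sketch of both points, assuming a compiler where plain char is signed (as it is by default in Visual C++). The helper names ctype_arg, is_alnum_byte and is_alnum_wide are made up for illustration; only isalnum and iswalnum are real CRT functions. What either call returns for 'À' depends on the current locale, so the sketch does not promise a result for it.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: turn a plain char into a value that is safe to hand
   to the <ctype.h> functions. Where char is signed, a byte such as 0xC0
   would otherwise be promoted to a negative int. */
int ctype_arg(char c)
{
    int v = (unsigned char)c;
    assert(v >= 0 && v <= UCHAR_MAX);   /* the range the is* functions accept */
    return v;
}

/* Hypothetical helper: byte-oriented test. For bytes above 0x7F the answer
   depends on the current locale / code page. */
int is_alnum_byte(char c)
{
    return isalnum(ctype_arg(c)) != 0;
}

/* Hypothetical helper: wide-character test via the "isw" family. */
int is_alnum_wide(wchar_t wc)
{
    return iswalnum((wint_t)wc) != 0;
}
```

The cast to unsigned char at the call site is enough; the char variables themselves do not have to change type.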

These CRT functions are all wrappers around GetStringTypeW, which you can call directly to get character type information....
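A direct call might look roughly like the sketch below. The helper name is_alnum_utf16_unit is hypothetical, and the choice of C1_ALPHA | C1_DIGIT as "alphanumeric" is my assumption about a reasonable mapping, not a statement of what the CRT does internally. The non-Windows branch exists only so the sketch builds elsewhere.

```c
#include <wchar.h>
#include <wctype.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper, not part of the CRT or Win32: classify one UTF-16
   code unit as alphanumeric. */
int is_alnum_utf16_unit(wchar_t ch)
{
#ifdef _WIN32
    WORD type = 0;
    /* CT_CTYPE1 requests the C1_* classification bits for each character. */
    if (!GetStringTypeW(CT_CTYPE1, &ch, 1, &type))
        return 0;                       /* treat failure as "not alphanumeric" */
    return (type & (C1_ALPHA | C1_DIGIT)) != 0;
#else
    /* Assumption: off Windows, fall back to the CRT so the sketch still builds. */
    return iswalnum((wint_t)ch) != 0;
#endif
}
```

GetStringTypeW accepts a whole string and an array of WORDs, so for the auto-completion case it would be cheaper to classify the typed prefix in one call rather than one character at a time.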

 

This post brought to you by À (U+00c0, a.k.a. LATIN CAPITAL LETTER A WITH GRAVE)


Geoffrey Coram on 9 Nov 2007 2:15 PM:

Thanks for the response.  On the page you cited, I read:

When used with a debug CRT library, isalnum will display a CRT assert if passed a parameter that is not EOF or in the range of 0 through 0xFF.

The sentence says it's referring to isalnum (not iswalnum), so I think I can be excused for not expecting a problem for 8-bit characters.  In fact, if the prototype had been isalnum(unsigned int), I would have been fine.  But because the prototype is isalnum(int), the call isalnum('À') must be sign-extending the 8-bit number to a 16- or 32-bit (negative) number.

I assume "iswalnum" also handles double-byte characters, since short must be at least 16 bits.  But doesn't Unicode also have 3- and 4-byte characters?  In which case I'd need a different function.

Michael S. Kaplan on 9 Nov 2007 2:34 PM:

This function will not do the trick, even in its Unicode version -- and none of these functions is aware of four-byte characters (there are no three-byte characters in UTF-16).
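To make the four-byte case concrete: in UTF-16 such a character is a surrogate pair, two 16-bit code units, and a function that takes a single 16-bit value only ever sees one half. A small sketch of how the halves combine, with hypothetical helper names (nothing here is a CRT or Win32 function):

```c
#include <assert.h>
#include <stdint.h>

/* High (lead) surrogates occupy U+D800..U+DBFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trail) surrogates occupy U+DC00..U+DFFF. */
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into the code point it encodes.
   Each half carries 10 bits; the result lies in U+10000..U+10FFFF. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u
         + (((uint32_t)hi - 0xD800u) << 10)
         + ((uint32_t)lo - 0xDC00u);
}
```

For example, the pair 0xD835 0xDC00 combines to U+1D400 (MATHEMATICAL BOLD CAPITAL A), a letter that neither half can be recognized as on its own.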

