Is Kana 'alphabetic'? Depends on who you ask....

by Michael S. Kaplan, published on 2005/09/12 10:20 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/12/463991.aspx

In the microsoft.public.win32.programmer.international newsgroup, Christian Kaiser asked:

Given the appended small program, I can test whether a Unicode character is AlphaNumerical or not.

If I call it using Half-Width Katakana (arg "0xff66" for example), IsCharAlphaNumericW() returns '0' - but according to the Unicode specs (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), this is a letter:

FF66;HALFWIDTH KATAKANA LETTER WO;Lo;0;L;<narrow> 30F2;;;;N;;;;;

Do I make a mistake, or are the internal Unicode chartype pages in Windows wrong? We have a very important customer who has problems because of this...

System: Windows XP SP2, newest patches applied

BTW: GetStringType() does return the correct information:

IsCharAlphaNumeric(0xff66) -> 0
GetStringTypeW(1) -> 0x300
GetStringTypeW(3) -> 0x8050

Strange? I think yes.

Christian

----

#include <windows.h>
#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>   /* strtol */

int main(int argc, char* argv[])
{
    WCHAR n;
    WORD n1 = 0;

    if (argc < 2)
        return 1;
    /* Base 0 lets strtol accept "0xff66" as well as decimal input. */
    n = (WCHAR)strtol(argv[1], NULL, 0);

    printf("IsCharAlphaNumeric(0x%04x) -> %d\n", n, IsCharAlphaNumericW(n));
    if (GetStringTypeW(CT_CTYPE1, &n, 1, &n1))
        printf("GetStringTypeW(1) -> 0x%x\n", n1);
    if (GetStringTypeW(CT_CTYPE3, &n, 1, &n1))
        printf("GetStringTypeW(3) -> 0x%x\n", n1);
    return 0;
}

Christian is right about the difference between the NLS function and IsCharAlphaNumeric.

According to the NLS function:

CT_CTYPE1 to GetStringTypeW returns 0x0300 (C1_ALPHA | C1_DEFINED).
CT_CTYPE3 to GetStringTypeW returns 0x8050 (C3_KATAKANA | C3_HALFWIDTH | C3_ALPHA).

So according to NLS, this character is a halfwidth katakana character, and it is alphabetic.

However, the logic in IsCharAlphaNumeric explicitly checks that the character is either C1_ALPHA or C3_ALPHA and is neither C3_KATAKANA nor C3_HIRAGANA. So clearly, according to user32.dll, neither Hiragana nor Katakana is alphabetic.
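
As a rough sketch only (this is not the actual user32.dll source), the alphabetic part of that check could look something like the following. The helper name looks_alpha_to_user32 is hypothetical, and the fallback constants are there only so the snippet builds off Windows; they are assumed to mirror the values in winnls.h.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Fallback for non-Windows builds; values assumed to match winnls.h. */
typedef unsigned short WORD;
#define C1_ALPHA    0x0100
#define C3_KATAKANA 0x0010
#define C3_HIRAGANA 0x0020
#define C3_ALPHA    0x8000
#endif

/* Hypothetical helper: takes the CT_CTYPE1 and CT_CTYPE3 flags for one
   character and returns 1 only if it is marked alphabetic and is not kana. */
static int looks_alpha_to_user32(WORD c1, WORD c3)
{
    int alpha = (c1 & C1_ALPHA) != 0 || (c3 & C3_ALPHA) != 0;
    int kana  = (c3 & (C3_KATAKANA | C3_HIRAGANA)) != 0;
    return alpha && !kana;
}
```

Feeding it the values reported for U+FF66 (0x0300 and 0x8050) gives 0, because the C3_KATAKANA bit vetoes the two alpha bits.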

Now whether Christian is correct about Unicode's take on the situation is a little less clear -- a general category of Lo (Letter, Other) does not necessarily mean Alphabetic. There is no specific rule as to the meaning of general category vis-à-vis a character being alphabetic or not, although Mark Davis of Unicode and others are trying to write up guidelines to map Unicode character data to POSIX style categorizations like Alphabetic for implementations.

So the answer to the question is that it depends on who you ask. Perhaps the best answer is to call GetStringType yourself and decide rather than using the user32.dll Is* function wrappers. Because it seems like every time someone tries to wrap our functions to make it easier, something becomes more complicated....
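
For anyone who wants to make that decision themselves, here is a minimal sketch of one possible policy: treat a character as alphanumeric whenever the NLS flags mark it as alpha or digit, without excluding kana. The names flags_are_alnum and is_alnum_nls are hypothetical, the policy is just one assumption about what a caller might want, and the fallback constants (assumed to mirror winnls.h) exist only so the flag test builds off Windows.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Fallback for non-Windows builds; values assumed to match winnls.h. */
typedef unsigned short WORD;
#define C1_DIGIT 0x0004
#define C1_ALPHA 0x0100
#define C3_ALPHA 0x8000
#endif

/* Hypothetical policy: alpha or digit per NLS counts; kana is not excluded. */
static int flags_are_alnum(WORD c1, WORD c3)
{
    return (c1 & (C1_ALPHA | C1_DIGIT)) != 0 || (c3 & C3_ALPHA) != 0;
}

#ifdef _WIN32
/* Windows-only wrapper: fetch both flag sets for one UTF-16 code unit. */
static int is_alnum_nls(WCHAR ch)
{
    WORD c1 = 0, c3 = 0;
    if (!GetStringTypeW(CT_CTYPE1, &ch, 1, &c1))
        return 0;
    if (!GetStringTypeW(CT_CTYPE3, &ch, 1, &c3))
        return 0;
    return flags_are_alnum(c1, c3);
}
#endif
```

With the values reported for U+FF66 (0x0300 and 0x8050), flags_are_alnum returns 1. Keeping the flag test separate from the API call makes it easy to swap in a stricter or looser rule later.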

# Nicholas Allen on 12 Sep 2005 10:57 AM:

That certainly seems like an unhelpful stretching of the meanings of 'alphabet' and 'letter'. New words were created to distinguish the mechanics of various written languages. Now, the same term is being used with all languages. This defeats the original purpose of categorizing things.

# Michael S. Kaplan on 12 Sep 2005 11:24 AM:

Hi Nicholas -- Which part does not seem helpful, exactly?

# Nicholas Allen on 12 Sep 2005 2:06 PM:

I guess I'm just frustrated by the solution to this problem.

We have a relatively new standard that divides symbols up into various categories. We have old standards that divide symbols up among different sets of categories. Each is derived from commonly shared linguistic understanding.

Obviously, things are not going to work out perfectly when it comes to relating the old and new standards. However, instead of bending the standards to fit, people are trying to bend the definitions of the original words.

Now, because of that, I don't know what you mean when you say 'letter' or 'alphabet'. That makes it difficult to hold a discussion.

Worse, by settling for having multiple interpretations, I can't actually trust software libraries to do anything for me. Having to decide everything myself is a waste of my time and creates yet another interpretation that people will have to deal with.

# Michael S. Kaplan on 12 Sep 2005 2:35 PM:

The NLS definition gives you the exact info on what it is. And if you call anything in Japanese the "letters", you call the Kana that. So I would call them the alphabet for Japanese.

Note that the main thing I am doing is pointing out the more limited scope in the user32 wrapper. :-)


