One disadvantage to being supplementary...or Japanese?

by Michael S. Kaplan, published on 2011/11/21 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/11/21/10239128.aspx

Meanwhile, over on stackoverflow, there was a recent thread, started by DeadMG:

I'm looking at the IsCharAlphaNumeric Windows API function. As it only takes a single TCHAR, it obviously can't make any decisions about surrogate pairs for UTF16 content. Does that mean that there are no alphanumeric characters that are surrogate pairs?

Unfortunately, neither the IsChar* functions in USER32.DLL nor the NLS GetStringTypeW function underneath them can handle supplementary characters. There is no Win32 way to get the info.
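To make the question concrete, here is a minimal Python sketch (not Win32 code) showing that alphanumeric supplementary characters do exist, and that each one occupies two UTF-16 code units which, looked at one at a time, are only surrogates. The helper name `utf16_code_units` and the two sample characters are chosen for illustration, and the categories come from whatever Unicode data ships with the Python build in use.

```python
import unicodedata


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# Two supplementary characters, picked as examples.
deseret = "\U00010400"    # DESERET CAPITAL LETTER LONG I
bold_zero = "\U0001D7CE"  # MATHEMATICAL BOLD DIGIT ZERO

for ch in (deseret, bold_zero):
    units = utf16_code_units(ch)
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),                      # category of the whole character
        [hex(u) for u in units],                       # its two UTF-16 code units
        [unicodedata.category(chr(u)) for u in units]  # category of each unit alone
    )
```

An API that receives only one of those code units at a time has no way of seeing the letter or digit behind it.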

You can use managed code, and the CharUnicodeInfo class I first mentioned in A little bit about the new CharUnicodeInfo class:

Note that every one of these methods has two overloads -- one that accepts a single System.Char, and the other which takes a System.String and an index value. The latter case is for dealing with supplementary characters, which are made up of a high and low surrogate (also known as a surrogate pair).

Unfortunately, even functions like GetStringTypeW, which take whole strings and could in theory return info about surrogate pairs, don't handle them.
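The string-plus-index idea can be sketched outside .NET as well. The Python below is an assumed analog, not the CharUnicodeInfo implementation: `category_at` is a hypothetical helper that takes a sequence of UTF-16 code units and an index, and combines a high and low surrogate into one code point before looking up the general category.

```python
import unicodedata


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def category_at(units, index):
    """General category of the character that starts at units[index].

    units is a sequence of UTF-16 code units (ints). Hypothetical helper,
    written only to illustrate the string-and-index approach.
    """
    unit = units[index]
    if (is_high_surrogate(unit)
            and index + 1 < len(units)
            and is_low_surrogate(units[index + 1])):
        # Combine the pair into a single supplementary code point.
        code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[index + 1] - 0xDC00)
        return unicodedata.category(chr(code_point))
    # BMP character, or an unpaired surrogate classified as itself.
    return unicodedata.category(chr(unit))


units = [0x0041, 0xD801, 0xDC00]  # "A" followed by U+10400
print(category_at(units, 0), category_at(units, 1))
```

The index is what makes this work: the caller points at the high surrogate, and the function is free to look one unit ahead.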

Back in 2005 I wrote a speclet (what people in Windows today would call a "one pager"), that did two things:

Add a new CT_CTYPE4 character type that was literally based on Unicode general category (rather than the crazy Perl script used to define the current CT_CTYPE* values based on Unicode), and
Define a special flag to change the current WCHAR (UCS-2) based processing to one that would properly recognize surrogate pairs (UTF-16) to return properties based on supplementary characters when they were there.
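As a rough illustration of those two points (the speclet's actual design is not reproduced here, so every name and behavior below is an assumption), a whole-string pass might return one general category per UTF-16 code unit, with a flag that switches from per-unit (UCS-2) processing to pair-aware (UTF-16) processing. `SURROGATE_AWARE` and `string_categories` are hypothetical, and giving both halves of a pair the same category is just one possible convention.

```python
import unicodedata

SURROGATE_AWARE = 0x1  # hypothetical flag, not a real Win32 constant


def string_categories(units, flags=0):
    """Return one Unicode general category per UTF-16 code unit.

    Without SURROGATE_AWARE every unit is classified alone (UCS-2 style).
    With it, a valid surrogate pair is classified as one supplementary
    character and both of its units receive that category (an assumption).
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            flags & SURROGATE_AWARE
            and 0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            category = unicodedata.category(chr(code_point))
            result.extend([category, category])
            i += 2
        else:
            result.append(unicodedata.category(chr(unit)))
            i += 1
    return result


sample = [0x0041, 0xD835, 0xDFCE]  # "A" followed by U+1D7CE
print(string_categories(sample))
print(string_categories(sample, SURROGATE_AWARE))
```

Keeping the output array the same length as the input mirrors the way GetStringTypeW fills one entry per input character, so existing callers would not need to resize anything.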

I even had a prototyped version of this change, which wasn't actually accepted for Longhorn/Vista and wasn't picked up for Windows 7.

In fact, it isn't in Windows 8, either....

Win32 simply refuses to see beyond the BMP.

Raymond Chen asked me a somewhat related question that occurred to him when he was thinking about all of this:

Why does IsCharAlphanumeric check for C3_KATAKANA|C3_HIRAGANA and explicitly exclude them? In other words is it

* Katakana and Hiragana characters are genuinely alphabetic, but IsCharAlphanumeric wants to reject them because <obscure reason>, -or-
* Katakana and Hiragana characters are not genuinely alphabetic, but for <obscure reason>, they are reported as C1_ALPHA, so IsCharAlphanumeric needs to filter them out.

From looking at http://www.fileformat.info/info/unicode/char/30d8/index.htm it appears that Katakana and Hiragana (or at least character 30d8) are considered Letters by Unicode. I.e., we are in case 1 above. So what are the <obscure reasons> that IsCharAlphanumeric wants to reject them?
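The Unicode side of that question is easy to check without Windows. This small Python sketch (the helper name `is_unicode_letter` is made up for the example, and the result depends on the Unicode data bundled with the Python build) looks up the character cited above and its Hiragana counterpart:

```python
import unicodedata


def is_unicode_letter(ch):
    """True when the general category of ch is one of the Letter (L*) categories."""
    return unicodedata.category(ch).startswith("L")


for ch in ("\u30d8", "\u3078"):  # the Katakana character cited above, and Hiragana HE
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch),
          is_unicode_letter(ch))
```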

This weird "Kana isn't alphabetic" approach is something I previously talked about in IsCharSomethingOrOther? and Is Kana 'alphabetic' ? Depends on who you ask....

Summary: there is no good reason; just some random person who was burned taking Japanese lessons nearly two decades ago and decided to take it out on everyone else....

Maybe he flunked out of the class or something.

Luckily, GetStringTypeW gives the right answer here, for Japanese at least. Though not so much for supplementary characters (including the 200-odd Extension B ideographs in JIS X 0213).

Looking ahead to Windows 8 and modern, the CharUnicodeInfo class should be available in the sandbox, which covers the future, at least.

But for native code, we really don't give a solution.

CT_CTYPE4 FTW? :-)

Brendan Elliott on 25 Nov 2011 9:41 AM:

Yeah, it makes no sense linguistically for Hiragana & Katakana to be excluded by IsCharAlphanumeric...

Michael S. Kaplan on 25 Nov 2011 2:26 PM:

I don't suppose your team would be interested in picking up my speclet, Brendan? :-)
