Depending on when/where/who you ask, that character may not be your [c]type

by Michael S. Kaplan, published on 2007/09/18 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/18/4971899.aspx


The customer question was:

The whole story is about saving in XML format foreign symbols – from time-to-time it fails for some of them.
We found a way to bypass it by filtering them out with isprintable function.
The problem was solved, but some of good symbols are gone…

Japanese customers complaining that we are filtering out some of their characters.
They sent us some string containing such chars.

I've prepared short test to see what happens.
Actually second, third and some other character is recognized as not printable, but they – Japanese – say it's perfectly OK…

The sample string in question:

ﾎﾞｰﾘﾝｸﾞ工具

Let's look at the GetStringTypeW CT_CTYPE3 values for each character.

In Vista (where the bug does not repro), the values are:

In XP/Server 2003, U+ff9e (HALFWIDTH KATAKANA VOICED SOUND MARK) and U+ff70 (HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK) did not have the C3_ALPHA flag set and were therefore not considered alphabetic by the CRT function.
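To check which flags a particular system assigns, one could query CT_CTYPE3 for the two characters directly. The following is only a minimal sketch: the helper names `has_c3_alpha` and `query_ctype3` are hypothetical, the `GetStringTypeW` call is compiled only on Windows, and the `C3_ALPHA` value is copied from winnls.h as an assumption so the flag test also builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Assumption: value mirrors winnls.h so the pure helper builds off Windows. */
#define C3_ALPHA 0x8000
#endif

/* Hypothetical helper: does a CT_CTYPE3 flag word carry C3_ALPHA? */
int has_c3_alpha(unsigned short ctype3)
{
    return (ctype3 & C3_ALPHA) != 0;
}

#ifdef _WIN32
/* Hypothetical helper: query CT_CTYPE3 for one UTF-16 code unit.
   Returns 1 on success and stores the flag word in *out. */
int query_ctype3(wchar_t ch, unsigned short *out)
{
    WORD flags = 0;

    assert(out != NULL);
    if (!GetStringTypeW(CT_CTYPE3, &ch, 1, &flags))
        return 0;
    *out = flags;
    return 1;
}
#endif
```

On Windows, calling `query_ctype3(0xFF9E, &flags)` and then `has_c3_alpha(flags)` on both an XP and a Vista machine should make the difference described above visible.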

So the problem will not happen in Vista, and it will not happen (if memory serves) in Win2000 and earlier....
 
Though to be honest, I wonder why the CRT would require a character to be a letter before considering it printable; it seems strange (being a diacritic should be enough to make something printable, if you ask me). But after feedback about what the change was actually doing, the NLS change was essentially reverted (in the process of adding all the Unicode 5.0 characters)....

The underlying issue? Well, since NLS made a change which NLS essentially reverted, I guess you can blame NLS. Though I honestly prefer to think of it as a misguided attempt to be more properly descriptive of Unicode properties in the [boneheaded] NLS character property descriptions, which later caved to the realities of the [equally boneheaded] C runtime character categories. :-)

 

This post brought to you by ﾞ (U+ff9e, a.k.a. HALFWIDTH KATAKANA VOICED SOUND MARK)


# jmdesp on 18 Sep 2007 3:57 PM:

So they should call GetStringType directly instead of isPrintable, and if the letter is C3_DIACRITIC then let it go through even if it's not C3_ALPHA

Or they convince their client to stop using the stupid, retro-compatibility only, half-width forms and they will have no problem with U+30DC KATAKANA LETTER BO or U+30B0 KATAKANA LETTER GU.

# Michael S. Kaplan on 18 Sep 2007 4:27 PM:

Yep, either will work here. :-)
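The first suggestion amounts to a flag test on the CT_CTYPE3 value. Here is a rough sketch of it, not a drop-in replacement for a printable check: it only accepts letters and diacritics, so digits, punctuation and the like would need their own handling. The function names are hypothetical, and the flag values are copied from winnls.h as an assumption so the sketch builds without the Windows SDK. The caller is assumed to have already filled the `ctype3` array, for example with `GetStringTypeW`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: values mirror winnls.h; defined here only if the SDK
   headers have not already provided them. */
#ifndef C3_ALPHA
#define C3_DIACRITIC 0x0002
#define C3_ALPHA     0x8000
#endif

/* Hypothetical helper: let a character through if it is flagged as a
   letter OR as a diacritic. */
int passes_filter(unsigned short ctype3)
{
    return (ctype3 & (C3_ALPHA | C3_DIACRITIC)) != 0;
}

/* Hypothetical helper: copy the UTF-16 code units whose flag word passes
   the test into out (which must have room for count units).
   Returns the number of units kept. */
size_t filter_by_ctype3(const unsigned short *text,
                        const unsigned short *ctype3,
                        size_t count,
                        unsigned short *out)
{
    size_t i, kept = 0;

    assert(text != NULL && ctype3 != NULL && out != NULL);
    for (i = 0; i < count; i++) {
        if (passes_filter(ctype3[i]))
            out[kept++] = text[i];
    }
    return kept;
}
```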

# Mihai on 19 Sep 2007 12:54 PM:

I would say that they have to figure out why "from time-to-time it fails for some of them."

That is the way to address the cause, not the symptom.

# Michael S. Kaplan on 19 Sep 2007 1:01 PM:

The return value of isPrintable does not change on one machine, but it does change between platform versions... that was the source of the variation....

# Thomas Yeung on 9 Dec 2008 9:04 AM:

Michael,

Do you mean XP has a known issue with GetStringTypeW whereas pre-XP and Vista don't? Where could I find the "official" information on this issue?

What's the workaround to handle those missing characters? Is the following snippet OK?

if ( GetStringTypeExW( LOCALE_NEUTRAL, CT_CTYPE3, &nChar, 1, &wCharType ) )

     return ( wCharType & ( C3_ALPHA | C3_DIACRITIC ) ) != 0;

I found that the C++ APIs _istalpha and _istalnum in VS 2008 on XP can't tell that some Japanese characters are characters. I remember that somebody from the Unicode Consortium said some ISVs may misclassify some characters as non-characters by following the recommendations of ISO TR 10176 (which excludes combining marks based on its own theory of what should be included in identifiers). Are _istalpha(c) and _istalnum(c) unlucky enough to be two of them?

# Michael S. Kaplan on 9 Dec 2008 9:34 AM:

No, the CRT issues are completely due to the GetStringTypeW one. It is a problem I have talked about many times before, and it can be directly observed (as you have pointed out), so I'm not sure what the exact benefit of being any more "official" might be in this case. Maybe you can find a KB article somewhere that will talk about it?

And NEVER use GetStringTypeEx, always use GetStringTypeW. I have talked about this before, too.... :-)

# Thomas Yeung on 11 Dec 2008 5:17 AM:

Michael,

Thank you for pointing out some issues you've talked about many times. :-)

How should I deal with characters that GetStringTypeW does not report as C*_ALPHA? Should I hard-code them as special cases and just let them pass through?

For example,  under my English version of XP, a test of 0x30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) got the following result:

 GetStringTypeW(1) -> 0x0200 (C1_DEFINED)

 GetStringTypeW(3) -> 0x0032 (C3_HIRAGANA | C3_KATAKANA | C3_DIACRITIC)

Neither CType 1 nor 3 returns a C*_ALPHA, but I found XP accepts this character in file names.
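For readers who want to check such a result by hand, a flag word like 0x0032 can be decoded bit by bit. The sketch below is only an illustration: `describe_ctype3` is a hypothetical helper, and the names and values in the table are copied from winnls.h as an assumption so that it builds without the Windows SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: names and values copied from winnls.h. */
struct flag_name { unsigned short bit; const char *name; };

static const struct flag_name ctype3_names[] = {
    { 0x0001, "C3_NONSPACING" },
    { 0x0002, "C3_DIACRITIC" },
    { 0x0004, "C3_VOWELMARK" },
    { 0x0008, "C3_SYMBOL" },
    { 0x0010, "C3_KATAKANA" },
    { 0x0020, "C3_HIRAGANA" },
    { 0x0040, "C3_HALFWIDTH" },
    { 0x0080, "C3_FULLWIDTH" },
    { 0x0100, "C3_IDEOGRAPH" },
    { 0x8000, "C3_ALPHA" },
};

/* Hypothetical helper: write the names of the set CT_CTYPE3 bits into buf,
   lowest bit first, separated by " | ".
   Returns 0 on success, -1 if buf is too small. */
int describe_ctype3(unsigned short flags, char *buf, size_t buflen)
{
    size_t used = 0;
    size_t i;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof ctype3_names / sizeof ctype3_names[0]; i++) {
        const char *sep;
        size_t need;

        if (!(flags & ctype3_names[i].bit))
            continue;
        sep = used ? " | " : "";
        need = strlen(sep) + strlen(ctype3_names[i].name);
        if (used + need >= buflen)   /* leave room for the terminator */
            return -1;
        strcpy(buf + used, sep);
        strcpy(buf + used + strlen(sep), ctype3_names[i].name);
        used += need;
    }
    return 0;
}
```

Passing 0x0032 yields the diacritic, katakana and hiragana names and no `C3_ALPHA`, which matches the breakdown given in the comment above.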

