Giving a character a new identity (by giving it some secondary weight)

by Michael S. Kaplan, published on 2007/02/17 17:00 -05:00, original URI:

Hatter Jiang asks over on the MSDN Forums:

I found a bug when I am programming:
when I use like this:


returns -1, when it should return 9;


is OK.

and belows all worked well:


My environment is :
Windows XP SP2 (English Edition + Multi-Language Pack) + .NET Framework v2.0.50727 + Visual Studio 2005

PS: It works well in javascript.

Although at least one other person also confirmed this to be a "bug", the truth is that it is entirely by design: this is the way collation works on Windows and in the .NET Framework.

The character at the end of the string is U+ff9e, a.k.a. HALFWIDTH KATAKANA VOICED SOUND MARK. It is treated as a diacritic attached to the preceding character, and as such it modifies its identity, the same way that a Latin letter plus a diacritic is not equal to the Latin letter by itself.

It is the same issue that causes people to mistakenly identify this issue as a bug.

In fairness to Hatter Jiang, this is not, strictly speaking, quite as clear-cut an issue as the other one where the TAB is used as the base character, since the HALFWIDTH KATAKANA VOICED SOUND MARK is not really identical to the European notion of a diacritic. But it is given a secondary weight, and this is the way secondary sort weights are handled in both managed and unmanaged code...

And if you want, you can always pass the CompareOptions.IgnoreNonSpace flag (the one that ignores diacritics) to the CompareInfo.IndexOf method. :-)
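For contrast, here is a minimal sketch of why the same search "works well in javascript". JavaScript's String.prototype.indexOf compares UTF-16 code units and applies no collation weights at all, so a base letter still matches when a mark follows it. The sample string below is hypothetical (the forum post's actual string is not reproduced here), and the names sample, ordinalIndex and codePoints are made up for illustration.

```javascript
// Hypothetical sample text; the string from the forum post is not shown above.
const HALFWIDTH_WA = "\uFF9C"; // HALFWIDTH KATAKANA LETTER WA
const VOICED_MARK = "\uFF9E";  // HALFWIDTH KATAKANA VOICED SOUND MARK
const sample = "abc" + HALFWIDTH_WA + VOICED_MARK;

// indexOf compares UTF-16 code units one by one. It knows nothing about
// sort weights, so the mark that follows the base letter does not stop
// the base letter from matching.
const ordinalIndex = sample.indexOf(HALFWIDTH_WA);

// List the code points so the trailing mark is visible.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(ordinalIndex);       // 3
console.log(codePoints(sample)); // U+0061, U+0062, U+0063, U+FF9C, U+FF9E
```

A culture-sensitive search treats the last two code points as one unit with its own identity, which is why it does not report a match for the base letter alone.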


This post brought to you by ﾞ (U+ff9e, a.k.a. HALFWIDTH KATAKANA VOICED SOUND MARK)

# dono on 17 Feb 2007 9:31 PM:

Not to mention that attaching U+FF9E to wa is entirely meaningless. Also, it goes without saying that the velar glide /w/ is already voiced.

# Jason Truesdell on 17 Feb 2007 9:34 PM:

It's probably also important to point out that the reported behavior is more linguistically appropriate than what Hatter expects. ヷ should not match ワ. If the katakana were fullwidth, we wouldn't want to match ワ for ヷ (not sure if that's actually in common use; see the brief, which is in Japanese, for information); "va" does not contain "wa". The default behavior of IndexOf seems to be consistent with Unicode KC normalization, which is almost always going to be more linguistically appropriate than a codepoint-only comparison.

Generally, ヷ will be substituted with ヴァ in Japanese typography anyway, so the issue is almost moot.

Of course, the character specified is tricky because this form is a fairly recent use of katakana, attempting to represent v sounds more faithfully in Japanese, and there are inconsistencies in actual usage. (バイオリン、ヴァイオリン, maybe even ヷイオリン could be imagined).

Some people argue that a search for バイオリン should include matches for ヴァイオリン, since they are relatively simple orthographic variants.

However, I doubt even they would want to have ヴァイ match ワイ (vie/why). Accordingly, the default behavior of .IndexOf seems reasonable to me, and more appropriate than JavaScript's behavior.
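The point about KC normalization in the comment above can be checked with a short sketch. It assumes a JavaScript runtime with String.prototype.normalize, and the variable names are made up for illustration. After NFKC, the halfwidth letter plus the halfwidth voiced sound mark become the single code point ヷ (U+30F7), which does not contain ワ (U+30EF).

```javascript
// Hypothetical inputs chosen for illustration.
const halfwidthPair = "\uFF9C\uFF9E"; // halfwidth WA + halfwidth voiced sound mark
const katakanaWa = "\u30EF";          // KATAKANA LETTER WA
const katakanaVa = "\u30F7";          // KATAKANA LETTER VA

// NFKC maps the halfwidth forms to their standard equivalents, then
// composes base letter + combining voiced mark into one code point.
const normalizedPair = halfwidthPair.normalize("NFKC");

console.log(normalizedPair === katakanaVa);      // true
console.log(normalizedPair.length);              // 1
console.log(normalizedPair.indexOf(katakanaWa)); // -1
```

So a code-unit search run on NFKC-normalized text fails to find "wa" inside "va", the same way the default IndexOf does.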

referenced by

2007/07/06 A non-spacing mark and a diacritic are not always the same thing

2007/02/17 Why I think System.String.IndexOf(Char) sucks
