A non-spacing mark and a diacritic are not always the same thing

by Michael S. Kaplan, published on 2007/07/06 08:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/06/3722803.aspx


Ben asked me the other day via email:

This isn’t something I’m blocked on, but if you’re curious (I am!) –

I’m wondering about the expected behavior of CompareInfo.IndexOf.  I found that when searching for a Kannada string (“ಕನ್ನಡ”) I match versus a longer version that ends with what appears to be a non-spacing mark: “ಕನ್ನಡಿ” (hex dump below).  I can work around that by checking for trailing non-spacing marks at the end of the match.

However, I also experimented with searching for ‘e’ (0x65) in the text ‘e’ + combining acute { 0x65, 0x301 }.  In this case, IndexOf returns -1.  In both cases, I have a trailing NonSpacingMark, but in only one do I get a match.  Any idea what gives? 

    string text =    new string(new char [] { (char)0xc95, (char)0xca8, (char)0xccd, (char)0xca8, (char)0xca1, (char)0xcbf });
    string pattern = new string(new char [] { (char)0xc95, (char)0xca8, (char)0xccd, (char)0xca8, (char)0xca1 });
    CompareInfo compareInfo = CompareInfo.GetCompareInfo("kn-IN");
    int index = compareInfo.IndexOf(text, pattern, 0); // returns 0

    text =    new string(new char [] { (char)0x65, (char)0x301 });
    pattern = new string(new char [] { (char)0x65 });
    compareInfo = CompareInfo.GetCompareInfo("en-us");
    index = compareInfo.IndexOf(text, pattern, 0); // returns -1

Ben

Well, the second case Ben describes is by design and is similar to the issue I mentioned here and here.

Though this case is more convincing/compelling, since it really is a diacritic on a letter, etc.

The first case, however, is different. The additional character, U+0cbf (a.k.a. ಿ, KANNADA VOWEL SIGN I), is technically general category == Mn (Mark, Nonspacing), but it takes more than that to impact collation -- in this case because the letter has primary weight.
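To see that general category alone cannot tell the two cases apart, here is a minimal sketch that asks .NET for the category of both trailing characters. It also includes a helper along the lines of the workaround Ben describes; the class name MarkCheck and the method IsFollowedByNonSpacingMark are hypothetical names made up for this illustration, not part of any framework API.

```csharp
using System;
using System.Globalization;

static class MarkCheck
{
    // Hypothetical helper (an assumption, not a framework API): reports whether
    // the character immediately after a match is general category Mn.
    public static bool IsFollowedByNonSpacingMark(string text, int matchIndex, int matchLength)
    {
        int next = matchIndex + matchLength;
        return next < text.Length
            && CharUnicodeInfo.GetUnicodeCategory(text, next) == UnicodeCategory.NonSpacingMark;
    }

    static void Main()
    {
        // Both trailing characters from Ben's examples share the same category.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u0CBF')); // NonSpacingMark
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u0301')); // NonSpacingMark

        // The Kannada text from the email, with the pattern matched at index 0, length 5.
        string text = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1\u0CBF";
        Console.WriteLine(IsFollowedByNonSpacingMark(text, 0, 5)); // True
    }
}
```

Since both characters report NonSpacingMark, a check like this can flag a trailing mark after a match, but it says nothing about how collation weights that mark, which is where the two cases diverge.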

Anyone want to guess what that reason might be? :-)

(I'll give people a chance to respond for this question, and I'll give some answers tomorrow or the next day)

 

This post brought to you by ಿ (U+0cbf, a.k.a. KANNADA VOWEL SIGN I)


harrymc on 25 Nov 2010 4:21 AM:

"tomorrow or the next day" was more than 3 years ago ...

Michael S. Kaplan on 25 Nov 2010 6:13 AM:

Holding out hope for a response, I guess? :-)

