Put in on my Tab, please

by Michael S. Kaplan, published on 2006/09/19 10:41 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/19/762106.aspx

A question came in just the other day, built around code like the following:

Console.WriteLine("\t\u3094".IndexOf("\t")); // returns 0
Console.WriteLine("\t\u3098".IndexOf("\t")); // returns 0
Console.WriteLine("\t\u3099".IndexOf("\t")); // returns -1
Console.WriteLine("\t\u309a".IndexOf("\t")); // returns -1
Console.WriteLine("\t\u309b".IndexOf("\t")); // returns -1
Console.WriteLine("\t\u309c".IndexOf("\t")); // returns -1
Console.WriteLine("\t\u309d".IndexOf("\t")); // returns 0

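A bunch of users are pretty confused by the behavior here, searching for a TAB in strings that combine it with the following characters:

U+3094 HIRAGANA LETTER VU
U+3098 (not assigned)
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+309a COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
U+309b KATAKANA-HIRAGANA VOICED SOUND MARK
U+309c KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
U+309d HIRAGANA ITERATION MARK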

The report is that it is a bit confusing that the four in the middle behave differently than the three on the ends.

Of course to start I'll point out that these strings are not all that meaningful, linguistically. How does one voice or semi-voice a tab? :-)

To some extent you could consider it a side effect of the way that collation is implemented to achieve the results I discussed in "Knock knock! Who's there? Kana! Kana Who?", but those voiced and semi-voiced sound marks are given a diacritic weight only, just as we might do with U+030a (COMBINING RING ABOVE) or U+0327 (COMBINING CEDILLA). And the thing about characters that are only given diacritic weight is that they have no independent identity. They merge their weight in with the previous character, so look at the sort keys of all of these strings:

"\t\u3094": 07 05 22 04 01 02 03 01 01 ff 02 ff ff 01 00

"\t\u3098": 07 05 01 01 01 01 00

"\t\u3099": 07 05 01 03 01 01 01 00

"\t\u309a": 07 05 01 04 01 01 01 00

"\t\u309b": 07 05 01 03 01 01 01 00

"\t\u309c": 07 05 01 04 01 01 01 00

"\t\u309d": 07 05 07 05 01 01 01 01 00
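If you want to look at sort keys like these yourself, here is a minimal sketch using CompareInfo.GetSortKey. The class name SortKeyDump and the helper Hex are just names made up for illustration, and the exact bytes depend on the runtime and the collation data it uses, so they may not match the ones listed here.

```csharp
using System;
using System.Globalization;

static class SortKeyDump
{
    // Returns the sort key bytes for s as space-separated lowercase hex pairs.
    public static string Hex(string s, CompareInfo ci)
    {
        byte[] key = ci.GetSortKey(s).KeyData;
        return BitConverter.ToString(key).Replace("-", " ").ToLowerInvariant();
    }

    static void Main()
    {
        // Same culture that the parameterless-comparison IndexOf(string) call uses.
        CompareInfo ci = CultureInfo.CurrentCulture.CompareInfo;

        // The seven characters that follow the TAB in the question.
        string[] tails = { "\u3094", "\u3098", "\u3099", "\u309a",
                           "\u309b", "\u309c", "\u309d" };

        foreach (string tail in tails)
        {
            Console.WriteLine("TAB + U+{0:x4}: {1}",
                              (int)tail[0], Hex("\t" + tail, ci));
        }
    }
}
```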

Now the second sort key is the one with the undefined character (U+3098). In that one the second character has no weight, so the sort key is the same as that of the tab alone.

And the last one is using the iteration mark (U+309d), which acts as a repeater. See how the primary weight 07 05 is repeated?

The first one has a tab with no extra weight, followed by another character.

And those four in the middle each have some diacritic weight put on them. Kind of a less extreme case of the phenomenon I described in "What do you get when you combine a base character with a buttload of diacritics?"

So you can think of them not as TAB characters, but as TAB++ or something. Trying to find the TAB inside of them is like trying to find the "a" inside the "å". Which you cannot do, because it really is no longer an "a" anymore....
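For contrast, here is a small sketch of the same search done two ways: the linguistic search that the question uses, and an ordinal one that just compares UTF-16 code units and so knows nothing about weights. The class and method names (TabSearch, LinguisticIndexOfTab, OrdinalIndexOfTab) are made up for illustration. Only the ordinal result is guaranteed; the linguistic result depends on the collation data of the runtime in use.

```csharp
using System;

static class TabSearch
{
    // Culture-sensitive search: works on collation weights, so a TAB that has
    // picked up diacritic weight from the next character may not be found.
    public static int LinguisticIndexOfTab(string s)
    {
        return s.IndexOf("\t", StringComparison.CurrentCulture);
    }

    // Ordinal search: compares UTF-16 code units one by one, no weights involved.
    public static int OrdinalIndexOfTab(string s)
    {
        return s.IndexOf("\t", StringComparison.Ordinal);
    }

    static void Main()
    {
        string s = "\t\u3099"; // TAB + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

        // Always 0: the first code unit is U+0009.
        Console.WriteLine(OrdinalIndexOfTab(s));

        // -1 in the report above; other runtimes may answer differently.
        Console.WriteLine(LinguisticIndexOfTab(s));
    }
}
```

Whether the ordinal answer is the one you want depends on whether you are looking for the code point or for the "letter" a user would see.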


This post brought to you by ゝ (U+309d, a.k.a. HIRAGANA ITERATION MARK)

