The further you look into it, the further things stick out

by Michael S. Kaplan, published on 2005/08/03 13:20 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/03/447224.aspx


This line from a David Byrne song has a lot of truth in it and it could be applied well to collation. :-)

Just recently I was reading some posts from ものがたり (diary for AtsushiEno), specifically the ones from June 17th of this year.

The first post is entirely correct -- code such as

CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
Console.WriteLine (ci.Compare ("\u25EF", "\u25B6", CompareOptions.IgnoreNonSpace));

will indeed take two very different symbols (◯ and ▶, a.k.a. LARGE CIRCLE and BLACK RIGHT-POINTING TRIANGLE) and call them equal: they happen to have similar weights, and once you start ignoring secondary distinctions, you may find random things to be "equal".

Now I do not want to argue the philosophical issues about this being good or bad, but I do know that there is only a limited amount of space in the collation weight table, and using that limited space to make sure that symbols which do not need linguistic differentiation have unique weights is not in the best interests of anyone.

After consulting with our data PM, I re-ordered the source file to be by weight, and it was interesting to see some of the items that found themselves sorted together like this!

But it does get back to meaningful comparisons, and proper usage.
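
To make the "proper usage" point a little more concrete, here is a minimal sketch that runs the same two symbols through three different CompareOptions values. The CompareWith helper and the SymbolCompareDemo class are hypothetical names used only for illustration. The linguistic results depend on the collation data of the runtime in use, so no output is claimed for them here; the ordinal result is the only one guaranteed to be nonzero for two different strings.

```csharp
using System;
using System.Globalization;

public static class SymbolCompareDemo
{
    // Hypothetical helper (not a framework method): compares two strings
    // with the invariant culture's CompareInfo under the given options.
    public static int CompareWith(string a, string b, CompareOptions options)
    {
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
        return ci.Compare(a, b, options);
    }

    public static void Main()
    {
        string circle = "\u25EF";   // LARGE CIRCLE
        string triangle = "\u25B6"; // BLACK RIGHT-POINTING TRIANGLE

        // Linguistic comparison with all weights considered.
        Console.WriteLine(CompareWith(circle, triangle, CompareOptions.None));

        // Linguistic comparison with nonspacing (secondary) distinctions ignored.
        Console.WriteLine(CompareWith(circle, triangle, CompareOptions.IgnoreNonSpace));

        // Code point comparison: no collation weights involved at all.
        Console.WriteLine(CompareWith(circle, triangle, CompareOptions.Ordinal));
    }
}
```

If the goal is to ask "are these the same characters?" rather than "do these sort together?", the ordinal form is the one that answers the question being asked.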

The second post was even more interesting, though the description is off -- the results of the code

CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
Console.WriteLine ("AE".IndexOf ('\u00C6'));
Console.WriteLine ("AE".IndexOf ("\u00C6"));
Console.WriteLine (ci.IndexOf ("AE", '\u00C6'));

seem to indicate that string.IndexOf(char) is indeed doing an ordinal comparison, not just a culturally insensitive one (the first one returns -1, the others return 0).

This one seems like a bug to me, though it is a longstanding one (it happens in 1.0, 1.1, and 2.0). It might be easier to document at this point rather than change the behavior (one never knows who might be relying on it!).
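
One way to avoid depending on which kind of comparison a given overload happens to perform is to state it at the call site. The sketch below assumes .NET 2.0 or later, where the StringComparison enumeration is available; FindOrdinal and FindInvariant are hypothetical helper names, not framework methods.

```csharp
using System;

public static class ExplicitIndexOfDemo
{
    // Hypothetical helper: searches by comparing UTF-16 code units only.
    public static int FindOrdinal(string source, string value)
    {
        return source.IndexOf(value, StringComparison.Ordinal);
    }

    // Hypothetical helper: searches using the invariant culture's
    // linguistic comparison rules.
    public static int FindInvariant(string source, string value)
    {
        return source.IndexOf(value, StringComparison.InvariantCulture);
    }

    public static void Main()
    {
        // "AE" contains no U+00C6 code unit, so the ordinal search finds nothing.
        Console.WriteLine(FindOrdinal("AE", "\u00C6"));   // prints -1

        // The linguistic result depends on the collation data in use,
        // so no expected value is shown here.
        Console.WriteLine(FindInvariant("AE", "\u00C6"));
    }
}
```

Spelling out the comparison this way means the code keeps meaning the same thing whether or not the char overload's behavior is ever changed or documented.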

For the sake of completeness, I'll draw this code out a little further:

Console.WriteLine("AE".IndexOf('\u00c6'));
Console.WriteLine("AE".IndexOf("\u00c6"));

CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
Console.WriteLine(ci.IndexOf("AE", '\u00c6'));
Console.WriteLine(ci.IndexOf("AE", "\u00c6"));
Console.WriteLine(ci.IndexOf("AE", '\u00c6', CompareOptions.Ordinal));

ci = new CultureInfo("is-IS").CompareInfo;
Console.WriteLine(ci.IndexOf("AE", '\u00c6'));
Console.WriteLine(ci.IndexOf("AE", "\u00c6"));
Console.WriteLine(ci.IndexOf("AE", '\u00c6', CompareOptions.Ordinal));

The results here? They will be:

-1
0
0
0
-1
-1
-1
-1

Now the first one is the issue that was pointed out originally, and the next three show what happens when the NLS collation stuff kicks in. The fifth and eighth items show that when you force an ordinal comparison, you get the same behavior as that initial one. And the sixth and seventh items show what happens if you are doing the comparison in the context of a language that has other ideas about the AE ligature -- Icelandic will not treat it as equivalent to the letters A and E.

Both interesting issues, and it is cool to see people noticing what happens with collation!

 

This post brought to you by "▶" (U+25b6, a.k.a. BLACK RIGHT-POINTING TRIANGLE)
A symbol that is quite insistent that she and her friend U+25ef (LARGE CIRCLE) are just good friends

 


# Atsushi Eno on 3 Aug 2005 2:14 PM:

Haha, I wonder if you can read that Japanese text. BTW a few weeks ago I summarized almost all thoughts I wrote in those Japanese pages in English:
http://monkey.workarea.jp/lb/archive/2005/6-22.html
http://monkey.workarea.jp/lb/archive/2005/6-23.html
http://monkey.workarea.jp/lb/archive/2005/6-24.html
http://monkey.workarea.jp/lb/archive/2005/6-25.html
http://monkey.workarea.jp/lb/archive/2005/6-26.html

so you don't have to be messed by Japanese. (enjoy my Japanglish instead ;-)

# Michael S. Kaplan on 3 Aug 2005 2:19 PM:

I was able to read the page, with some assistance. :-)

