Elegant? Beyond compare...

by Michael S. Kaplan, published on 2007/08/31 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/31/4661705.aspx

Of course the bad thing about the blog is that it is a lot like me talking to people....

I was thinking about this when I read Jan Kucera's contribution to the Suggestion Box:

For a slightly more complex breakout of items, you can see the post I mention in Something .NET does less intuitively than they ought that Josh Free wrote. Though for every person who has told me they found the table helpful, I have talked to at least one other person who found it made things more confusing -- the same way having all those different methods does.

But using the above three principles one should be able to resolve just about any question about appropriate string comparisons (whether case sensitive or insensitive)....

This post brought to you by Ω and Ω (U+2126 and U+03A9, a.k.a. OHM SIGN and GREEK CAPITAL LETTER OMEGA)

"From some of your posts it is clear that lower-casing is better than upper-casing, since there are lower case characters without upper case equivalents."

Hm... I've gotten the impression that uppercasing is better because there are characters where "a" and "b" both use "C" as their uppercase form. (of course, this is only for environments without Windows' magical repository of linguistic data, so I'm more worried about being less wrong than correct) If you ask to lowercase "C" you have to invent data or do nothing, either of which will produce a meaningless comparison. Meanwhile, if you have some characters that have no uppercase form, uppercasing can safely do nothing (or delete the char, as long as it's consistent) without ambiguity.

Of course if there are also characters where "A" and "B" both use "c" as their lowercase form then you lose either way, but then winning wasn't on the plate to begin with.
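Both kinds of collision described above do occur in practice. Here is a minimal Python sketch, assuming the default (culture-neutral) mappings behind `str.upper` and `str.lower`. The `collisions` helper and the sample characters are picked purely for illustration and do not come from the post or the comments.

```python
from collections import defaultdict


def collisions(chars, convert):
    """Group characters by their converted form; keep only groups that collide."""
    groups = defaultdict(list)
    for ch in chars:
        groups[convert(ch)].append(ch)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Lowercase letters that share one uppercase form:
# s / LONG S (U+017F), and SIGMA (U+03C3) / FINAL SIGMA (U+03C2).
lower_samples = ["s", "\u017f", "\u03c3", "\u03c2"]

# Uppercase characters that share one lowercase form:
# K / KELVIN SIGN (U+212A), and OMEGA (U+03A9) / OHM SIGN (U+2126).
upper_samples = ["K", "\u212a", "\u03a9", "\u2126"]

upper_collisions = collisions(lower_samples, str.upper)
lower_collisions = collisions(upper_samples, str.lower)

# Round-tripping the OHM SIGN through lower() and upper() does not return
# the original character; it lands on GREEK CAPITAL LETTER OMEGA instead.
ohm_round_trip = "\u2126".lower().upper()
```

So neither direction is lossless: whichever way you fold, some distinct characters end up comparing equal.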

Hello Michael,

Well, first, thank you for getting into my question. Now I feel a little bit (actually a lot) ashamed of the examples I have written, and I have also discovered the very obvious thing - interpreting my understanding of things as the understanding of others is an absolutely bad idea - thanks to Josh's comment.

I found your links and linked articles very useful, and I would probably not have asked if I had found them before. For others, if there is anybody, though, here is an updated link to the "New Recommendations for Using Strings in Microsoft .NET 2.0": http://msdn2.microsoft.com/en-us/library/ms973919.aspx.

Now, to the problem which originally raised my question (I had begun to wonder if I get things right enough). I have general strings from several cultures (and I know which ones they come from). I'd like to provide case-insensitive search over this data to the user.

I thought I would search every piece of data in the context of the culture it belongs to. That means that if the user is running my application in any culture he wants and is looking for I, it would match I and ı in Turkish texts and I, i, İ, but not ı, in English text.

Honestly, is this stupid behaviour?

Thanks, Jan

PS. Actually I'm not sure if I want the Ohm sign and Omega letter to match or not... which does not foretell anything good. :-/
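One data point for that question: Unicode treats the two characters as canonically equivalent, so any comparison that normalizes first will match them. Below is a small Python sketch using the standard `unicodedata` module; the helper name `canonically_equal` is hypothetical, and whether a given search should normalize at all remains a design decision.

```python
import unicodedata

OHM_SIGN = "\u2126"      # OHM SIGN
GREEK_OMEGA = "\u03a9"   # GREEK CAPITAL LETTER OMEGA


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code point comparison treats them as different characters.
ordinal_match = OHM_SIGN == GREEK_OMEGA

# After normalization the OHM SIGN becomes GREEK CAPITAL LETTER OMEGA.
normalized_match = canonically_equal(OHM_SIGN, GREEK_OMEGA)
```

An ordinal comparison keeps them apart, while a normalizing comparison folds them together.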

Oh yes, and please accept my apologies for making you wonder about your posts. I find them clear and useful, and they really bring a lot to me. I've just started to take an interest in international things since I came to your blog a few months ago, and I have a huge amount of things to learn. My wrong lowercase preference was just because I had not come across any post like Josh's comment since I have been here, but I had come across some of yours about missing uppercase letters.

Hi Jan,

I do not expect that they will be updating that document, because I have been asking them to do so since before it was initially published and they never have. :-(

But in the case of multilingual data, search should in most cases still be based on the expectations of the person doing the search, not on that of the target culture of the data.
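As a rough illustration of that idea, here is a Python sketch in which the folding rules are chosen once, from the searcher's culture, and applied to every item regardless of the culture the item came from. Python has no built-in culture-sensitive casing, so the `fold` and `search` helpers, the culture names, and the sample data are all assumptions made for this sketch. The Turkic branch hand-codes the two Turkic (T) mappings from Unicode's CaseFolding.txt.

```python
def fold(text: str, culture: str) -> str:
    """Case-fold text, using dotted/dotless I rules for Turkic cultures."""
    language = culture.split("-")[0].lower()
    if language in ("tr", "az"):
        # Turkic mappings: U+0130 (dotted capital I) -> i, I -> U+0131 (dotless i).
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.casefold()


def search(items, query, searcher_culture):
    """Return texts containing the query, folded by the searcher's rules.

    Each item is a (text, data_culture) pair. The data culture is ignored
    on purpose; keying fold() on it instead would give the per-item
    behavior proposed in the question.
    """
    needle = fold(query, searcher_culture)
    return [text for text, _data_culture in items
            if needle in fold(text, searcher_culture)]


# Hypothetical sample data: (text, culture the text came from).
items = [
    ("\u0131s\u0131", "tr-TR"),  # Turkish word written with dotless i
    ("Iowa", "en-US"),
    ("iowa", "en-US"),
]

english_hits = search(items, "I", "en-US")
turkish_hits = search(items, "I", "tr-TR")
```

With these inputs, an English-speaking searcher looking for "I" finds "Iowa" and "iowa", while a Turkish-speaking searcher finds the dotless-i word and "Iowa" but not "iowa".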

But I did not really take it too personally, so don't worry -- I have people tell me that they are confused by the area all the time and that sometimes it is only six months later that something I wrote makes sense. I am doing my best to help to shorten that time, for all sorts of reasons....

Well, better to put it as they will not be as surprised/unhappy if their preferences ARE respected than if you try to fold all of them (Iiİı) together and do not distinguish them....

You could make an exception for the Turkic case given that Windows has way over a decade of history doing it wrong -- so people may be expecting it to keep being wrong (note that CompareString by default will still do it wrong today).

Although you can find targeted exceptions, they represent the exception, not the rule, as I mentioned here (note links to examples there). And the exceptions don't tend to scale all that well since most people do not know all of the comparison rules of all languages!