Elegant? Beyond compare...

by Michael S. Kaplan, published on 2007/08/31 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/31/4661705.aspx


Sometimes I wonder if the posts I write are not clear.

The good thing about the blog is that it is a lot like me talking to people.

Of course the bad thing about the blog is that it is a lot like me talking to people....

I was thinking about this when I read Jan Kučera's contribution to the Suggestion Box:

Hello!

I've been reading your blog for a couple of months and I've learned a lot of things.

We've seen a couple of examples of what we really should not do and some hints about what is better.

I'd like to know what the most correct way is to compare strings while ignoring case. (I work with managed classes, but others might welcome an unmanaged way as well.)

From some of your posts it is clear that lower-casing is better than upper-casing, since there are lower case characters without upper case equivalents.

Also, StringComparison.OrdinalIgnoreCase seems not to be the best choice.

So strA.ToLower() == strB.ToLower() ?

or strA.ToLowerInvariant() == strB.ToLowerInvariant() ?

or string.Compare(strA, strB, true) ?

or string.Compare(strA, strB, StringComparison.InvariantCultureIgnoreCase)?

Does using CultureInfo.CurrentCulture for string operations mean that the code will behave differently over the same data when running under a different culture? If so, wouldn't it be better to choose one particular culture?

Well...is trustworthy case-unaware comparison possible at all? :-)

Thanks for any hints on these topics. Or have you already answered this in the past?

Jan

There is a lot in there that does not represent best practices, unfortunately.

There is a post in which I suggested a few guiding principles, entitled Browsing the shoals of managed string comparisons. In particular there is the bit at the bottom:

That third rule is the most important one....

For a slightly more complex breakout of items, you can see the post by Josh Free that I mention in Something .NET does less intuitively than they ought. Though for every person who has told me they found the table helpful, I have talked to at least one other person who found it made things more confusing -- the same way having all those different methods does.

But using the above three principles one should be able to resolve just about any question about appropriate string comparisons (whether case sensitive or insensitive)....
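To make Jan's question about CultureInfo.CurrentCulture concrete, here is a minimal sketch (not taken from the posts linked above) of how a lowercase-and-compare can change its answer with the culture. The class name and the LowerIn helper are hypothetical, and the sketch assumes the runtime has en-US and tr-TR culture data available. Under Turkish casing rules a capital I lowercases to dotless ı (U+0131), so the second comparison need not agree with the first.

```csharp
using System;
using System.Globalization;

public static class CultureCasingSketch
{
    // Hypothetical helper: lowercases with an explicitly named culture,
    // which is what strA.ToLower() does implicitly with the current culture.
    public static string LowerIn(string text, string cultureName)
    {
        return text.ToLower(CultureInfo.GetCultureInfo(cultureName));
    }

    public static void Main()
    {
        string a = "TITLE";
        string b = "title";

        // Same data, same code, different culture: the answers can differ.
        Console.WriteLine(LowerIn(a, "en-US") == LowerIn(b, "en-US"));
        Console.WriteLine(LowerIn(a, "tr-TR") == LowerIn(b, "tr-TR"));

        // States its intent explicitly and never consults the current culture.
        Console.WriteLine(string.Equals(a, b, StringComparison.OrdinalIgnoreCase));
    }
}
```

The last comparison is the only one of the three whose result cannot change when the same code runs on a machine with different regional settings.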

 

This post brought to you by Ω and Ω (U+2126 and U+03a9, a.k.a. OHM SIGN and GREEK CAPITAL LETTER OMEGA)


# josh on 31 Aug 2007 1:13 PM:

"From some of your posts it is clear that lower-casing is better than upper-casing, since there are lower case characters without upper case equivalents."

Hm...  I've gotten the impression that uppercasing is better because there are characters where "a" and "b" both use "C" as their uppercase form.  (of course, this is only for environments without Windows' magical repository of linguistic data, so I'm more worried about being less wrong than correct)  If you ask to lowercase "C" you have to invent data or do nothing, either of which will produce a meaningless comparison.  Meanwhile, if you have some characters that have no uppercase form, uppercasing can safely do nothing (or delete the char, as long as it's consistent) without ambiguity.

Of course if there are also characters where "A" and "B" both use "c" as their lowercase form then you lose either way, but then winning wasn't on the plate to begin with.

# Michael S. Kaplan on 31 Aug 2007 2:29 PM:

You are correct -- uppercase is preferred to lowercase (and it is the underlying way that OrdinalIgnoreCase works, FWIW). Lowercasing can be evil for at least one character....
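As a small illustration of the OrdinalIgnoreCase comparison mentioned above, here is a sketch for non-linguistic strings such as identifiers or keys. The class and method names are hypothetical. The second comparison is deliberately left without an expected result: it is there to be tried on the runtime in use rather than taken on faith.

```csharp
using System;

public static class OrdinalIgnoreCaseSketch
{
    // Hypothetical helper for identifiers, keys and other non-linguistic strings.
    public static bool SameIgnoringCase(string left, string right)
    {
        return string.Equals(left, right, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(SameIgnoringCase("file", "FILE")); // True

        // OHM SIGN (U+2126) versus GREEK CAPITAL LETTER OMEGA (U+03A9):
        // run this to see how the runtime in use treats the pair.
        Console.WriteLine(SameIgnoringCase("\u2126", "\u03A9"));
    }
}
```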

# Jan Kučera on 4 Sep 2007 1:01 PM:

Hello Michael,

Well, first, thank you for getting into my question. Now I feel a little bit (actually a lot) ashamed of the examples I wrote, and thanks to josh's comment I also discovered a very obvious thing: treating my own understanding of things as everyone else's understanding is an absolutely bad idea.

I found your links and the linked articles very useful, and I would probably not have asked if I had found them before. For others, if there are any, here is an updated link to the "New Recommendations for Using Strings in Microsoft .NET 2.0": http://msdn2.microsoft.com/en-us/library/ms973919.aspx.

Now, to the problem which originally raised my question (I had begun to wonder whether I understood things well enough). I have general strings from several cultures (and I know which ones). I'd like to provide case-insensitive search over this data to the user.

I thought I would search every piece of data in the context of the culture it belongs to. That means that if the user is running my application in any culture he wants and is looking for I, it would match I and ı in Turkish texts, and I, i, İ, but not ı, in English texts.

 Honestly, is this stupid behaviour?

          Thanks, Jan

PS. Actually I'm not sure if I want the Ohm sign and Omega letter to match or not... which does not foretell anything good. :-/

Oh yes, and please accept my apologies for making you wonder about your posts. I find them clear and useful and they really bring a lot to me. I only started to take an interest in international things when I came to your blog a few months ago, and I have a huge amount of things to learn. My wrong lowercase preference was just because I had not come across any post like josh's comment since I have been here, but I had come across some of yours about missing uppercase letters.

# Michael S. Kaplan on 4 Sep 2007 6:32 PM:

Hi Jan,

I do not expect that they will be updating that document, because I have been asking them to do so since before it was initially published and they never have. :-(

But in the case of multilingual data, search should in most cases still be based on the expectations of the person doing the search, not on those of the target culture of the data.

But I did not really take it too personally, so don't worry -- I have people tell me that they are confused by the area all the time and that sometimes it is only six months later that something I wrote makes sense. I am doing my best to help to shorten that time, for all sorts of reasons....

# Jan Kučera on 5 Sep 2007 2:15 AM:

Okay, so you mean that if a Turkish guy searches for "I", he doesn't want "i" included in the results (from English texts)?

# Michael S. Kaplan on 5 Sep 2007 2:34 AM:

Well, better to put it as they will not be as surprised/unhappy if their preferences ARE respected than if you try to fold all of them (Iiİı) together and do not distinguish them....

You could make an exception for the Turkic case given that Windows has way over a decade of history doing it wrong -- so people may be expecting it to keep being wrong (note that CompareString by default will still do it wrong today).

Although you can find targeted exceptions, they represent the exception, not the rule, as I mentioned here (note links to examples there). And the exceptions don't tend to scale all that well since most people do not know all of the comparison rules of all languages!
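A rough sketch of a search that follows the expectations of the person doing the search, rather than the culture of the data being searched. The class name and the ContainsForUser helper are hypothetical; the sketch relies only on CompareInfo.IndexOf with CompareOptions.IgnoreCase, and it is one possible reading of the advice in this thread, not code from it.

```csharp
using System;
using System.Globalization;

public static class UserCultureSearchSketch
{
    // Hypothetical helper: case-insensitive "contains" evaluated under the
    // searcher's culture, whatever culture the text itself came from.
    public static bool ContainsForUser(string text, string query, CultureInfo userCulture)
    {
        return userCulture.CompareInfo.IndexOf(text, query, CompareOptions.IgnoreCase) >= 0;
    }

    public static void Main()
    {
        // The searcher's culture is taken from the running thread here.
        CultureInfo user = CultureInfo.CurrentCulture;
        Console.WriteLine(ContainsForUser("Browsing the shoals", "SHOALS", user));
    }
}
```

Passing the culture as a parameter keeps the decision visible at the call site, so a targeted exception (such as the Turkic one discussed above) can be made in one place if user feedback calls for it.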

# Jan Kučera on 5 Sep 2007 10:21 AM:

Fair enough, so it seems I'll just leave the comparing (and sorting) to the .NET Framework and trust that it will know what the user expects to see.

Thank you for your help and opinions!

# Jan Kučera on 13 Sep 2007 10:45 AM:

One more question, Michael, maybe a little bit OT. If the user expects the sorting and so on to follow his locale, does that mean I should keep non-Latin names in their native script? I mean, I have a sorted list of authors. Should I keep the Russian authors in Cyrillic or transcribe them to Latin? Or add both to the list?

# Michael S. Kaplan on 13 Sep 2007 11:37 AM:

Those kinds of items have to be based on actual user feedback -- as the answer will vary depending on the customers in question....

