Browsing the shoals of managed string comparisons

by Michael S. Kaplan, published on 2005/06/12 21:15 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/12/428429.aspx


It was a little over a month ago that I pointed out that Similar descriptions does not mean similar methodologies, and I spent a little time comparing many of the Win32, Shell, Shlwapi, CRT, and Kernel methods of doing case-insensitive comparisons. And of course some people looked at that topic and saw it as a proof that managed code is the way to go to avoid the confusion over which method to use.

But as I hinted at in this post, things are not so simple in the managed world that you can really count on all of that confusion going away. So you can think of this post as the managed version of that issue.

First there is the core method, the managed equivalent of CompareString, the CompareInfo class and its Compare method. One of the overrides for that method takes a CompareOptions enumeration member that lets you get at the gamut of insensitive operations for case, nonspacing mark, symbol, kana, width, etc., as well as getting to Ordinal (and as of Whidbey post Beta 2, OrdinalIgnoreCase). Since you can do it off of any culture, you have access to using the invariant culture as well.
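As a rough illustration of the Compare overload that takes a CompareOptions value, here is a minimal sketch using the invariant culture's CompareInfo. The class name, the AreEqual helper, and the sample strings are assumptions made up for the example, not anything from the framework or this post. It only checks equal versus not equal, since the exact non-zero values are not something to rely on.

```csharp
using System;
using System.Globalization;

static class CompareInfoSketch
{
    // Hypothetical helper: true when the two strings compare as equal
    // under the invariant culture with the given options.
    public static bool AreEqual(string a, string b, CompareOptions options)
    {
        CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;
        return invariant.Compare(a, b, options) == 0;
    }

    static void Main()
    {
        // Linguistic comparison, case-sensitive when no options are given.
        Console.WriteLine(AreEqual("file", "FILE", CompareOptions.None));              // False

        // Linguistic comparison that ignores case.
        Console.WriteLine(AreEqual("file", "FILE", CompareOptions.IgnoreCase));        // True

        // The options are flags, so several insensitivities can be combined.
        Console.WriteLine(AreEqual("resume", "RÉSUMÉ",
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace));               // True

        // Ordinal compares code units and cannot be combined with other flags.
        Console.WriteLine(AreEqual("file", "FILE", CompareOptions.Ordinal));           // False
        Console.WriteLine(AreEqual("file", "FILE", CompareOptions.OrdinalIgnoreCase)); // True
    }
}
```

Swapping CultureInfo.InvariantCulture for any other CultureInfo gives the same shape of call with that culture's rules.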

Then there is the String.Compare method, which supports a subset of those operations, but some of the methods take a CultureInfo object and others take a StringComparison enumeration member (many of which give access to the same things a CultureInfo would, like InvariantCulture or CurrentCulture, or an OrdinalIgnoreCase comparison).

Of course there is also a String.CompareOrdinal method which does the same thing that the StringComparison enumeration with the Ordinal comparison would do.
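To make the overlap concrete, here is a small sketch of the String.Compare flavors next to String.CompareOrdinal. The class name, helper names, and sample strings are hypothetical, chosen only for illustration. Signs are compared rather than exact values, because the magnitude of a non-zero result is not guaranteed.

```csharp
using System;
using System.Globalization;

static class StringCompareSketch
{
    // Hypothetical helper: equality through the StringComparison overload.
    public static bool EqualVia(string a, string b, StringComparison kind)
    {
        return String.Compare(a, b, kind) == 0;
    }

    // Hypothetical helper: CompareOrdinal and StringComparison.Ordinal
    // should always agree on the sign of the result.
    public static bool OrdinalFlavorsAgree(string a, string b)
    {
        int viaMethod = String.CompareOrdinal(a, b);
        int viaEnum = String.Compare(a, b, StringComparison.Ordinal);
        return Math.Sign(viaMethod) == Math.Sign(viaEnum);
    }

    static void Main()
    {
        string a = "file", b = "FILE";

        // StringComparison overloads: the style of comparison is named explicitly.
        Console.WriteLine(EqualVia(a, b, StringComparison.OrdinalIgnoreCase));          // True
        Console.WriteLine(EqualVia(a, b, StringComparison.InvariantCultureIgnoreCase)); // True

        // CultureInfo overload: (strA, strB, ignoreCase, culture).
        Console.WriteLine(String.Compare(a, b, true, CultureInfo.InvariantCulture) == 0); // True

        // Two spellings of the same ordinal comparison.
        Console.WriteLine(OrdinalFlavorsAgree(a, b));                                   // True
    }
}
```

Passing a specific culture instead of the invariant one can change the answer for the same pair of strings under cultures with special casing rules (the Turkish dotted and dotless i is the usual example), which is one reason the choice deserves some thought.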

I would be remiss if I did not point out the String.ToUpper/String.ToUpperInvariant methods, especially since the first of them has an override that takes a CultureInfo which makes the second one not entirely necessary. Those extra invariant methods were added in Whidbey; I would not have strenuously objected if they had been taken out prior to shipping. :-)

There is also the new StringComparer class, which has some interesting remarks in it:

You might be confused about how to use the System.StringComparer properties because of a seeming contradiction. The value of each System.StringComparer property is a System.StringComparer object. However, the System.StringComparer class is declared abstract (MustInherit in Visual Basic), which means its members can only be invoked on an object of a class derived from the System.StringComparer class, but each property is declared static (Shared in Visual Basic), which means the property can be invoked without first creating a derived class. This appears to be a contradiction.

The reason you can call a System.StringComparer property directly is because each property actually returns an instance of an anonymous class that is derived from the System.StringComparer class. Consequently, the type of each property value is the base class of the anonymous class, not the type of the anonymous class itself.

I think I can parse that. But for what it is worth, a StringComparer (which includes properties to get at CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture, InvariantCultureIgnoreCase, Ordinal, and OrdinalIgnoreCase flavors of itself), could also have been covered by a CompareInfo -- maybe we should have made CompareInfo inherit the IComparer and IEqualityComparer interfaces that the StringComparer brings to the mix? :-)
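For what those remarks amount to in practice, here is a minimal sketch: the static property returns an object of some derived class, and that one object works as both a comparer and an equality comparer. The class name, the BuildLookup helper, and the sample key are assumptions for the example only.

```csharp
using System;
using System.Collections.Generic;

static class StringComparerSketch
{
    // Hypothetical helper: builds a lookup whose keys ignore case ordinally.
    public static Dictionary<string, int> BuildLookup()
    {
        Dictionary<string, int> sizes =
            new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        sizes["README.TXT"] = 42;
        return sizes;
    }

    static void Main()
    {
        // The property hands back an instance of a class derived from the
        // abstract StringComparer; callers only ever see the base type.
        StringComparer comparer = StringComparer.OrdinalIgnoreCase;
        Console.WriteLine(comparer.GetType() != typeof(StringComparer)); // True

        // The same object serves for ordering and for equality.
        Console.WriteLine(comparer.Compare("readme.txt", "README.TXT")); // 0
        Console.WriteLine(comparer.Equals("readme.txt", "README.TXT"));  // True

        // Collections take it at construction time.
        Dictionary<string, int> sizes = BuildLookup();
        Console.WriteLine(sizes.ContainsKey("readme.txt"));              // True
    }
}
```

The comparer has to be chosen when the collection is created; it cannot be swapped afterwards, so the decision about which flavor is appropriate comes early.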

And every time something takes a CultureInfo for comparisons, it is actually pretty much using that CultureInfo's associated CompareInfo. Which you cannot pass in there most of the time, for reasons of type safety.

Ok, the above will hopefully free people of the illusion of simplicity in managed code. And I did not even get into all of the Hashcode providers, dictionaries, hash tables, and lists that would use these various comparison objects, all of which have to be created in particular ways. I will work on trying to sort some of these out another time, in another post.

For now I will point people at my post about the new string recommendations, and suggest that everyone take that one simple piece of advice I gave:

Use appropriate comparison methods.

Use appropriate comparison methods.

Use appropriate comparison methods.

Use appropriate comparison methods.

The easy (well, easier at least, I think) principles:

If you can follow those three rules, you will seldom if ever go wrong on using appropriate comparison methods.

 

This post brought to you by "¢" and "£" (U+00a2 and U+00a3, a.k.a. CENT SIGN and POUND SIGN)
(As the saying goes, in for a penny, in for a pound....)


# TheMuuj on 13 Jun 2005 6:35 PM:

I looked up this post again just to make sure I was doing the right thing when comparing filenames in Whidbey. I needed to do a wildcard match on a set of files, and remove those files from a List<string> of filenames.

I knew it needed to be case-insensitive (unless I plan on porting to Mono on Unix), but for some reason I was worried that NTFS might use culture-based comparisons for file names. I do know you've talked about how the case-mappings are stored in the file-system, so I have this gut feeling that using StringComparer.OrdinalIgnoreCase might not be good enough.

Which leads me to believe that .NET either needs a FileName class that would be similar to its Uri class, or perhaps just a PathComparer that implements IComparer<string> and IEqualityComparer<string>.

Still, different file systems might behave differently, so I suspect this is impossible to solve 100% of the time, especially if you are comparing relative paths. What if your files move from one partition/filesystem to another during the lifetime of the program? Can you use the same filename comparisons?

Or am I just worrying too much?

# Michael S. Kaplan on 13 Jun 2005 6:51 PM:

You are just worrying too much. :-)

The OrdinalIgnoreCase will give you good behavior for FAT and NTFS. For other file systems the story is not so clean (as you point out), but for the basic ones it is the way.

File systems cannot use locale-specific case mappings, or they would be unable to be used between machines, or even between changes to the settings!

# TheMuuj on 13 Jun 2005 9:36 PM:

Thanks for giving me the answer I wanted to hear, because I'm the type of person who would put a huge workaround in just to keep an edge case from breaking that would probably not come up in production.

Knowing that OrdinalIgnoreCase is good enough will stop me from thinking about it too much.

# Anonymous on 5 Jul 2005 12:51 PM:

Since these immortal words were spoken by the voice of Tim Blaney to Ally Sheedy, I think every...

# Andy Bantly on 21 Jul 2005 11:23 AM:

This may seem off topic but it is not, just transform your thinking back to a kinder, gentler time. ... If the MS C Runtime library support for the strxfrm() function took into account sort collation, e.g. dictionary vs. phonebook in some locales, then all things collapse onto themselves and the handy strcmp() wins the string compare battle. Sort collation is that multicultural blot that developers hate and end users love. It is found in the regional settings control panel applet and is fun to dink around with for testing your code. I believe that strxfrm should tokenize those letters that have multiple incarnations, like the 'ss' versus the German version, into the same token. Alas, too bad Microsoft is too busy trying to litter the playing field with another programming language.
It may seem like I put this lightly and comically but it is an important subject. Speed and efficiency have always been the earmarks of sorting and string comparison. The current implementations of string comparisons in the Platform SDK are overweight and slow and do not always handle sort collation.

# Michael S. Kaplan on 21 Jul 2005 11:40 AM:

Hi Andy,

Well, nice flamebait in any case. :-)

The people writing new programming languages are not standing in the way of the people who are working on collation (me and my team, mostly).

I am really not sure what you refer to in the last part about what is not handled.

But your original premise is incorrect -- there is no C runtime function that does it right, linguistically speaking, with as many options and features as the NLS functions provide....

# Tanveer Badar on 21 Dec 2007 2:46 PM:

Each use of override is incorrect, it should be overload and use IStemmer for derivatives. :)


referenced by

2007/08/31 Elegant? Beyond compare...

2005/07/05 'Need more input, Stephanie!'
