Why I think System.String.IndexOf(Char) sucks

by Michael S. Kaplan, published on 2007/02/18 01:27 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/17/1701561.aspx


Earlier I posted about Giving a character a new identity (by giving it some secondary weight).

Now that post, while true, only tells part of the story.

Now I am going to tell the other part....

Take the following code and you may be able to see where I am going before you even look at the results:

CompareInfo ci = CompareInfo.GetCompareInfo("ja-JP");
string st1 = "ヷ";

Thread.CurrentThread.CurrentCulture = new CultureInfo("ja-JP");
Console.WriteLine(ci.IndexOf(st1, "ワ"));
Console.WriteLine(ci.IndexOf(st1, 'ワ'));
Console.WriteLine(st1.IndexOf("ワ"));
Console.WriteLine(st1.IndexOf('ワ'));

string st2 = "\u0061\u030a";
Console.WriteLine(ci.IndexOf(st2, "a"));
Console.WriteLine(ci.IndexOf(st2, 'a'));
Console.WriteLine(st2.IndexOf("a"));
Console.WriteLine(st2.IndexOf('a'));

The results? They will be:

-1
-1
-1
0
-1
-1
-1
0

So what's the problem? Why does System.String.IndexOf(Char) behave differently than System.String.IndexOf(String), System.Globalization.CompareInfo.IndexOf(String, Char), and System.Globalization.CompareInfo.IndexOf(String, String), anyway?

Well, setting aside my disdain for all of the System.String shortcuts to globalization functionality that makes the real linguistics features of the System.Globalization namespace that much harder for developers both inside and outside of Microsoft to find (never mind the additional confusion about the confusing and incomplete flags they add), there is the fact that the System.String "shortcut" methods often contain actual shortcuts to try to be more performant, to try to keep from calling the "slower" globalization methods.

So this particular issue can be looked at as an over-optimization, a case where developers assumed that they would not need to call the "slower" method in this situation.

Were they wrong?

Well, in my view, yes. All of these shortcut methods are just plain bad if they ever do anything other than call the real methods in the System.Globalization namespace. Anything else makes for less maintainable code that requires modifying multiple bits if there are ever changes or problems to fix, and it is harder for testers to track all of these different places to verify correct behavior in.

Of course now I suppose it would be in some people's minds a breaking change to fix the errant method.

So let's make it more interesting and raise the stakes:

CompareInfo ci = CompareInfo.GetCompareInfo("sv-SE");
string st1 = "\u00e5";

Thread.CurrentThread.CurrentCulture = new CultureInfo("sv-SE");
Console.WriteLine(ci.IndexOf(st1, "a"));
Console.WriteLine(ci.IndexOf(st1, 'a'));
Console.WriteLine(st1.IndexOf("a"));
Console.WriteLine(st1.IndexOf('a'));

string st2 = "\u0061\u030a";
Console.WriteLine(ci.IndexOf(st2, "a"));
Console.WriteLine(ci.IndexOf(st2, 'a'));
Console.WriteLine(st2.IndexOf("a"));
Console.WriteLine(st2.IndexOf('a'));

The results here? You know in this "Swedish "A-Ring" case?

-1
-1
-1
-1
-1
-1
-1
0

So, that over-optimization is causing behavior differences in strings that are canonically equivalent in Unicode, to wit LATIN SMALL LETTER A WITH RING ABOVE versus LATIN SMALL LETTER A + COMBINING RING ABOVE.

And that is a bug, suggesting that just taking out this over-optimization case might be in everyone's best interests....

(Using the Swedish or Japanese results above is not required; it just makes the weirdness look worse. The bug is there either way)

 

This post brought to you by å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)


# Dean Harding on 18 Feb 2007 5:56 AM:

Actually, my opinion goes the other way: the System.String "shortcuts" should never have called the System.Globalisation methods.

It's not really a matter of "optimization" vs. "user expectations" because there really ARE some cases where you want the code-point behaviour.

Doing it that way would make System.String a simple "array of code-points" and the methods on it work that way. The System.Globalization methods are the actual lingustics methods.

Ah well, too late for all this anyway :-)

# Michael S. Kaplan on 18 Feb 2007 11:03 AM:

Hello Dean,

If people want the code point behavior, then the ORDINAL methodology is available. These shortcuts are actually designed to work linguistically, they just don't always do so....

# Mihai on 18 Feb 2007 1:27 PM:

<<make System.String a simple "array of code-points">>

Two opinions on this kind of things:

1. All functions handling one character should be removed. Completely. From .NET, Win32 API, C++. Because doing anything on one character is a problem: search, compare, changing case, you name it. Everything should be done on strings, return strings, and so on.

It is the only way to get correct linguistic results.

2. The storage should be separated from the string itself. You need access to the code points, you access the storage explicitly.

Then you will be able to do stuff like this:

  string str = "\u0061\u030a";

  str.length(); // gives you linguistic info

  str.storage.length(); // gives you storage info (code points)

The storage is locale-independent, the string is not. And the intention is always clear.

Ok, some more thinking on what can go wrong is needed, but these are the general ideas.

# Michael S. Kaplan on 18 Feb 2007 2:41 PM:

Well, actually, we use a hybrid approach:

1) For most purposes we use your #1

2) For NLS collation functions that take an LCID, we do #2 plus (we include other constructs like sort elements).

# Dean Harding on 18 Feb 2007 7:11 PM:

Mihai: I don't think I disagree, specifically. My suggestion would just have been that the System.String class be your "string.storage" and SOMETHING ELSE be the linguistic stuff.

I guess that's just a product of where I usually work, though. Most of my string manipulation stuff (in my day job) come from manipulating email and SMS messages, both of which are predominately ASCII-based (at least, at the level I work on - the raw protocols). If most of my work was on web pages, or a text editor or something then I suppose I'd go with your suggestion...

You can't please everybody :-)

# Michael S. Kaplan on 19 Feb 2007 12:44 AM:

The only problem with THAT idea (which by the way there are people on the BCL team who would have preferred that approach in retrospect) is that there would be no linguistic support in the vast majority of apps.

And I just can't be a complerte fan of that sort of approach.... :-)

# Mihai on 19 Feb 2007 1:00 PM:

<<My suggestion would just have been that the System.String class be your "string.storage" and SOMETHING ELSE be the linguistic stuff.>>

Technically, it does not matter how you call things.

But for the perception of the one reading the code, it does (think #define BOOL int, before bool was standard).

A string is something containing text, and text is associated with linguistic properties in one's mind.

And since System.String has stuff like ToUpper, it is already "too dirty" to be a plain storage (because ToUpper is a locale-sensitive operation).

So I really think that String *is* the right thing for linguistic behavior.

# Mihai on 19 Feb 2007 1:06 PM:

<<Well, actually, we use a hybrid approach>>

Maybe in the implementation. But the idea was to make this explicit, for all programmers to see, not just an internal representation thing.

When I see str.length() and str.storage.length(), the intention becomes instantly clear, without even reading the doc.

It is probably too late to do this, without breaking backward compatibility. And I was also talking about C++, which is outside MS control :-)

And the idea was philosophical anyway. I don't really expect that <<All functions handling one character should be removed. Completely. From .NET, Win32 API, C++.>>

Who am I, who's going to listen to me? :-D

# Michael S. Kaplan on 19 Feb 2007 1:43 PM:

So I can't tell whether you think the anomaly I mentioned in this post about System.String.IndexOf(Char) is a bug to be fixed or a backcompat issue to be left alone? :-)

# Mihai on 19 Feb 2007 1:56 PM:

Bug :-)

Since String is a linguistic thing, I would expect linguistic behavior.

If you ever expose something like System.String.Storage.IndexOf(Char), then that should work on coding units.

# Mihai on 19 Feb 2007 2:04 PM:

What I think would make sense (at least for me :-)

 string st2 = "\u0061\u030a";

 // Linguistic behavior
 ci.IndexOf(st2, "a"); // -1

 // Remove as per rule #1
 ci.IndexOf(st2, 'a'); // undefined API error

 // Linguistic behavior
 st2.IndexOf("a"); // -1

 // Remove as per rule #1
 st2.IndexOf('a'); // undefined API error

 // Add this API, with non-linguistic behavior
 // and not affected by CultureInfo
 // working on coding units
 st2.Storage.IndexOf("a"); // 0

 // DO NOT add this API,
 // because non-linguistic behavior
 // in a CultureInfo context is dumb
 ci.Storage.IndexOf("a"); // undefined API error

# Mihai on 19 Feb 2007 2:07 PM:

<<System.String.IndexOf(Char) is a bug to be fixed or a backcompat issue to be left alone?>>

I think I did not answer the question.

It is clear it is a bug, but to fix, or not to fix, this is the question? Sounds like you are trying to push me in a corner :-)

Well, if it can be fixed without breaking compatibility, then yes, fix it :-)

Check with Raymond Chen :-D

# anupam on 24 Jul 2008 8:33 AM:

Sting str="C:\Documents and Settings\asriv5\Desktop\Login.jsp";
index i=str.lastIndexOf("\"); // not working why ???

plz give me the sol...

anupam


Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day