I agree with you 100%. But we're both wrong (according to the spec)

by Michael S. Kaplan, published on 2010/12/22 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/12/22/10107967.aspx


It was last week, in response to "You can't ignore crap and hope it won't cause problems...", that Cheong commented:

Yet in the question, we'd expect tar.IndexOf(r) to return -1 because the content of r does not exist in tar. I can imagine having it return 0 will cause some infinite loop problem in certain data stream processing functions if they're lazy enough to use string manipulation functions to process data.

This ends up being an interesting design decision!

I'll explain how things ended up where they did.

First comes the easy question: What do you return for "meow".IndexOf("") exactly?

The explicit decision that was made was that every string both implicitly starts and ends with the empty string.

Thus returning 0 here "makes sense" by that design.

Ignore the "".IndexOf("") inconsistency here, of course!

I believe Java does the same thing, though it has been a while since I have done much with Java. Perhaps someone else can confirm.
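For anyone who wants to check the Java side, here is a minimal sketch using only java.lang.String. The class name EmptyNeedleDemo and the find helper are made up for illustration, and these are plain ordinal searches rather than the linguistic comparisons discussed further down.

```java
class EmptyNeedleDemo {
    // Hypothetical helper: index of the first ordinal occurrence of needle.
    static int find(String haystack, String needle) {
        return haystack.indexOf(needle);
    }

    public static void main(String[] args) {
        System.out.println(find("meow", ""));        // 0: the empty string "starts" the string
        System.out.println(find("", ""));            // 0: even when the string itself is empty
        System.out.println("meow".lastIndexOf(""));  // 4: and it "ends" the string too
        System.out.println("meow".startsWith(""));   // true
        System.out.println("meow".endsWith(""));     // true
    }
}
```

Note that lastIndexOf reports the string's length, an index with no character behind it, which fits the "implicitly starts and ends with the empty string" design.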

Now I think the design is kind of stupid, for what it is worth. For the same reason Cheong was thinking -- the possibility of infinite loops.
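To make the infinite loop worry concrete, here is a hedged sketch of the kind of search loop that gets into trouble. OccurrenceCounter and count are hypothetical names, and returning 0 for an empty needle is just one possible policy (throwing would be another).

```java
class OccurrenceCounter {
    // Counts non-overlapping ordinal occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        // Without this guard the loop below never terminates for an empty
        // needle: indexOf("", i) always reports a match, and advancing by
        // needle.length() == 0 never moves the search position forward.
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = haystack.indexOf(needle);
        while (idx != -1) {
            count++;
            idx = haystack.indexOf(needle, idx + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("hiss hiss hiss", "hiss")); // 3
        System.out.println(count("meow", ""));               // 0, thanks to the guard
    }
}
```

The same hazard applies to a needle that is not empty but is treated as empty by the search, which is where the zero-weight strings below come in.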

In fact, the very first version of FindNLSString that I checked in had behavior I believed to be more intuitive, but it was actually my manager at the time who came to me shortly thereafter and mentioned I was not being consistent with .Net. And since that was the whole reason FindNLSString was being added, this was a blocking issue.

Now while grumbling and doing the research to get the behavior consistent (I was doing both at the same time since consistency with what I thought of as incorrect design is a worse sin than being half right), I found several inconsistencies in .Net as well. That manager found these inconsistencies very frustrating (though in truth he isn't the one who caused the problem; the parts he wrote were consistent), and he jumped in to fix the managed code to be consistent while I fixed the [new] native code to have the same behavior that he was busy making sure would be consistent.

Anyway, where was I?

Oh yeah, with "hiss".IndexOf("") returning 0.

Now when you have strings with no weight, they compare as linguistically equal to the empty string.

Thus "\uFFFD".Compare("") is expected to return 0.

Now there are some standards bodies in parts of the world I am not going to name at this moment that would take statements like:

    "hiss".IndexOf("") == 0
    "\uFFFD".Compare("") == 0

and then make the claim that

    "\uFFFD".IndexOf("") != 0

but for the sake of a fragile attempt at consistency, this route was not taken -- and thus the zero length string is indeed assumed to adorn the front of that string.

Native code and managed code still look at things that way, and huge chunks of the checkin suite verify this behavior is not broken by well-meaning developers who might try and "fix bugs" without realizing that they aren't considered bugs....

So, to summarize the point to Cheong, I agree with you 100%. But we're both wrong according to the spec.

Perhaps the spec was wrong, but I'm pretty sure taking that route with my changes would have created an uncomfortable working environment for me back then, and I doubt I would have won the argument in the long run anyway.... :-)


McDowell on 23 Dec 2010 3:52 AM:

RE: Java

"foo".indexOf("") == 0
"".indexOf("") == 0
"\uFFFD".indexOf("a") == -1
"\uFFFD".compareTo("") > 0

Michael S. Kaplan on 23 Dec 2010 4:32 AM:

Oh yes, now I remember -- StartsWith/EndsWith in Java do this implicit "there's an empty string there" thing but IndexOf does not; in .Net, StartsWith/EndsWith actually *use* IndexOf/LastIndexOf to get their results, so the choice was either to have them be inconsistent like Java (where a string can start with something that returns no index) or to have them both work the same way and just have this problem of returning an index that isn't valid.

I don't agree with the choice that .Net's design made here, but it is now consistent (as I suggest in this blog), at least. I personally find both designs to be weird, and by removing that arbitrary "there is an invisible empty string in front of every string" thing, all of these methods could have been made entirely consistent with each other and with common sense.

So really both Java and .Net stink here, though for slightly different reasons. :-)

Michael S. Kaplan on 23 Dec 2010 4:34 AM:

Also, Java does not give U+fffd zero weight in collation, so you have to do the last two tests with some other character that Java treats like nothing is there....

Cheong on 27 Dec 2010 6:10 PM:

I don't know, perhaps it does make sense to have a different comparison rule that treats these zero weight characters as if they have weight when the whole string contains *only* zero weight characters. I think this should match the logic most of us are expecting.

When you compare something with two bowls of water, you can possibly ignore the case where I'm adding a drop of water to one side. But if all you have to compare is that drop of water, you're not expected to ignore it.

So in my opinion, it's the spec that has to be fixed if this would introduce inconsistency.

Michael S. Kaplan on 27 Dec 2010 11:41 PM:

You have to remember that in the database situation, you are not comparing strings, you are building sort keys -- so you have to provide a weight that will always be there....
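As a rough illustration of the sort key point, here is a sketch using Java's java.text.Collator and CollationKey rather than the Windows or SQL Server APIs. SortKeyDemo and compareViaKeys are hypothetical names. Each key is built from one string on its own, before anyone knows what it will be compared against, and comparing two keys is documented to agree with comparing the original strings with the same collator.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

class SortKeyDemo {
    // Compares two strings by way of their precomputed sort keys and
    // returns -1, 0 or 1.
    static int compareViaKeys(Collator collator, String a, String b) {
        CollationKey keyA = collator.getCollationKey(a); // built from a alone
        CollationKey keyB = collator.getCollationKey(b); // built from b alone
        return Integer.signum(keyA.compareTo(keyB));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        System.out.println(compareViaKeys(collator, "apple", "banana")); // -1
        // The key-based answer has the same sign as the direct comparison.
        System.out.println(
            Integer.signum(collator.compare("apple", "banana"))
                == compareViaKeys(collator, "apple", "banana"));
    }
}
```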

