Falling over the edge of a conceptual collation cliff

by Michael S. Kaplan, published on 2006/01/15 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/15/512928.aspx


With a blog title like Sorting It All Out I make no bones about my somewhat obsessive interest in collation. Well, this is unapologetically one of those interesting collation posts. :-)

You may recall if you are a regular reader that I have mentioned our collation tester Ryan Cavalcante before.

He found an interesting bug very late this last week that exists in managed code and which has been around since the very first version of the .NET Framework.

Remember when I was talking about sort elements and about how CompareString would treat Æ, a.k.a. U+00c6 a.k.a. LATIN CAPITAL LETTER AE, as if it were the two characters A and E?

Well, what would you expect the following two calls to return?

int idxA;
int idxE;

idxA = CompareInfo.GetCompareInfo("en-US").IndexOf("0123456789Æ123456789", "A");
idxE = CompareInfo.GetCompareInfo("en-US").IndexOf("0123456789Æ123456789", "E");

I mean, the rules say that if a string comparison would treat two things as being equal, then an IndexOf call should be able to find the character(s) too.

And of course in the world of attempting to be consistent there are occasionally times that you have to choose what you want to be consistent with (like I pointed out in The string is freaking empty!) but it seems like it would be a good idea to be consistent whenever possible.

Keeping that in mind, in those two calls above, idxA is 10 and idxE is -1 (the call returns the index within the first string where the second string can be found).
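
If you want to see the inconsistency for yourself, here is a small sketch that runs the comparison and the searches side by side. The class and method names (IndexOfProbe, Find) are made up for illustration, and I have deliberately left out expected output: the numbers above are what the .NET Framework returned when this was written, and other framework versions or platforms may well answer differently.

```csharp
using System;
using System.Globalization;

static class IndexOfProbe
{
    // "\u00C6" is Æ (LATIN CAPITAL LETTER AE), written as an escape so the
    // source file's encoding does not matter.
    const string Source = "0123456789\u00C6123456789";

    // Hypothetical helper: linguistic search for value in the test string.
    public static int Find(string value)
    {
        return CompareInfo.GetCompareInfo("en-US").IndexOf(Source, value);
    }

    static void Main()
    {
        CompareInfo ci = CompareInfo.GetCompareInfo("en-US");

        // 0 here means the comparison treats Æ and "AE" as equal.
        Console.WriteLine("Compare(Æ, AE) = " + ci.Compare("\u00C6", "AE"));

        // If the comparison says "equal", which of these searches should succeed?
        foreach (string value in new[] { "A", "E", "AE" })
        {
            Console.WriteLine("IndexOf(" + value + ") = " + Find(value));
        }
    }
}
```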

Ok, so there is definitely a bug here.

But then I stopped and wondered something about what we actually expect to be returned here. Which call has a bug?

It is easy to claim that we should always return an index, but depending on what you are trying to do with the index, it could be useless, since there is no substring to actually find.

Ok, let's take a step back and look at the unmanaged version of this, FindNLSString (new in Windows Vista).

This function has the same bug of course, but then there is the additional OUT parameter that returns the length of the found string. This makes REPLACE operations much easier and more flexible, but imagine what happens when you try to call it with those same strings I ran through IndexOf above.

You don't have to imagine too long, or even to try it if you are running the CTP. I'll tell you, there is another bug there (it is returning a somewhat random negative number at the moment!).

But what should it return? I mean, it seems like there is no good value to return there other than 0, since you can't ever return half of a character length (and even if you could it would not be useful).

So looking at both the managed and unmanaged versions, perhaps it is best for them to both fail -- meaning maybe the bug is in the first call returning a value at all. Because if you can't give a length for what you found, how can you say you found something at all?
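
To make that idea concrete, here is a toy model of it. This is emphatically not how CompareInfo or FindNLSString is implemented: the expansion table, the ToyCollation class and its Find method are all hypothetical, and the matching is plain ordinal matching on the expanded text. The only point is the rule at the end: a match counts only if it starts and ends on a whole character of the source string, so there is always a usable length to hand back.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ToyCollation
{
    // Hypothetical expansion table, for illustration only.
    static readonly Dictionary<char, string> Expansions = new Dictionary<char, string>
    {
        { '\u00C6', "AE" }, // Æ
        { '\u00E6', "ae" }, // æ
    };

    static string ExpandChar(char c)
    {
        return Expansions.TryGetValue(c, out string e) ? e : c.ToString();
    }

    // Returns the index and length (both in source characters) of the first
    // match that covers whole source characters, or (-1, 0) if there is none.
    public static (int Index, int Length) Find(string source, string value)
    {
        // Map every expanded offset that begins a source character back to that
        // character's index; the end of the expanded text maps to source.Length.
        var boundaries = new Dictionary<int, int>();
        var expanded = new StringBuilder();
        for (int i = 0; i < source.Length; i++)
        {
            boundaries[expanded.Length] = i;
            expanded.Append(ExpandChar(source[i]));
        }
        boundaries[expanded.Length] = source.Length;

        var needleBuilder = new StringBuilder();
        foreach (char c in value) needleBuilder.Append(ExpandChar(c));

        string haystack = expanded.ToString();
        string needle = needleBuilder.ToString();

        int p = haystack.IndexOf(needle, StringComparison.Ordinal);
        while (p >= 0)
        {
            // Accept the match only if both ends fall on source character boundaries.
            if (boundaries.TryGetValue(p, out int start) &&
                boundaries.TryGetValue(p + needle.Length, out int end))
            {
                return (start, end - start);
            }
            if (p + 1 > haystack.Length) break;
            p = haystack.IndexOf(needle, p + 1, StringComparison.Ordinal);
        }
        return (-1, 0);
    }

    static void Main()
    {
        string source = "0123456789\u00C6123456789";
        Console.WriteLine(Find(source, "A"));  // (-1, 0): "A" is only half of Æ
        Console.WriteLine(Find(source, "E"));  // (-1, 0): so is "E"
        Console.WriteLine(Find(source, "AE")); // (10, 1): all of Æ, one character long
    }
}
```

Under this rule both single-letter searches fail in the same way, and the "AE" search succeeds with a length of one source character, which is exactly the sort of value a REPLACE operation needs.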

As an aside, the calls that you would expect to succeed:

idxAE = CompareInfo.GetCompareInfo("en-US").IndexOf("0123456789Æ123456789", "AE");
idxAE = CompareInfo.GetCompareInfo("en-US").LastIndexOf("0123456789Æ123456789", "AE");

do in fact succeed, though the unmanaged FindNLSString analogues still have that bug with the bogus pcchFound value. But I think that is okay (other than the pcchFound bug of course!) since if you have the entire string in a capturable form then it makes sense to return success.

I think this might lead to a fairly consistent model....

Anyway, it was a fun bug to look into and even more fun to think about conceptually, since what the actual behavior should be requires real thought. I suspect this could even lead to some lively debates at work among interested parties next week!

 

This post brought to you by Æ (U+00c6 a.k.a. LATIN CAPITAL LETTER AE)

