The string is freaking empty!

by Michael S. Kaplan, published on 2005/05/19 08:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/19/419981.aspx

Not too long ago, we picked up a new tester (new to the area, not to testing in general) to own collation for Windows and the .NET Framework.

Now I will readily admit that this is usually news that I take about as well as getting a new hair stylist, which is to say not well at all¹. It is a hard area, and it usually takes a long time to get up to speed.

Remember back in January when I explained why I thought international test is an art (and why there are few fine artists)? Well, multiply that by infinity, take it to the depths of forever, and you will still barely have a glimpse of what the story is with collation testing.

So usually getting a tester who is new to this area sucks, even if it is not a new tester. Because if the tester is my only defense against complete and utter suckage, then what are the odds of getting an artist right out of the gate?

All I can say is that I should start buying lottery tickets. It turns out that Ryan Cavalcante is a freaking Rembrandt. :-)

He listened while I explained some of the basic principles, nodded, and went back to his office. I assumed I would not hear about it for a while until he started reporting dumb issues that testers from other teams tend to report that I have to be patient about. But if that was my expectation, then Ryan disappointed, big time.

Within days he was able to report all kinds of edge cases that violated those principles. A few of these were known issues, or exceptions that people in his shoes find maybe 3-6 months later (if they ever find them -- most do not).

But some of them were actual bugs, and good ones, too. Good in the sense of longstanding but hard to find (since no one else had yet reported them), and impressive. I'd have probably given him a raise if I were important enough to have any say in that sort of thing, if for no other reason than to make sure he did not lose interest and leave....

Anyway, this post is not about one of those genuine bugs, but it is about one of those "known issues" I mentioned.

The principle here was simple -- ideally, the result of any call to CompareStringW should be the same as getting the sort keys of the two strings from LCMapStringW and comparing those keys byte by byte.
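To make that principle concrete, here is a minimal Python sketch of the kind of check a tester might run. Both `toy_compare` and `toy_sort_key` are hypothetical stand-ins invented for illustration (a two-level, letters-then-case ordering); they are not CompareStringW or LCMapStringW, and they say nothing about how Windows actually weights characters. The only point is the shape of the invariant: comparing directly and comparing keys should agree.

```python
def toy_compare(a, b):
    """Hypothetical direct comparison: returns -1, 0, or 1 without building keys."""
    # Pass 1: the first case-insensitive difference decides.
    for x, y in zip(a, b):
        if x.lower() != y.lower():
            return -1 if x.lower() < y.lower() else 1
    # One string is a prefix of the other: the shorter one sorts first.
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    # Pass 2: the letters match, so the first case difference decides
    # (lowercase before uppercase in this toy ordering).
    for x, y in zip(a, b):
        x_upper, y_upper = x != x.lower(), y != y.lower()
        if x_upper != y_upper:
            return -1 if y_upper else 1
    return 0


def toy_sort_key(s):
    """Hypothetical sort key: all letter weights first, then all case weights."""
    primary = [ch.lower() for ch in s]
    case = [ch != ch.lower() for ch in s]
    return (primary, case)


def consistent(a, b):
    """True when the direct comparison and the key comparison agree."""
    ka, kb = toy_sort_key(a), toy_sort_key(b)
    key_result = (ka > kb) - (ka < kb)
    return toy_compare(a, b) == key_result


if __name__ == "__main__":
    samples = ["", "a", "A", "ab", "Ab", "b"]
    mismatches = [(x, y) for x in samples for y in samples if not consistent(x, y)]
    print("mismatches:", mismatches)
```

Note that the sample list includes the empty string on purpose; in this toy model both paths handle it and it simply sorts before everything else.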

In this particular case, he was noting that if either of the strings is of zero length, CompareString will handle it and treat it as a very light string, whereas LCMapString with the LCMAP_SORTKEY flag will fail with ERROR_INVALID_PARAMETER.

The problem is that in this case, LCMapString is not being consistent with CompareString; it is being consistent with itself!

It almost goes without saying that when asking for LCMAP_UPPERCASE, LCMAP_LOWERCASE, or any of the other LCMAP_* mappings, it is reasonable to treat a zero-length string as an error. So the sort key just ends up in the same mix. Being consistent with its neighbors just seemed like a better idea than being consistent with the function across the street. :-)

Now I suppose the sort key for a zero-length string would be "01 01 01 01 00" if the call did work. It is the de facto way to represent no weight whatsoever that would be consistent with the documentation of sort keys. I suppose databases must special case this since they obviously cannot consider a zero-length string to be equal to NULL (which would fail both function calls).
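As a rough illustration of why "01 01 01 01 00" reads as "no weight whatsoever", here is a small Python sketch of a key laid out as weight levels joined by 0x01 separators and ended by a 0x00 terminator. The level count of five is an assumption (it is simply what four separators imply), `layout_sort_key` is a hypothetical helper, and the weight bytes in the example are placeholders rather than real Windows weights.

```python
LEVELS = 5          # assumption: four 01 separators imply five weight levels
SEPARATOR = 0x01    # byte between levels
TERMINATOR = 0x00   # byte that ends the key


def layout_sort_key(level_weights):
    """Join per-level weight bytes with 0x01 separators and end with 0x00."""
    if len(level_weights) != LEVELS:
        raise ValueError(f"expected {LEVELS} levels, got {len(level_weights)}")
    out = bytearray()
    for index, weights in enumerate(level_weights):
        if index:
            out.append(SEPARATOR)
        out.extend(weights)
    out.append(TERMINATOR)
    return bytes(out)


def empty_string_key():
    """The key for a string with no weights at any level."""
    return layout_sort_key([b""] * LEVELS)


def as_hex(key):
    """Format key bytes the way the post writes them."""
    return " ".join(f"{byte:02X}" for byte in key)


if __name__ == "__main__":
    print(as_hex(empty_string_key()))  # 01 01 01 01 00
```

With this layout, and assuming every real weight byte is larger than the separator, the empty key compares lower than any key that carries a weight, which matches the idea of a zero-length string being a very light string.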

Calls to the managed method, CompareInfo.GetSortKey(), do not suffer from that need to be consistent, though it also does not return that mythical "01 01 01 01 00" I might have liked -- instead it returns a zero-length byte array (it throws a System.ArgumentNullException if you pass null to it). Ah well, you can't win 'em all.

Now there are other cool testers in this group; I'll talk about them some other time. But people who do the right thing with collation tend to get good billing....

1 - Autumn, the woman who used to cut my hair, has retired after getting married. Which is really awful considering what a hard time I have finding someone who cuts my hair in a way I like.

This post brought to you by U+0000 (a.k.a. NULL)
A character that often gets the last word in any string of characters.

What a great testimony to the difficulty and the art of the tester. Thanks for writing this.

My pleasure, Ben! I have a very healthy respect for good testing, especially in an area that is as complex as this one. :-)

Only tangentially relevant, but I couldn't let it pass:

"I suppose databases must special case this since they obviously cannot consider a zero-length string to be equal to NULL [...]"

This is, I believe, what Oracle does. Clearly not because they are using your string compare functions, but nevertheless...

I also hate changing places to get hair cuts, and I just go to cheap barbers.
