A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

by Michael S. Kaplan, published on 2007/10/09 10:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/09/5375754.aspx

Previous posts in this series:

Up until now I have been mainly talking about sortkey values and the underlying sort weights in Windows that are the source of them.

But this post is going to talk about something other than LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag, that wonderful function that will take any string and based on the modifying flags you pass will return a deterministic binary value that can be used to represent it any time you want to compare it against other binary value created the same way.

Because as cool as all of that is, it takes time!

I am not even done describing all of the different factors that impact sort key creation and it is already quite obvious that it is involved.

There are clearly times when you are comparing two strings where even though you might want all of the data that sits behind the process to be brought to bear that you really don't want to have to do quite so much work.

Isn't there some lazier version of the process that won't be all anal retentive staying until the bitter end, that will cut and run as soon as it knows the answer?

Yes, in fact there is!

And we call it CompareString/CompareStringEx!

Their job is to work for as long as they to have to to get you the answer to the "Which Comes First?" question, and no longer.

So you can think of it as a real lazybones, cutting corners at every step it can get away with.

You may have met people like this before in your life; I wonder what your opinion of them was?

I mean, CompareString and CompareStringEx are not the kids who used to sit in the front of the bus, report you for not having a hall pass, and getting you in trouble.

But they are clearly the kids who used to get away with stuff without always putting in the time that you did, following the rules....

We thought they were so very cool growing up, and maybe we were right.

Of course, as I approach my 20th high school reunion next year, curious as to how everyone is doing (perhaps even curious enough to attend!), I can't help but wonder about the price one pays for going through life, always taking the shortest way out of getting the work done.

What happens is that mistakes are sometimes made.

Forgetting to tag second base, omitting the dot on the i or the cross on the t, however you want to think if it. It maybe wasn't intentional, but when you get called out, you have to take your licks and try to do better next time.

And to be honest, it happens here too!

There are times that the results of comparing the sort keys returned on two strings run through LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag returns different results than CompareString/CompareStringEx looking at the same two strings (#12 from How to track down collation bugs).

And in almost every case (in fact, offhand the only exception I can recall is this bug) an objective analysis of the results tends to consistently show that when the results between these two methodologies are different, it is CompareString/CompareStringEx that is wrong.

Kind of ironic (when you consider the importance that SQL Server places on quality of results) that SQL Server's collation is really based on their CompareString snapshot's results! :-(

I wonder at times if one were going to take a step back and implement a brand new function from scratch (something on my mind these days, at least in theory) what one would do differently to keep the whole "take the easy way out" mentality from hurting the goal of correct results....

Imagine if such a function had a design goal of 100% compatibility with the LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag comparison results....

Never mind, it won't happen (though it would an interesting design to architect); I am trying to fix model flaws here that deny the 100% compatibility goal. Sorry!

The other occasional kinds of bugs that are CompareString/CompareStringEx-specific are #4, #6, and #7 from How to track down collation bugs and again represent cases where the "do it fast" methodology around CompareString/CompareStringEx can lead to problems (though #6 type bugs are never found by SQL Server since they always pass the length and thus never even touch the (technically even faster) code that the -1 lengths can sometimes give).

In another stounding burst of irony, there is an even quicker version of the string comparison function found in the .NET Framework which, due to a bug that I only accidently found in the beginning of the Whidbey (2.0) product cycle, was never even called in versions 1.0 or 1.1. Subsequent to the decision to fix the bug by enabling the superfast function1, no fewer than five bugs were found by testers in the logic of this attempt to capture optimizable calls to CompareString-like code, due not only to the probelm above but an unrelated obsession with making English faster even if it makes the more complicated languages a bit slower2....

Maybe I am feeling kind of philosophical here (and I a bit too much of a geek to get all into the karma of code here!), but would this represent some kind of proof that taking the approach of cutting corners is itself inherently more likely to cause problems?

TWIST: In truth, #4, #6, #7, and #12 of How to track down collation bugs are only responsible for half the bugs I worked on in Vista in these functions. I'll talk about the other half in an upcoming post3. :-)


1 - My recommendation to yank it out since it we never called anyway was denied, but I was overruled given the universal reluctance to remove an intentional opimization from the code, even one that was never used.
2 - An occupational hazard of a US-based software company. :-(
3 - Foreshadowing -- truly a sign of quality literature.


This post brought to you by (U+246c, a.k.a. CIRCLED NUMBER THIRTEEN)

no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2010/09/07 Refusing to ignore some particular character's width isn't [always] an act of discrimination…

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2008/02/22 Optimized for English (oh, and also Invariant, and NOTHING ELSE) Redux

2007/10/12 That function is always faster! (well, except for that one case when it can actually be slower...)

go to newer or older post, or back to index or month or day