A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

by Michael S. Kaplan, published on 2007/10/09 10:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/09/5375754.aspx

Up until now I have been mainly talking about sortkey values and the underlying sort weights in Windows that are the source of them.

But this post is going to talk about something other than LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag, that wonderful function that will take any string and based on the modifying flags you pass will return a deterministic binary value that can be used to represent it any time you want to compare it against other binary value created the same way.

I am not even done describing all of the different factors that impact sort key creation and it is already quite obvious that it is involved.

There are clearly times when you are comparing two strings where even though you might want all of the data that sits behind the process to be brought to bear that you really don't want to have to do quite so much work.

Isn't there some lazier version of the process that won't be all anal retentive staying until the bitter end, that will cut and run as soon as it knows the answer?

Their job is to work for as long as they to have to to get you the answer to the "Which Comes First?" question, and no longer.

So you can think of it as a real lazybones, cutting corners at every step it can get away with.

You may have met people like this before in your life; I wonder what your opinion of them was?

I mean, CompareString and CompareStringEx are not the kids who used to sit in the front of the bus, report you for not having a hall pass, and getting you in trouble.

But they are clearly the kids who used to get away with stuff without always putting in the time that you did, following the rules....

Of course, as I approach my 20th high school reunion next year, curious as to how everyone is doing (perhaps even curious enough to attend!), I can't help but wonder about the price one pays for going through life, always taking the shortest way out of getting the work done.

Forgetting to tag second base, omitting the dot on the i or the cross on the t, however you want to think if it. It maybe wasn't intentional, but when you get called out, you have to take your licks and try to do better next time.

And in almost every case (in fact, offhand the only exception I can recall is this bug) an objective analysis of the results tends to consistently show that when the results between these two methodologies are different, it is CompareString/CompareStringEx that is wrong.

Kind of ironic (when you consider the importance that SQL Server places on quality of results) that SQL Server's collation is really based on their CompareString snapshot's results! :-(

I wonder at times if one were going to take a step back and implement a brand new function from scratch (something on my mind these days, at least in theory) what one would do differently to keep the whole "take the easy way out" mentality from hurting the goal of correct results....

Imagine if such a function had a design goal of 100% compatibility with the LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag comparison results....

Never mind, it won't happen (though it would an interesting design to architect); I am trying to fix model flaws here that deny the 100% compatibility goal. Sorry!

The other occasional kinds of bugs that are CompareString/CompareStringEx-specific are #4, #6, and #7 from How to track down collation bugs and again represent cases where the "do it fast" methodology around CompareString/CompareStringEx can lead to problems (though #6 type bugs are never found by SQL Server since they always pass the length and thus never even touch the (technically even faster) code that the -1 lengths can sometimes give).

In another stounding burst of irony, there is an even quicker version of the string comparison function found in the .NET Framework which, due to a bug that I only accidently found in the beginning of the Whidbey (2.0) product cycle, was never even called in versions 1.0 or 1.1. Subsequent to the decision to fix the bug by enabling the superfast function¹, no fewer than five bugs were found by testers in the logic of this attempt to capture optimizable calls to CompareString-like code, due not only to the probelm above but an unrelated obsession with making English faster even if it makes the more complicated languages a bit slower²....

Maybe I am feeling kind of philosophical here (and I a bit too much of a geek to get all into the karma of code here!), but would this represent some kind of proof that taking the approach of cutting corners is itself inherently more likely to cause problems?

TWIST: In truth, #4, #6, #7, and #12 of How to track down collation bugs are only responsible for half the bugs I worked on in Vista in these functions. I'll talk about the other half in an upcoming post³. :-)

1 - My recommendation to yank it out since it we never called anyway was denied, but I was overruled given the universal reluctance to remove an intentional opimization from the code, even one that was never used.
2 - An occupational hazard of a US-based software company. :-(
3 - Foreshadowing -- truly a sign of quality literature.

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

2010/09/07 Refusing to ignore some particular character's width isn't [always] an act of discrimination…

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2008/02/22 Optimized for English (oh, and also Invariant, and NOTHING ELSE) Redux

2007/10/12 That function is always faster! (well, except for that one case when it can actually be slower...)