Optimized for English (oh, and also Invariant, and NOTHING ELSE) Redux

by Michael S. Kaplan, published on 2008/02/22 09:46 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/02/22/7846597.aspx


This post may not be of interest to all readers since it not only covers technical topics (which can easily turn off half my readers!) but does so in a slightly more sociological kind of way (which can easily turn off the other half!). Feel free to skip, as appropriate.

Although this post shares a conceptual framework with Optimized for English (oh, and also Japanese, and maybe a few others), that post was really about ClearType and the group that was at the time referred to most often as the Advanced Reading Technologies team, while this one is about string comparisons in the .NET Framework.

So we are thinking about two entirely different groups of people sitting in two entirely different [sets of] buildings in two entirely different divisions of Microsoft, and there is no relevant organizational or architectural connection between them....

My hints about this topic have been sprinkled throughout the blog, but one of the more obvious ones can be found in a side paragraph in A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time):

In another astounding burst of irony, there is an even quicker version of the string comparison function found in the .NET Framework which, due to a bug that I only accidentally found in the beginning of the Whidbey (2.0) product cycle, was never even called in versions 1.0 or 1.1. Subsequent to the decision to fix the bug by enabling the superfast function1, no fewer than five bugs were found by testers in the logic of this attempt to capture optimizable calls to CompareString-like code, due not only to the problem above but an unrelated obsession with making English faster even if it makes the more complicated languages a bit slower2....

Now some may think I was being a bit harsh, but the optimization is for string comparisons done on Unicode strings when the characters are all in the ASCII (<= U+007F) range. Basically, when the string is created (and remember that in .NET strings are immutable, so creation time is pretty much the only time the contents can really change!), it walks the string and determines whether it is one of these simple ASCII-content strings.
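To make that concrete, here is a minimal sketch (with made-up names, not the actual CLR source) of the kind of one-time walk that can happen at creation time, flagging a string whose code units all fall at or below U+007F:

#include <cstddef>

// Hypothetical helper, not the shipped code: walk the string once at creation
// time and decide whether every code unit is in the ASCII range, so the
// comparison code can later take the "fast" path without re-scanning.
bool IsAsciiOnly(const wchar_t* chars, std::size_t length)
{
    for (std::size_t i = 0; i < length; ++i)
    {
        if (chars[i] > 0x7F)
            return false;   // one non-ASCII code unit spoils the fast path
    }
    return true;
}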

Now obviously that is not all that needs to happen, since a word sort (as compared to a string sort, discussed earlier in A few of the gotchas of CompareString) affects the way the hyphen (U+002D) and the apostrophe (U+0027) sort, and both of them are down in that range. And there were other exceptions, too. So a big map of the "fast" characters had to be put together for this "default" case.
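One way to picture such a map -- purely a hypothetical sketch, not the shipped table -- is a 128-entry array with the word-sort special cases knocked out:

#include <array>

// Hypothetical sketch of the "fast character" map, not the real one: every
// ASCII character starts out fast, and then the word-sort special cases (and
// whatever other exceptions the real table carries) are removed.
static std::array<bool, 128> BuildFastCharMap()
{
    std::array<bool, 128> fast{};
    fast.fill(true);
    fast[0x2D] = false;   // HYPHEN-MINUS sorts specially in a word sort
    fast[0x27] = false;   // APOSTROPHE sorts specially in a word sort
    // ...plus the other exceptions mentioned above.
    return fast;
}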

If you are a developer, I'd like you to take a moment and think carefully about the consequences of building such a map and using it in the pre-processing. An incredible amount of effort is being expended here to look for the most optimizable case, given the very real belief that it will happen often enough to be truly worth optimizing.

Of course many locales have exceptions for how the letters in this range sort -- and so those other locales have to be excluded too. The decision about which locales qualify comes down to a rather simple macro in a C++ header file (you can actually find it yourself in the Rotor source if you have spent time in there; it is fairly hard to miss):

//This is the list of locales which we currently understand are fast.
//We should only do fast comparisons in one of these locales.
#define IS_FAST_COMPARE_LOCALE(loc) \
    (((loc)==LOCALE_INVARIANT) || (PRIMARYLANGID(loc)==LANG_ENGLISH))

Ah, so we are restricting ourselves to invariant and English (all of the English locales that are there) -- nothing else is "fast" here?
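Wiring it all together might look roughly like the sketch below -- hypothetical names, not the actual Rotor code, and with a plain wcscmp standing in for the optimized table-lookup walk:

#include <windows.h>
#include <cwchar>

// The same macro as in the Rotor header quoted above.
#define IS_FAST_COMPARE_LOCALE(loc) \
    (((loc)==LOCALE_INVARIANT) || (PRIMARYLANGID(loc)==LANG_ENGLISH))

int CompareMaybeFast(LCID locale,
                     const wchar_t* a, bool aIsFast,
                     const wchar_t* b, bool bIsFast)
{
    if (IS_FAST_COMPARE_LOCALE(locale) && aIsFast && bIsFast)
    {
        // Stand-in for the optimized path; the real code does cheap
        // per-character lookups rather than a plain ordinal compare.
        return wcscmp(a, b);
    }

    // Everyone else pays for the full linguistic comparison.
    switch (CompareStringW(locale, 0, a, -1, b, -1))
    {
        case CSTR_LESS_THAN:    return -1;
        case CSTR_GREATER_THAN: return  1;
        default:                return  0;
    }
}

The point of the sketch is just how narrow that gate is: every string pays for the creation-time walk, but only English and Invariant ever get to cash it in.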

But of all of the cultures that the .NET Framework supports, about half don't use the default sorting table (i.e. they have exceptions or compressions or reverse diacritics or whatever) -- say about 70 of them -- and a little over half of those (say 40 of them) have exceptions/compressions that touch this fast ASCII range3. Thus, of all of the cultures in the .NET Framework, just under 3/4 of them could use this optimization.

Yet it is essentially restricted to English and Invariant (which is basically English).

So you notice the team bends over backwards and is even willing to slow down string creation4 in order to find this optimized case -- but only for English (no one chose to build up a "fast locale map" for the almost 3/4 of cultures that could benefit from the same optimization). Do we need further proof that the framework comes from a US-based company? :-)

And then consider the fact that there was a 1.0/1.1 bug that kept the optimization from ever being used (once it was turned on rather than being yanked out, testers found several bugs in the sorting results, and each time I suggested the code be removed but was overruled -- it was a lot of fun being right, but not so much fun not being heeded!).

In my mind, the fact that the code had never been used (and thus probably had bugs) was proof enough -- the old code was good enough, and they could even have tried to remove some of the string initialization performance hit if they wanted to make all operations faster, rather than enabling an unproven optimization in string comparison. It did seem like there was an almost religious battle about the heresy of removing an optimized (but dead) code path merely because it was dead....

Now I have always looked at .NET as a full-size Skunk Works-type project from the very beginning and right on through Silverlight, with small teams of people working to produce a good proof of concept, then bulking the concept up after driving it, and then shipping it. The team is obviously huge and there is a lot of process there (so my Skunk Works notion is not a fully accurate picture of how things actually work in practice), but I mean it in terms of mindset and focus.

I mostly mean that as a compliment from an engineering standpoint, with the exception of international support issues, which often fall off in the proof-of-concept stage and don't always get fully added back later.

So what does this team have in common with the ClearType team? Well, clearly there is a particular scenario-based focus on a subset of the customer population in both cases; the .NET case is perhaps slightly worse, since I think we wrote a lot of the code here and own most of it now (even the parts we didn't write). Though on the other hand, much of the overall ownership of ClearType started from a small group that splintered off from the core typography team, which obviously was required to have a wider focus that included other languages -- so perhaps both teams are guilty of this scenario-narrowing effect?

Now I don't want to make too much of this, since in the end even the original code (which never used the optimization) was fast enough to work. But I do want to point out how the intent shapes the implementation in ways both subtle and not so subtle....

 

1 - My recommendation to yank it out, since it was never called anyway, was denied; I was overruled given the universal reluctance to remove an intentional optimization from the code, even one that was never used.
2 - An occupational hazard of a US-based software company. :-(
3 - As an example of one that would not be included, consider Farsi/Persian, which has exceptions but obviously none in the ASCII range.
4 - Anyone who thinks that walking the string is not a performance issue does not understand the lesson taught by That function is always faster! (well, except for that one case when it can actually be slower...)!

 

This blog brought to you by B (U+0042, LATIN CAPITAL LETTER B)

