The fallacy of comparing out of context

by Michael S. Kaplan, published on 2006/07/09 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/09/658454.aspx

Perhaps you have heard of the fallacy of quoting out of context, where support for an argument is produced by quoting a source incompletely, in a way that changes its meaning.

(for more info, see the Wikipedia article about the subject)

Well, this post is not about that.

Instead, it is about an analogous practice in the world of collation, one that functions like CompareString can unfortunately help to propagate.

If you call the function and pass a length (a count of UTF-16 code units) in the cchCount1 and/or cchCount2 parameters, the strings passed in via lpString1 and lpString2 may be longer than those lengths.

Now while it is possible that this is okay, it is likely to cause problems, because the actual comparison is done on sort elements, not on UTF-16 code units. If a length cuts a string off in the middle of a sort element, the comparison you are making may be invalid.

In other words, by comparing the strings without the context of all of the relevant text, you may be corrupting the meaning of the comparison, and returning the wrong result!

This actually came up recently in relation to Mohawk, one of the new locales in Vista. You see, in Mohawk the colon (:) is used as a diacritic, so that e and e: have two different primary weights. So what happens if you call CompareString like this:

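CompareStringW(MAKELCID(MAKELANGID(LANG_MOHAWK, SUBLANG_MOHAWK_MOHAWK), SORT_DEFAULT),
               0,          // dwCmpFlags: no special comparison flags
               L"file",    // lpString1
               4,          // cchCount1: all four code units of "file"
               wzFileUrl,  // lpString2
               4);         // cchCount2: only the first four code units of the URL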

where wzFileUrl is a string like file:///C:/Debuggers/debugger.chm, and the goal is to look for that "file" prefix? The call truncates the path at four code units. That matters because, with the colon included, the path contains e: rather than e, which in Mohawk is an entirely different letter from the one in the string you are looking for.

Now in this case it is a good thing since the goal here is obviously not a proper linguistic result, but it is easy to imagine actual strings returning equality when they are not equal and inequality when they are.

And what about canonically equivalent strings like "Å" (U+00c5) vs. "Å" (U+0041 U+030a)? If you pass a length of 1 for both strings then you are actually comparing U+00c5 and U+0041, which are never the same by the standards of any locale.

Problems like the one described in "When Notepad's Find doesn't" clearly come from assuming that "count of WCHARs" == "count of sort elements", and are caused by this very problem. Luckily, the Vista version of Notepad now uses FindNLSString to do its searching, which nicely avoids this particular problem.

And now it is on other developers to do the same -- and to not fall into the fallacy of comparing out of context!

(Special thanks to Mike Dolenga for the inspiration of some of the concepts in this post!)

This post brought to you by : (U+003a, a.k.a. COLON)
