Intelligent unmanaged string comparison

by Michael S. Kaplan, published on 2005/04/26 08:20 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/26/412079.aspx


If you look at the documentation for CompareString (but not LCMapString, though it probably ought to be there, too), there is a small security note in there:

Security Alert: Using this function incorrectly can compromise the security of your application. Strings that are not compared correctly can produce invalid input. Test strings to make sure they are valid before using them and provide error handlers. For more information, see Security Considerations: International Features.

That link about Security Considerations: International Features leads to an interesting discussion:

Comparison Functions

String comparisons can potentially present security issues. Because all comparison functions are slightly different, one function might report two strings as equal, but another function might consider them distinct. There are various functions that you can use to compare strings. The following are three examples of such functions.

  • lstrcmpi
  • lstrcmp
  • CompareString

The lstrcmpi function compares two character strings. The comparison is not case sensitive but is sensitive to the locale selected by the user in Control Panel. The lstrcmpi function does not perform byte comparisons. It compares strings according to the rules of the selected locale. The lstrcmpi function compares the strings by checking the first characters against each other, the second characters against each other, and so on until it finds an inequality or reaches the ends of the strings. The selected locale determines which string is greater (or whether the strings are the same). If no locale (language) is selected, the system performs the comparison by using default values. For some locales, such as Japanese, the lstrcmpi function might not be capable of comparing two strings. For more information, see CompareString.

The lstrcmp function is like the lstrcmpi function. The only difference is that it performs a case sensitive comparison.

CompareString is similar to lstrcmpi and lstrcmp except that its first parameter specifies a locale instead of using the user selected locale. Usually, CompareString, lstrcmp, and lstrcmpi evaluate strings character-by-character. However, many languages have multiple-character elements, such as the two-character pair 'CH' in Traditional Spanish. Because CompareString uses the locale passed in the locale parameter to identify multiple-character elements and lstrcmp and lstrcmpi use the thread locale, identical strings might not be found as equal. In addition, CompareString ignores undefined characters so it returns 0 (equal) for many string pairs that are quite distinct. A string might contain values that do not map to any character or it might contain characters with semantics outside the domain of the application, such as control characters within a URL. Test strings to make sure they are valid before using them and provide error handlers.

(Ignore the typo above, "so it returns 0 (equal)"; they are going to fix that to read "so it returns CSTR_EQUAL".)
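
To make the quoted point concrete, here is a minimal sketch of two comparisons disagreeing about the same pair of strings. It is written in Python purely as a compact model, not as Win32 code: `weightless_equal` is a hypothetical stand-in for a comparison that skips characters it has no data for, and it approximates "undefined" with the Unicode general category Cn. That approximation is an assumption, not a description of how the CompareString tables are actually built.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Compare code point by code point; nothing is skipped."""
    return a == b


def weightless_equal(a: str, b: str) -> bool:
    """Toy model of a comparison that gives undefined characters no weight.

    Assumption: "undefined" is approximated by general category Cn
    (unassigned or noncharacter) in this interpreter's Unicode database.
    """
    def strip_undefined(s: str) -> str:
        return "".join(ch for ch in s if unicodedata.category(ch) != "Cn")

    return strip_undefined(a) == strip_undefined(b)


expected = "admin"
supplied = "ad\uffffmin"  # U+FFFF is a noncharacter, category Cn

print(binary_equal(expected, supplied))      # False
print(weightless_equal(expected, supplied))  # True
```

A check made with one of these functions and enforced with the other is exactly the kind of gap the security note warns about.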

Regular readers of this blog will recognize many of the concepts that are discussed; from "Comparison confusion: INVARIANT vs. ORDINAL" to "The jury will give this string no weight", the issues here have all been covered. But it all boils down to intelligent use of the APIs. If you are trying to match the results of the file system (or of Win32 namespace objects like the names for events, named pipes, mutexes, etc.), then you should be uppercasing the string and then doing a binary comparison. If you are not, then you have to ask yourself why you are bothering to compare at all, since your comparison will not match the one that the operating system is about to do. It seems like common sense to me.
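
The uppercase-then-binary-compare advice can be sketched as follows. This is a Python model under stated assumptions rather than the operating system's implementation: `simple_upper` and `os_style_equal` are hypothetical names, and Python's per-character `str.upper()` stands in for the operating system's own uppercase table, which is not identical to it.

```python
def simple_upper(s: str) -> str:
    """Uppercase one code point at a time without ever changing the length.

    Assumption: str.upper() applied to a single character approximates the
    operating system's uppercase table. Any mapping that would expand to
    more than one character is skipped, so the character is left as it is.
    """
    out = []
    for ch in s:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def os_style_equal(a: str, b: str) -> bool:
    """Uppercase both strings, then compare them code point by code point."""
    return simple_upper(a) == simple_upper(b)


print(os_style_equal("ReadMe.TXT", "README.txt"))  # True
print(os_style_equal("readme.txt", "readme.tx"))   # False
```

No locale appears anywhere in the sketch. That is the design choice being illustrated: the result does not vary with the user's settings, unlike a comparison that follows linguistic rules.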

But then APIs like _wcsicmp do a lowercase comparison of strings, so what do I know? :-)

Ok, no fair to pick on an implementation that actually follows a standard; there are only two good reasons to uppercase here:

  1. The operating system does it for other purposes;
  2. The whole Georgian thing on Windows;

And there is a good reason to lowercase if you are doing full Unicode casing (which no one in Win32 or the CRT is): the Sharp S moves to two characters if you uppercase it, increasing the length of the string.
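
The length change is easy to see in any environment that does full Unicode casing. Python's `str.upper()` and `str.lower()` are used here only because they implement the full mappings; nothing below is a Win32 or CRT call.

```python
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

upper = sharp_s.upper()  # full Unicode uppercasing expands to two letters
lower = sharp_s.lower()  # lowercasing leaves it alone

print(len(sharp_s), len(upper), len(lower))  # 1 2 1
print(upper == "SS")                         # True

word = "Stra\u00dfe"
print(len(word), len(word.upper()), len(word.lower()))  # 6 7 6
```

A buffer sized for the original string is therefore always big enough for the lowercased form, but not necessarily for the uppercased one.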

So the CRT can hardly be blamed for not going down that road, when no one was really thinking too much about it then anyway, can they?

Now this whole security warning applies equally to LCMapString and sort keys, since they are designed to work the same way as string comparisons; any time they do not, we consider it a bug. Now if the bug is in LCMapString then we can't really change the result without changing the version number so we'd be more likely to break CompareString in the same way. Though in practice for as long as I have been here it is always CompareString that is broken, not LCMapString. Something to do with how much easier it is to make a mistake when you try to do less work, maybe? :-)
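
The invariant described here, that comparing sort keys must give the same answer as comparing the strings directly, can be sketched with the C library's analogous pair, reached through Python's `locale` module: `locale.strcoll` plays the role of the comparison and `locale.strxfrm` the role of the sort key. That pairing is an analogy assumed for illustration, not a description of how CompareString and LCMapString are implemented, and `sort_keys_agree` is a hypothetical helper.

```python
import locale


def sign(n: int) -> int:
    """Reduce any integer to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def sort_keys_agree(a: str, b: str) -> bool:
    """True when direct comparison and sort-key comparison give the same order."""
    direct = sign(locale.strcoll(a, b))
    key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
    via_keys = (key_a > key_b) - (key_a < key_b)
    return direct == via_keys


words = ["apple", "Apple", "banana", "apple pie", ""]
mismatches = [(a, b) for a in words for b in words if not sort_keys_agree(a, b)]
print(mismatches)  # an empty list means the two paths agree on this sample
```

A non-empty list would be the kind of disagreement the paragraph above calls a bug.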

I think what we need is a good way to match the operating system behavior that we can point to. People never read warnings that go on for paragraphs about best practices like this blog, but they do pay attention to "Use function YYYY rather than function XXXX for this particular scenario," if you say it in enough places.

Of course we'd have to figure out what to call it and all that kind of stuff.

Let me go think on this for a bit....

 

This post brought to you by "ᅣ" (U+1163, a.k.a. HANGUL JUNGSEONG YA)


# Ben on 26 Apr 2005 10:16 PM:

Wow, now that's something I didn't know! Thanks.

I wonder if it would help to have a "When to use this API?" section in the MSDN article. It could list typical uses, typical non-uses, and typical misuses. For CompareString, it might say:
uses: character element by character element comparison, skipping unknown characters.
non-uses: file-system comparisons, bit-wise comparisons.
mis-uses: thinking that NORM_IGNORECASE will get you a case-insensitive comparison of every character, including unknown characters.

This would be like the "Drug Facts" table for OTC drugs.

# Michael S. Kaplan on 27 Apr 2005 12:11 AM:

Ben -- Yep, I am going to work with the doc. writer (as well as with team members) to see what makes sense here....

referenced by

2005/05/08 Similar descriptions does not mean similar methodologies
