CompareString prefers meaningful strings

by Michael S. Kaplan, published on 2005/02/02 02:26 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/02/365251.aspx

Another reason why international test is not for amateurs....

Like they say at despair.com: "When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do."

It does not tend to be a problem on this team. But when other teams call our APIs, they somehow get it into their heads that, as a part of testing their component, they should test the API too. And not understanding how the APIs work, they start building random Unicode strings and passing them to CompareString.

Now CompareString is an API that was built to handle actual linguistically meaningful strings, not whatever random crap is generated. And while I will not claim that such a process cannot find problems, I can claim that this is not the sort of core scenario that causes me to lose sleep at night the way genuine bugs that might affect customers will....

An example of this happened over a year ago, in the newsgroups:

I've found that with certain Unicode strings, CompareStringW seems to be acting very strangely - you get behavior like this:

Strangely is a relative term, especially in a case where you are randomly generating strings....

A < B
B < C
C < A

or even:
A < B
B < A

I will admit that both are not so great. But you have to understand how the collation data is created and what it represents.

The goal is to give a way to sort every part of the Unicode BMP (basic multilingual plane), according to some particular selected locale. Any time a code point is not usefully defined in the table (e.g. it is not defined in Unicode, it is not a language/script that Windows has useful data for, or it is intentionally not given weight), it will not give useful linguistic information.

In other words, comparing random crap can give random crap results. :-)
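A rough sketch of how one might detect the two symptoms from the newsgroup post (A < B and B < A, or the A < B < C < A cycle) for any comparison function. Python is used only to keep the sketch platform independent. The names `find_order_violations` and `cyclic_compare` are hypothetical, and `compare` is an assumed stand-in for a wrapper around CompareString that returns a negative, zero, or positive number.

```python
from itertools import combinations


def find_order_violations(strings, compare):
    """Return ordering violations found among `strings`.

    `compare(a, b)` is an assumed stand-in for the real comparison call;
    it must return a negative, zero, or positive number.
    """
    violations = []

    # Symptom 1: A < B and B < A at the same time.
    for a, b in combinations(strings, 2):
        if compare(a, b) < 0 and compare(b, a) < 0:
            violations.append(("both-less", a, b))

    # Symptom 2: A < B, B < C, and C < A. Each unordered triple has two
    # possible cycle orientations, so both are checked once.
    for a, b, c in combinations(strings, 3):
        for x, y, z in ((a, b, c), (a, c, b)):
            if compare(x, y) < 0 and compare(y, z) < 0 and compare(z, x) < 0:
                violations.append(("cycle", x, y, z))

    return violations


def cyclic_compare(a, b):
    """Deliberately broken comparator, for demonstration only."""
    less = {("A", "B"), ("B", "C"), ("C", "A")}
    if (a, b) in less:
        return -1
    if (b, a) in less:
        return 1
    return 0


if __name__ == "__main__":
    print(find_order_violations(["A", "B", "C"], cyclic_compare))
```

Running a check like this over a real word list says something about the scenarios you care about. Running it over random code units mostly says something about the code units.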

These strings are randomly generated Unicode strings, so it may be that the problematic strings contain characters that are either unused or in certain parts of the Unicode space that are reserved (something similar to the private use space, maybe). So it may be that CompareStringW works fine for all real-world strings that we'd ever encounter. Still, it's a bit unsettling to see CompareStringW return values that are so obviously wrong.

See above. But I will plow through the examples too, below.

A specific example - all three calls to CompareStringW return CSTR_LESS_THAN:

A = 1B37 1D96 4516
B = 30FE 4113 67BE
C = 0747 4443 40E6

Are there any errors here? From what I can tell, the three strings are all legal (null-terminated) UTF-16 strings - they're not ill-formed.

Well, string A is two code points not in Unicode (which thus have no weight) and an Extension A ideograph (no weight prior to XP, near the end of the table in XP and later).

String B starts with a Katakana iteration mark that affects the character before it and which would never start a string, another Extension A ideograph, and a standard CJK ideograph.

String C is made up of a Syriac letter and two more Extension A ideographs.

SUMMARY: All three are nonsense strings and nothing useful can come from testing with them.
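The kind of inspection done above can be approximated with a few lines of code. This is a minimal sketch, assuming Python's bundled `unicodedata` module as the source of character data. That is not the Windows sorting table, and it tracks a much newer Unicode version than existed in 2005, so code points that were undefined when this was written may well be reported as assigned today. The helper name `describe_code_points` is hypothetical.

```python
import unicodedata

# The three strings from the newsgroup post.
STRING_A = "\u1B37\u1D96\u4516"
STRING_B = "\u30FE\u4113\u67BE"
STRING_C = "\u0747\u4443\u40E6"


def describe_code_points(text):
    """Return (code point, general category, name) for each character.

    Category "Cn" means unassigned in the Unicode version that this
    Python build ships with; "Co" is private use, "Cs" is a surrogate.
    """
    rows = []
    for ch in text:
        category = unicodedata.category(ch)
        name = unicodedata.name(ch, "<no name>")
        rows.append(("U+%04X" % ord(ch), category, name))
    return rows


if __name__ == "__main__":
    for label, text in (("A", STRING_A), ("B", STRING_B), ("C", STRING_C)):
        print("String", label)
        for row in describe_code_points(text):
            print("   ", *row)
```

A listing like this does not tell you whether a script mix makes linguistic sense, but it does make it obvious when a test string is built from code points that no one would ever type.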

Another example:

A = 0D42 65F9
B = 1111 1B4F

String A is a Malayalam character and a CJK ideograph -- two characters one would never really expect to be together.

String B is a Hangul character and an undefined codepoint -- again not a valid test.

CompareStringW returns that A<B and B<A if I pass in -1 as the lengths (the documentation states that "if this parameter is any negative value, the string is assumed to be null terminated and the length is calculated automatically"). But if I calculate the lengths of the strings myself and pass those in, then it works properly (A>B and B<A). Passing in the string lengths does not help the case above, however.

Well, this is a type of situation that really is a bug, something that I have been working to correct for future versions -- there simply are many cases where if you pass invalid data we handle it oddly, specifically between the -1 and cch cases (which are basically two different code paths).

The -1 case is designed to not require a string walk on the part of the caller (it literally plows through the string one sort element at a time and stops when it knows the answer). Any time the two calls give different results, it is technically a bug (one that I am charged with trying to fix! <grin>). The mitigation for the time being is that invalid input is required to give invalid results....
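One way to look for differences between the two code paths is to run the same pairs through both and collect the disagreements. The sketch below is an illustration under assumptions, not a test harness anyone on the team uses. `find_disagreements` and `utf16_length` are hypothetical helper names, the Windows-only wrappers call CompareStringW through `ctypes` with flags of 0 and LOCALE_USER_DEFAULT (both arbitrary choices), and strings with embedded NULs are out of scope because the two calls are expected to differ there.

```python
import sys


def find_disagreements(pairs, compare_a, compare_b):
    """Return (a, b, result_a, result_b) for each pair where the two
    comparison functions return different results."""
    disagreements = []
    for a, b in pairs:
        result_a = compare_a(a, b)
        result_b = compare_b(a, b)
        if result_a != result_b:
            disagreements.append((a, b, result_a, result_b))
    return disagreements


def utf16_length(text):
    """Length in UTF-16 code units, which is what cchCount expects."""
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


if sys.platform == "win32":
    import ctypes
    from ctypes import wintypes

    LOCALE_USER_DEFAULT = 0x0400

    _compare_string = ctypes.windll.kernel32.CompareStringW
    _compare_string.argtypes = [
        wintypes.LCID, wintypes.DWORD,
        wintypes.LPCWSTR, ctypes.c_int,
        wintypes.LPCWSTR, ctypes.c_int,
    ]
    _compare_string.restype = ctypes.c_int

    def compare_null_terminated(a, b):
        # The -1 path: the API finds the terminators itself.
        return _compare_string(LOCALE_USER_DEFAULT, 0, a, -1, b, -1)

    def compare_with_lengths(a, b):
        # The cch path: the caller supplies explicit lengths.
        return _compare_string(LOCALE_USER_DEFAULT, 0,
                               a, utf16_length(a), b, utf16_length(b))
```

On Windows, `find_disagreements(pairs, compare_null_terminated, compare_with_lengths)` would list the pairs where the two paths differ. Feed it pairs from a real word list first, since those are the disagreements that would actually matter.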

Now these ARE bugs. And I will look into them, at some point. But it is fair to say that invalid strings really are the last frontier. All of the meaningful bugs come first, though. Because any day where the only people I frustrate are the testers who do not understand what they are testing, I will have no problems looking in the mirror in the morning....

The key? If you want to test CompareString, do so with actual word lists -- made up of actual useful strings in the target languages. Take an article in a target language and the first 200 or 500 words from it. Or get a list from a dictionary. Or from customers. Never generate random word lists that do not match the rules of the language or of Unicode (thinking about those illegal characters!). Work to pass appropriate flags that make sense for the application and the API itself. Do not pass code points not included in the Unicode standard if you are expecting back meaningful results.
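A minimal sketch of the "take an article and use its first 200 words" approach. Assumptions: tokens are separated by whitespace (which does not hold for languages such as Thai, Chinese, or Japanese), repeated words are dropped (a design choice, not a requirement), and Python's `unicodedata` is used as a rough filter for unassigned, surrogate, and private use code points. The function names are hypothetical.

```python
import unicodedata

# General categories that a meaningful test word should not contain:
# Cn = unassigned, Cs = surrogate, Co = private use.
UNUSABLE_CATEGORIES = {"Cn", "Cs", "Co"}


def strip_punctuation(token):
    """Trim leading and trailing punctuation (categories P*) from a token."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def build_word_list(article_text, limit=200):
    """Collect up to `limit` distinct words, in order of first appearance."""
    words = []
    seen = set()
    for token in article_text.split():
        if len(words) >= limit:
            break
        word = strip_punctuation(token)
        if not word or word in seen:
            continue
        if any(unicodedata.category(ch) in UNUSABLE_CATEGORIES for ch in word):
            continue  # skip anything containing unusable code points
        seen.add(word)
        words.append(word)
    return words


if __name__ == "__main__":
    sample = "Der Fluß, der Fluß ist groß."
    print(build_word_list(sample, limit=10))
```

The resulting list is what you would then feed to CompareString, with the flags your application actually uses.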

And most importantly, know what you are testing. If you need to test what the API does to typical strings in your application to understand if it is the right API to call, then that is a good idea. But you do not need to test the API itself, unless Microsoft is paying you to do that. The API works, and the important question is whether or not it works for your scenarios.

Another day I will give a good example of a scenario where it does not return the best possible results, and where another API is best considered....

This post brought to you by "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S)
(which is treated as equal to "SS" on all platforms, so that German can use the default table with a ton of other languages....)

# Mike Dunn on 2 Feb 2005 10:36 AM:

>comparing random crap can give random crap results

Also known as "GIGO" :)

# Larry Osterman on 2 Feb 2005 10:58 AM:

I'd noticed that sharp-s and SS comparison worked in the neutral culture, I'd always wondered about that. Thanks.

# Michael Kaplan on 2 Feb 2005 11:03 AM:

Heh, I have had three different people ping me about the sharp-s issue *yesterday*, no idea why. :-)

I'll probably blog about it soon, maybe explain some insight to why things are weighed as they are....

# Michael Kaplan on 2 Feb 2005 12:24 PM:

I'll talk another day about a helpful function you can use to at least know if a string can be meaningfully compared. It does not handle the linguistic aspects of mixing scripts strangely, but it does handle the code points not in the sorting tables....

# Guille on 2 Feb 2005 12:57 PM:

Well... Personally I think this is a typical documentation issue. My thinking is that if a function is capable of returning inconsistent data due to invalid input, a message should be added to the function's doc, like "if the input data doesn't conform to this and that, the results are undefined". This, which may seem obvious to you, may not be obvious to everyone, and above all such a sentence draws attention to the fact that not *any* input data is necessarily valid, at least from the function's point of view. It is important to keep in mind that many developers who really need to use the operating system's provided function are not familiar with the details or structure of any language other than their own (or even aware of how many other languages are out there currently in use), so creating random 'Unicode strings' (while in fact they are only random word arrays) may seem their 'obvious' choice.

# Michael Kaplan on 2 Feb 2005 1:00 PM:

Yep, I'll talk about that idea as well....

# Brian on 2 Feb 2005 5:24 PM:

It seems to me that given any input, you should avoid claiming that both A < B and B < A. That this should return 0 and SetLastError() with an appropriate value. Otherwise, somebody's Maps or Sets are going to get really hosed up by the lack of total ordering, when what they really should be doing is rejecting the key. Unless there's a way for me to easily verify that a string is valid.

(my partially uninformed 2 cents)

Now flaunting my ignorance, why the heck does CompareString take its string arguments as DWORD's?

(interesting blog, btw)

# Michael Kaplan on 2 Feb 2005 5:31 PM:

> It seems to me that given any input, you should avoid claiming that both A < B and B < A.

Agreed -- and it is a bug. However, since it's not a meaningful string, it just has a lower priority.

> Unless there's a way for me to easily verify that a string is valid.

There is -- tune in later. :-)

> why the heck does CompareString take its string arguments as DWORD's

It does not -- it takes two LPWSTRs for its strings.

> (interesting blog, btw)

Thanks!

# Jonathan on 2 Feb 2005 11:35 PM:

From the CompareString MSDN link:

int CompareString( LCID Locale,
DWORD dwCmpFlags,
DWORD lpString1,
DWORD cchCount1,
DWORD lpString2,
DWORD cchCount2
);

I don't have the actual SDK with me at the moment, so I can't check the actual header.

# Michael Kaplan on 3 Feb 2005 1:09 AM:

This is a doc bug.

Both lpString1 and lpString2 should be LPCTSTR, and both cchCount1 and cchCount2 should be int.

I'll report this to the appropriate people....


referenced by

2006/01/12 The creation of sort keys does not always make sense

2005/06/28 The 'grammar' of identifiers

2005/05/05 A few of the gotchas of CompareString

2005/02/03 What makes a string meaningful?
