by Michael S. Kaplan, published on 2005/02/02 02:26 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/02/365251.aspx
Another reason why international test is not for amateurs....
Like they say at despair.com: "When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do."
It does not tend to be a problem on this team. But when other teams call our APIs, they somehow get it into their heads that, as a part of testing their component, they should test the API. And not understanding how the APIs work, they start building random Unicode strings and passing them to CompareString.
Now CompareString is an API that was built to handle actual linguistically meaningful strings, not whatever random crap is generated. And while I will not claim that such a process cannot find problems, I can claim that this is not the sort of core scenario that causes me to lose sleep at night the way genuine bugs that might affect customers will....
An example of this happened over a year ago, in the newsgroups:
I've found that with certain Unicode strings, CompareStringW seems to be acting very strangely - you get behavior like this:
Strangely is a relative term, especially in a case where you are randomly generating strings....
A < B, B < C, C < A
A < B, B < A
I will admit that both are not so great. But you have to understand how the collation data is created and what it represents.
The goal is to give a way to sort every part of the Unicode BMP (basic multilingual plane), according to some particular selected locale. Any time a code point is not usefully defined in the table (e.g. it is not defined in Unicode, it is not a language/script that Windows has useful data for, or it is intentionally not given weight), it will not give useful linguistic information.
In other words, comparing random crap can give random crap results. :-)
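The two symptoms reported above (a cycle across three strings, and a pair that is "less than" in both directions) can be written down as mechanical checks. What follows is only a minimal sketch: `find_inconsistencies` and the `rock_paper_scissors` stand-in comparator are hypothetical names made up for illustration, and the comparator is assumed to return a negative, zero, or positive number rather than the CSTR_* values.

```python
from itertools import combinations


def sign(n):
    """Collapse any number to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def find_inconsistencies(strings, compare):
    """Return (contradictions, cycles) found by a three-way comparator.

    A contradiction is a pair where compare(a, b) and compare(b, a) do not
    mirror each other. A cycle is a triple with a < b, b < c, and c < a.
    """
    contradictions = []
    for a, b in combinations(strings, 2):
        if sign(compare(a, b)) != -sign(compare(b, a)):
            contradictions.append((a, b))

    cycles = []
    for a, b, c in combinations(strings, 3):
        # Three items have two possible cyclic orders; check both.
        for x, y, z in ((a, b, c), (a, c, b)):
            if compare(x, y) < 0 and compare(y, z) < 0 and compare(z, x) < 0:
                cycles.append((x, y, z))
    return contradictions, cycles


def rock_paper_scissors(a, b):
    """Stand-in comparator (hypothetical) that is deliberately cyclic."""
    if a == b:
        return 0
    return -1 if (a, b) in {("A", "B"), ("B", "C"), ("C", "A")} else 1


if __name__ == "__main__":
    print(find_inconsistencies(["A", "B", "C"], rock_paper_scissors))
```

A check like this says nothing about whether the inputs were meaningful in the first place, which is the point of the rest of this post.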
These strings are randomly generated Unicode strings, so it may be that the problematic strings contain characters that are either unused or in certain parts of the Unicode space that are reserved (something similar to the private use space, maybe). So it may be that CompareStringW works fine for all real-world strings that we'd ever encounter. Still, it's a bit unsettling to see CompareStringW return values that are so obviously wrong.
See above. But I will plow through the examples too, below.
A specific example - all three calls to CompareStringW return CSTR_LESS_THAN:
A = 1B37 1D96 4516
B = 30FE 4113 67BE
C = 0747 4443 40E6
Are there any errors here? From what I can tell, the three strings are all legal (null-terminated) UTF-16 strings - they're not ill-formed.
Well, string A is two code points that are not in Unicode (and which thus have no weight) and an Extension A ideograph (no weight prior to XP, near the end of the table in XP and later).
String B starts with a Katakana iteration mark that affects the character before it and which would never start a string, another Extension A ideograph, and a standard CJK ideograph.
String C is made up of a Syriac letter and two more Extension A ideographs.
SUMMARY: All three are nonsense strings and nothing useful can come from testing with them.
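A rough way to run the same kind of triage on a generated string is to ask a Unicode character database whether each code point is assigned at all. The sketch below uses Python's `unicodedata.ucd_3_2_0`, a Unicode 3.2 snapshot. That is an assumption made for illustration, because it is the only historical version Python ships. It is not the data Windows uses, and a newer database will report some of these code points as assigned. The helper names are hypothetical.

```python
import unicodedata

# Assumption: the Unicode 3.2 snapshot stands in for "Unicode as of the
# time of the post". It is not the collation data used by CompareString.
OLD_UCD = unicodedata.ucd_3_2_0

STRING_A = [0x1B37, 0x1D96, 0x4516]
STRING_B = [0x30FE, 0x4113, 0x67BE]
STRING_C = [0x0747, 0x4443, 0x40E6]


def describe(code_points, ucd=OLD_UCD):
    """Return (hex, general category) pairs; 'Cn' means unassigned."""
    return [(f"{cp:04X}", ucd.category(chr(cp))) for cp in code_points]


def has_unassigned(code_points, ucd=OLD_UCD):
    """True if any code point is unassigned in the given database."""
    return any(ucd.category(chr(cp)) == "Cn" for cp in code_points)


if __name__ == "__main__":
    for name, cps in (("A", STRING_A), ("B", STRING_B), ("C", STRING_C)):
        print(name, describe(cps), "unassigned:", has_unassigned(cps))
```

Being assigned is necessary but not sufficient. A string like B passes this check and is still not something that would ever occur in real text.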
A = 0D42 65F9
B = 1111 1B4F
String A is a Malayalam character and a CJK ideograph -- two characters one would never really expect to be together.
String B is a Hangul character and an undefined codepoint -- again not a valid test.
CompareStringW returns that A<B and B<A if I pass in -1 as the lengths (the documentation states that "if this parameter is any negative value, the string is assumed to be null terminated and the length is calculated automatically"). But if I calculate the lengths of the strings myself and pass those in, then it works properly (A>B and B<A). Passing in the string lengths does not help the case above, however.
Well, this is a type of situation that really is a bug, something that I have been working to correct for future versions -- there simply are many cases where if you pass invalid data we handle it oddly, specifically between the -1 and cch cases (which are basically two different code paths).
The -1 case is designed to not require a string walk on the part of the caller (it literally plows the string one sort element at a time and stops when it knows the answer). Any time the two calls give different results, it is technically a bug (one that I am charged with trying to fix! <grin>). The mitigation for the time being is that invalid input is required to give invalid results....
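For anyone who wants to see whether the two code paths disagree on their own data, here is a hedged sketch of a differential check. `path_disagreements` and `utf16_length` are hypothetical helpers. `make_win32_compare` wraps the real `CompareStringW` through `ctypes` and is only usable on Windows. The default locale and a flags value of 0 are assumptions, so pass whatever your application actually uses. The strings must not contain embedded nulls, since the -1 path stops at the first one.

```python
import ctypes

CSTR_LESS_THAN, CSTR_EQUAL, CSTR_GREATER_THAN = 1, 2, 3
LOCALE_USER_DEFAULT = 0x0400


def make_win32_compare(locale=LOCALE_USER_DEFAULT, flags=0):
    """Return a callable around kernel32.CompareStringW (Windows only)."""
    fn = ctypes.windll.kernel32.CompareStringW
    fn.argtypes = [ctypes.c_uint32, ctypes.c_uint32,
                   ctypes.c_wchar_p, ctypes.c_int,
                   ctypes.c_wchar_p, ctypes.c_int]
    fn.restype = ctypes.c_int
    return lambda a, cch_a, b, cch_b: fn(locale, flags, a, cch_a, b, cch_b)


def utf16_length(s):
    """Length in UTF-16 code units, which is what the cch parameters count."""
    return len(s.encode("utf-16-le")) // 2


def path_disagreements(pairs, compare):
    """List the pairs where the -1 path and the explicit-length path differ."""
    found = []
    for a, b in pairs:
        implicit = compare(a, -1, b, -1)
        explicit = compare(a, utf16_length(a), b, utf16_length(b))
        if implicit != explicit:
            found.append((a, b, implicit, explicit))
    return found
```

On Windows the call would look something like `path_disagreements(pairs, make_win32_compare())`. Taking the comparator as a parameter also makes it possible to try the harness elsewhere with a stand-in.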
Now these ARE bugs. And I will look into them, at some point. But it is fair to say that invalid strings really are the last frontier. All of the meaningful bugs come first, though. Because any day where the only people I frustrate are the testers who do not understand what they are testing, I will have no problems looking in the mirror in the morning....
The key? If you want to test CompareString, do so with actual word lists -- made up of actual useful strings in the target languages. Take an article in a target language and use the first 200 or 500 words from it. Or get a list from a dictionary. Or from customers. Never generate random word lists that do not match the rules of the language or of Unicode (thinking about those illegal characters!). Work to pass appropriate flags that make sense for the application and the API itself. Do not pass code points not included in the Unicode standard if you are expecting back meaningful results.
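Pulling a word list out of an article can be as simple as the sketch below. The function names are hypothetical, and it assumes a language that separates words with whitespace, so it is not suitable as written for Chinese, Japanese, or Thai. It trims punctuation and symbols from the edges of each token, keeps the first few hundred distinct words in order of appearance, and can drop anything containing an unassigned, surrogate, or private use code point.

```python
import unicodedata


def strip_punctuation(token):
    """Trim punctuation and symbol characters from both ends of a token."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start])[0] in "PS":
        start += 1
    while end > start and unicodedata.category(token[end - 1])[0] in "PS":
        end -= 1
    return token[start:end]


def first_words(text, limit=200):
    """Return the first `limit` distinct words, in order of appearance."""
    seen, words = set(), []
    for token in text.split():
        word = strip_punctuation(token)
        if word and word not in seen:
            seen.add(word)
            words.append(word)
            if len(words) == limit:
                break
    return words


def only_assigned(words):
    """Keep words made up entirely of assigned, non-private-use characters."""
    excluded = ("Cn", "Cs", "Co")
    return [w for w in words
            if all(unicodedata.category(ch) not in excluded for ch in w)]


if __name__ == "__main__":
    sample = "Der Fuß, der Fuss; die Straße."
    print(only_assigned(first_words(sample, limit=3)))
```

The resulting list is what gets fed to the comparison, with the flags the application really uses.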
And most importantly, know what you are testing. If you need to test what the API does to typical strings in your application to understand if it is the right API to call, then that is a good idea. But you do not need to test the API itself, unless Microsoft is paying you to do that. The API works, and the important question is whether or not it works for your scenarios.
Another day I will give a good example of a scenario where it does not return the best possible results, and where another API is best considered....
This post brought to you by "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S)
(which is treated as equal to "SS" on all platforms, so that German can use the default table with a ton of other languages....)