A few of the gotchas of CompareString

by Michael S. Kaplan, published on 2005/05/05 02:12 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/05/414845.aspx


CompareString is one of the coolest APIs. I thought so even before I owned it, before I really even met the people who used to own it (or the woman who wrote it, for that matter).

But like any API, it can have its gotchas, its problems.

Now if you think of the NLS information as a huge database, then the Locale Identifier (a.k.a. the LCID) is its primary key. It is the very first parameter of CompareString.

If you are calling the non-Unicode version (CompareStringA), then rather than converting via the default system code page, it converts parameters via the default code page of the locale you pass in. Among other things, this means that you can't ever use CompareStringA to handle UTF-8 text.

OK, let's move on to the all-important second parameter, the one with the flags. I'll talk about each of the flags here:

NORM_IGNORECASE - Ignore case. A better name for this flag might have been IGNORE_TERTIARYWEIGHT since that is what it accomplishes (it masks the tertiary weight), although it is obviously too late to consider such a change. It can cause undesirable results when used in the comparison of strings containing characters that depend on that weight for vital information, which thankfully is a very small number of cases. But if you are not expecting "ʏ", "Y", and "y" (U+028f, U+0059, and U+0079, a.k.a. LATIN LETTER SMALL CAPITAL Y, LATIN CAPITAL LETTER Y, and LATIN SMALL LETTER Y) to all be equal, then you may want to think twice about throwing this flag into the mix. You will also lose the distinctions of the final forms for Hebrew (e.g. "מ" and "ם", U+05de and U+05dd, a.k.a. HEBREW LETTER MEM and HEBREW LETTER FINAL MEM), the positional forms for Arabic (e.g. "ش", U+0634, a.k.a. ARABIC LETTER SHEEN, and its isolated, final, initial, and medial forms "ﺵ", "ﺶ", "ﺷ", and "ﺸ" at U+feb5, U+feb6, U+feb7, and U+feb8), and similar distinctions in other languages.

NORM_IGNORENONSPACE - Ignore nonspacing characters. A better name for this flag might have been IGNORE_SECONDARYWEIGHT since that is what it accomplishes (it masks the secondary weight). It can cause undesirable results when used in the comparison of strings containing characters that depend on that weight for vital information. The most visible example of this is in Korean, where U+ac00 (가, Hangul Syllable Kiyeok A) can suddenly be considered equivalent to all of the following characters: 伽 佳 假 價 加 可 呵 哥 嘉 嫁 家 暇 架 枷 柯 歌 珂 痂 稼 苛 茄 街 袈 訶 賈 跏 軻 迦 駕 仮 傢 咖 哿 坷 宊 斝 榎 檟 珈 笳 耞 舸 葭 謌. For the rest of the Hangul syllables, some are better and some are worse, and the problem exists in other languages as well.
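
To make the "masking a weight" idea behind NORM_IGNORECASE and NORM_IGNORENONSPACE a little more concrete, here is a minimal sketch in portable C. Everything in it is hypothetical: the toy_weight struct, the TOY_* flags, and toy_compare are made up for illustration and are not the real Windows sort key layout or any NLS API. It only shows how ignoring one weight level makes characters that differ solely at that level compare as equal.

```c
#include <stdint.h>

/* Hypothetical three-level weight for one character.
   This is NOT the real Windows collation data format. */
typedef struct {
    uint16_t primary;    /* base letter                 */
    uint8_t  secondary;  /* diacritic-style differences */
    uint8_t  tertiary;   /* case-style differences      */
} toy_weight;

/* Hypothetical flags standing in for the real ones. */
enum {
    TOY_IGNORE_SECONDARY = 1, /* plays the role of NORM_IGNORENONSPACE */
    TOY_IGNORE_TERTIARY  = 2  /* plays the role of NORM_IGNORECASE     */
};

/* Compare level by level, skipping any level the flags mask out.
   Returns -1, 0, or 1. */
int toy_compare(toy_weight a, toy_weight b, unsigned flags)
{
    if (a.primary != b.primary)
        return a.primary < b.primary ? -1 : 1;
    if (!(flags & TOY_IGNORE_SECONDARY) && a.secondary != b.secondary)
        return a.secondary < b.secondary ? -1 : 1;
    if (!(flags & TOY_IGNORE_TERTIARY) && a.tertiary != b.tertiary)
        return a.tertiary < b.tertiary ? -1 : 1;
    return 0;
}
```

If three characters share a primary and secondary weight and differ only in the tertiary weight (the situation described for "ʏ", "Y", and "y"), toy_compare reports them as different with no flags and as equal once TOY_IGNORE_TERTIARY is passed.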

NORM_IGNORESYMBOLS - Ignore symbols such as "_", "#", and "*". The list of symbols is "increased" when SORT_STRINGSORT is specified, since punctuation is then also treated as symbols. This is often useful but can wreak havoc if you are searching for things like C++ or C#.

SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:

//
//  Sorting Flags.
//
//    WORD Sort:    culturally correct sort
//                  hyphen and apostrophe are special cased
//                  example: "coop" and "co-op" will sort together in a list
//
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        co-op     <-------  hyphen (punctuation)
//                        cork
//                        went
//                        were
//                        we're     <-------  apostrophe (punctuation)
//
//
//    STRING Sort:  hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <-------  hyphen (punctuation)
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we're     <-------  apostrophe (punctuation)
//                        went
//                        were
//
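
The two orderings in that header comment can be imitated with a toy comparator. The sketch below is an assumption-laden model, not the actual algorithm: the function names are hypothetical, it only handles plain ASCII, and the "string sort" is simply byte order, which happens to put the apostrophe, hyphen, and underscore ahead of the lowercase letters. The "word sort" skips hyphens and apostrophes on the first pass and only uses them to break ties.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The two characters a WORD sort special-cases. */
int is_word_punct(char c)
{
    return c == '-' || c == '\'';
}

/* Toy STRING sort: every character counts, in plain byte order. */
int toy_string_sort_cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Toy WORD sort: compare while skipping hyphens and apostrophes;
   if that is a tie, the string with fewer skipped characters wins. */
int toy_word_sort_cmp(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    const char *p = x, *q = y;
    size_t lx, ly;

    for (;;) {
        while (is_word_punct(*p)) ++p;
        while (is_word_punct(*q)) ++q;
        if (*p != *q || *p == '\0')
            break;
        ++p;
        ++q;
    }
    if (*p != *q)
        return (unsigned char)*p < (unsigned char)*q ? -1 : 1;

    /* Tie once punctuation is skipped: shorter string first. */
    lx = strlen(x);
    ly = strlen(y);
    if (lx != ly)
        return lx < ly ? -1 : 1;
    return strcmp(x, y);
}
```

Passing either comparator to qsort over an array of the nine words from the header comment reproduces the corresponding listing: with toy_word_sort_cmp, coop and co-op end up next to each other, and with toy_string_sort_cmp, co-op and co_op do.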

NORM_IGNOREKANATYPE - Do not differentiate between Hiragana and Katakana characters. Corresponding Hiragana and Katakana characters compare as equal (e.g. "げ" U+3052 HIRAGANA LETTER GE versus "ゲ" U+30B2 KATAKANA LETTER GE). Calling LCMapString with the LCMAP_HIRAGANA or the LCMAP_KATAKANA flag on both strings would flatten the comparison in an analogous manner. There are many times when the distinction is important: the two scripts are used in different situations, so a search that treats them as the same may often give unexpected results.
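
The "flatten both strings, then compare" idea can be sketched in portable C. This is only an illustration under stated assumptions: toy_to_katakana and toy_equal_ignoring_kana are hypothetical names, the mapping covers just the basic Hiragana letters U+3041 through U+3096 (which sit exactly 0x60 below their Katakana counterparts), and it is not a description of what LCMapString or CompareString do internally.

```c
#include <stddef.h>
#include <uchar.h>

/* Map a basic Hiragana letter to the corresponding Katakana letter;
   leave every other code unit untouched. */
char16_t toy_to_katakana(char16_t c)
{
    return (c >= 0x3041 && c <= 0x3096) ? (char16_t)(c + 0x60) : c;
}

/* Returns 1 if the two zero-terminated UTF-16 strings are identical
   once both have been flattened to Katakana, 0 otherwise. */
int toy_equal_ignoring_kana(const char16_t *a, const char16_t *b)
{
    for (; *a && *b; ++a, ++b)
        if (toy_to_katakana(*a) != toy_to_katakana(*b))
            return 0;
    return *a == *b; /* equal only if both strings ended together */
}
```

With this sketch, "げ" (U+3052) and "ゲ" (U+30B2) compare as equal, which is exactly the distinction you give up when the flag is set.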

NORM_IGNOREWIDTH - Do not differentiate between the halfwidth and fullwidth forms of characters. These two forms exist in Unicode for the sake of backward compatibility with legacy CJK standards that encoded the two forms. In those legacy standards, the halfwidth forms used one byte while the fullwidth forms used two bytes, and by convention the fullwidth glyph was twice as wide (e.g. "ヲ", U+30F2 KATAKANA LETTER WO, versus "ｦ", U+FF66 HALFWIDTH KATAKANA LETTER WO). Calling LCMapString with the LCMAP_FULLWIDTH or the LCMAP_HALFWIDTH flag on both strings would flatten the comparison in an analogous manner. Each form is still often chosen deliberately for the sake of appearance or functionality, so while the initial purpose was for those legacy standards, modern usage is a bit more reasoned (for example, properties in Japanese Access are fullwidth, while the descriptive string in the property sheet often uses the halfwidth string as it has a preferred appearance).

The third and fifth parameters are the actual strings being compared.

And then finally, the fourth and sixth parameters give the lengths of those strings, in UTF-16 code units (not code points).
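
Since the unit of those length parameters is easy to get wrong, here is a small portable C sketch of the difference between counting UTF-16 code units and counting code points. The helper names are hypothetical and nothing here calls the Windows API; it only shows that a character outside the Basic Multilingual Plane occupies two 16-bit units (a surrogate pair), so the two counts can disagree.

```c
#include <stddef.h>
#include <uchar.h>

/* Number of 16-bit code units before the terminating zero. */
size_t utf16_code_units(const char16_t *s)
{
    size_t n = 0;
    while (s[n])
        ++n;
    return n;
}

/* Number of code points: a high surrogate followed by a low
   surrogate is counted once. */
size_t utf16_code_points(const char16_t *s)
{
    size_t n = 0;
    size_t i;
    for (i = 0; s[i]; ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i; /* skip the low half of the pair */
        ++n;
    }
    return n;
}
```

For the string "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), utf16_code_units returns 3 while utf16_code_points returns 2.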

Now for actual usage, the intent is clear: through the use of meaningful strings that have defined weights in the Windows collation tables, developers have the opportunity to get back linguistically appropriate results. When you veer outside of this realm, you may not get the results you (or your users) are expecting. And as the info about flags above really indicates, the indiscriminate use of flags here is a really bad idea that can lead to non-intuitive results.

Now what would be intuitive? In my opinion the following approach is best:

  1. Passing potentially destructive flags with the API call, which will produce more search results
  2. Calling again without the flags, to get the smaller and more specific list
  3. Using these "preferred results" from #2 to prioritize #1 in any type of search list

We could call this the "Google" principle -- the large search list is impressive not because many choices would need review, but because the most relevant items are near the top and you seldom need to look at the full list. I would highly recommend such an approach, to go along with the versioning issues that I have discussed in the past. Such an approach can give you intuitive results while minimizing confusing result sets.
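
The three steps above can be sketched in portable C. The two comparison functions are stand-ins I made up for illustration: strict_equal plays the role of a comparison with no flags, and loose_equal (a plain ASCII case-insensitive match) plays the role of a comparison with a potentially destructive flag such as NORM_IGNORECASE. In real code both would presumably be calls to CompareString with different flag sets; ranked_search is likewise a hypothetical helper.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a comparison with no flags. */
int strict_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Stand-in for a comparison with an "ignore" flag:
   ASCII case-insensitive equality. */
int loose_equal(const char *a, const char *b)
{
    for (; *a && *b; ++a, ++b)
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
    return *a == *b;
}

/* Writes every loose match for needle into out, with the strict
   ("preferred") matches first. out must have room for n pointers.
   Returns the number of matches written. */
size_t ranked_search(const char *needle, const char *const *items,
                     size_t n, const char **out)
{
    size_t count = 0;
    size_t i;

    /* The smaller, more specific list goes to the top. */
    for (i = 0; i < n; ++i)
        if (strict_equal(needle, items[i]))
            out[count++] = items[i];

    /* The rest of the larger, flag-based list follows. */
    for (i = 0; i < n; ++i)
        if (!strict_equal(needle, items[i]) && loose_equal(needle, items[i]))
            out[count++] = items[i];

    return count;
}
```

Searching for "Resume" in the list "resume", "Resume", "RESUME", "other" returns three matches with the exact "Resume" first, followed by "resume" and "RESUME".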

Now there are more issues that I could discuss, but I thought they might wait until another day....

 

This post brought to you by "ʏ" (U+028f, a.k.a. LATIN LETTER SMALL CAPITAL Y)

