Custom Case Mappings?

by Michael S. Kaplan, published on 2006/05/26 05:15 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/26/607881.aspx


George asked me via the contact link:

I was reading on MSDN from a topic titled 'Custom Case Mappings and Sorting Rules', and I am still not sure I understand what a 'custom case mapping' is. Can you help explain this?

I think that the topic George was referring to is .NET Framework Developer's Guide: Custom Case Mappings and Sorting Rules. After reading this topic, all I can say is that it is not surprising that he found it a bit confusing. :-)

Clearly the intent is to show some of the ways that cultures customize behavior surrounding collation and casing. And the target audience seems to be primarily native English speakers who have not had previous exposure to the many behaviors that most other languages provide in one form or another.

In this context, a custom case mapping is any mapping that is not the standard one -- which makes the term a very unfortunate one, since in the .NET Framework the word custom has an entirely different meaning, especially in the area of globalization where custom cultures have been provided.

Over 2/3 of the topic tries to explain Turkic casing (though it calls it Turkish casing and then mentions that it also applies to Azeri Latin rather than introducing the word 'Turkic' which I believe also applies to other languages not yet supported explicitly by locales on Windows).
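For anyone who has not met Turkic casing before, here is a minimal Python sketch of the difference. The helper names turkic_upper and turkic_lower and the two small tables are hypothetical, made up for illustration; Python's built-in casing is not locale-aware, and this is not a claim about how the .NET Framework implements it.

```python
# Hypothetical mapping tables covering only the dotted/dotless "i" letters.
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}  # i -> İ, ı -> I
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}  # I -> ı, İ -> i


def turkic_upper(text: str) -> str:
    """Uppercase using the Turkic pairs, default casing for everything else."""
    return "".join(TURKIC_UPPER.get(ch, ch.upper()) for ch in text)


def turkic_lower(text: str) -> str:
    """Lowercase using the Turkic pairs, default casing for everything else."""
    return "".join(TURKIC_LOWER.get(ch, ch.lower()) for ch in text)


print("file".upper())        # FILE  (default casing)
print(turkic_upper("file"))  # FİLE  (dotted capital I, U+0130)
print(turkic_lower("FILE"))  # fıle  (dotless small i, U+0131)
```

The point is simply that "file" and "FILE" stop being a case pair once Turkic rules are in effect, which is the situation the topic spends most of its length on.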

Of the remaining four paragraphs, two of them mention "other custom case mappings and sorting rules that you should be aware of when performing string operations" but then go on to talk about compressions in collation without calling them out explicitly.

In one case, it talks about the Microsoft convention for compressions in sorting but actually mistakes it for yet another situation of custom case mappings, which is not what anyone who understands the issue would call them:

The alphabets of nine cultures in the ASCII range (Unicode 0000- Unicode 007F) contain two-letter pairs where the result of a case-insensitive comparison, such as String.Compare, does not evaluate to equal when the case is mixed. These cultures are "hr-HR" (Croatian in Croatia), "cs-CZ" (Czech in the Czech Republic), "sk-SK" (Slovak in Slovakia), "da-DK" (Danish in Denmark), "nb-NO" (Norwegian (Bokmal) in Norway), "nn-NO" (Norwegian (Nynorsk) in Norway), "hu-HU" (Hungarian in Hungary), "vi-VN" (Vietnamese in Vietnam) and "es-ES" (Spanish in Spain) using the traditional sort order. For example, in the Danish language, a case-insensitive comparison of the two-letter pairs aA and AA is not considered equal. In the Vietnamese alphabet, a case-insensitive comparison of the two-letter pairs nG and NG is not considered equal. Although you should be aware that these rules exist, in practice, it is unusual to run into a situation where a culture-sensitive comparison of these pairs creates problems because they are uncommon in fixed strings or identifiers.
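To make the Danish example in that passage concrete, here is a rough Python sketch of why this is a compression issue and not a casing one. It assumes that the spellings "aa", "Aa" and "AA" are treated as a single sort unit while "aA" is not; that assumption, the function names, and the "<aa>" placeholder token are all mine for illustration and are not taken from any Windows or .NET table.

```python
# Assumption: these spellings form one unit; the lower-then-upper "aA" does not.
COMPRESSIONS = ("aa", "Aa", "AA")


def danish_units(text: str) -> list:
    """Split text into case-folded sort units, honoring the compression."""
    units = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in COMPRESSIONS:
            units.append("<aa>")  # placeholder for the single compressed unit
            i += 2
        else:
            units.append(text[i].lower())
            i += 1
    return units


def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings unit by unit after case folding."""
    return danish_units(a) == danish_units(b)


print(danish_units("AA"))               # ['<aa>']
print(danish_units("aA"))               # ['a', 'a']
print(equal_ignoring_case("aa", "AA"))  # True
print(equal_ignoring_case("aA", "AA"))  # False
```

Nothing about the case mapping of "a" changed here. The two strings differ because one is a single unit and the other is two.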

It then goes on to explain in a slightly provincial way about exceptions without giving any real examples (or pointing out that several of the previous 'compression' languages also have exceptions):

The alphabets of six cultures within the ASCII range have standard casing rules, but different sorting rules. These cultures are "et-EE" (Estonian in Estonia), "fi-FI" (Finnish in Finland), "hu-HU" (Hungarian in Hungary) using the technical sort order, "lt-LT" (Lithuanian in Lithuania), "sv-FI" (Swedish in Finland), and "sv-SE" (Swedish in Sweden). For example, in the Swedish alphabet, the letter w sorts as if it is the letter v. In application code, sorting operations tend to be used less frequently than equality comparisons and therefore are less likely to create problems.
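Since the passage gives only the one-line Swedish example, here is a small Python sketch of what "w sorts as if it is the letter v" could look like as a sort key. The function name swedish_style_key and the tie-breaking choice are assumptions for illustration only, not the actual Windows sort tables.

```python
def swedish_style_key(word: str) -> tuple:
    """Sort key where w is ordered together with v (illustrative only)."""
    folded = word.lower()
    primary = folded.replace("w", "v")  # w shares v's position in the order
    return (primary, folded)            # tie-break keeps v and w distinct


words = ["wb", "va", "vc", "xa"]

print(sorted(words))                         # ['va', 'vc', 'wb', 'xa']
print(sorted(words, key=swedish_style_key))  # ['va', 'wb', 'vc', 'xa']
```

Note that "wb" and "vb" still produce different keys, so equality is unaffected and only the ordering changes.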

Of course there is also the specious claim about ASCII since out of all of the languages mentioned, exactly ZERO of them are limited to ASCII even in their standard exemplar characters.

The penultimate paragraph tries to reassure those target English-speaking developers a bit:

An additional 35 cultures have custom case mappings and sorting rules outside of the ASCII range. These rules are generally confined to the alphabets used by those specific cultures. Therefore, the likelihood of them causing problems is low.

And then it ends on a final misleading note:

For details about the custom case mappings and sorting rules that apply to specific cultures, see The Unicode Standard at www.unicode.org.

Since the Unicode Standard does not really refer to any of these concepts by these terms either, this note is probably not entirely helpful. But it once again tries to reassure native English-speaking developers that they need not fear globalization.

The final coda is a set of pointers to culture-insensitive work (though of course it undermines the message that there is nothing to fear in cultures by telling people how to do without them!).

So George, for the most part I would recommend ignoring this article. Its viewpoint is provincial, its terminology is non-standard, its conclusions are misleading, and its examples are wasteful of space.

Every topic hinted at there and more is covered much more effectively in this blog. :-)

 

This post brought to you by "ோ" (U+0bcb, a.k.a. TAMIL VOWEL SIGN OO)
(One of those pesky 'non-ASCII' languages that you don't have to worry about!)

