by Michael S. Kaplan, published on 2005/07/17 11:25 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/17/439742.aspx
The other day, a developer named Stephanie sent me an email about compressions. These are used in collation when two or more characters are given a single sort weight; the Unicode Collation Algorithm calls its analogous construction a contraction, in part to avoid confusion with other meanings of the term compression that are described in Unicode. She had just read Dr. International's description of the difference between Traditional and Modern Spanish here, and asked:
I did some experimentation and found that I saw the described results for CH, Ch, and ch, although the article only mentions CH. In any case, cH is not included. Can you explain these two discrepancies?
Also, why wouldn't one of these be an alternate sort?
Stephanie, you are right -- for every compression we define in a cased script, we handle the UU, UL, and LL forms (both letters uppercase, initial capital only, both letters lowercase), but we skip the LU form.
This was originally a point of confusion for me as well, but Cathy Wissink set me straight back in the early days. She pointed out that words may be ALL CAPS, all lowercase, or Initial caps, but in most languages there is no pattern that puts a capital letter in the middle of text that is otherwise not capitalized. The convention we use for compressions is designed to take this reality into account: it handles the expected cases and discards the one that is unexpected.
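To make the three handled forms concrete, here is a minimal Python sketch that lists the case forms of a two-letter compression. The function name `compression_forms` is made up for illustration; it is not how Windows stores or looks up compressions, only a way to enumerate the forms described above.

```python
def compression_forms(pair: str) -> list[str]:
    """Return the UU, UL, and LL case forms of a two-letter compression.

    Hypothetical helper for illustration only; the LU form (e.g. "cH")
    is deliberately left out, matching the convention described above.
    """
    first, second = pair[0], pair[1]
    return [
        first.upper() + second.upper(),  # UU, e.g. "CH"
        first.upper() + second.lower(),  # UL, e.g. "Ch"
        first.lower() + second.lower(),  # LL, e.g. "ch"
    ]


print(compression_forms("ch"))  # ['CH', 'Ch', 'ch']
```

Under this sketch, `"cH"` never appears in the result, so it would be treated as two separate letters.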
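The effect on ordering can be sketched with a toy sort key. This is only an illustration under stated assumptions: the weights are invented (code points of the lowercased letters, with the compression placed halfway between "c" and "d"), case is ignored entirely, and `toy_sort_key` and `CH_FORMS` are hypothetical names, not part of any Windows API.

```python
# Hypothetical toy table: only the three expected case forms count as
# the "ch" compression; the LU form "cH" is not listed.
CH_FORMS = {"CH", "Ch", "ch"}


def toy_sort_key(word: str) -> tuple[float, ...]:
    """Build a toy key in which "ch" gets one weight between "c" and "d"."""
    weights = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in CH_FORMS:
            # The two characters share a single (invented) sort weight.
            weights.append(ord("c") + 0.5)
            i += 2
        else:
            # Every other character is weighted by its lowercase code point.
            weights.append(float(ord(word[i].lower())))
            i += 1
    return tuple(weights)


words = ["dama", "chico", "cuna", "cHico"]
print(sorted(words, key=toy_sort_key))
# ['cHico', 'cuna', 'chico', 'dama']
```

In this toy model "chico" lands after "cuna" because "ch" is weighted as a single unit after "c", while "cHico" is read as "c" followed by "h" and so stays among the ordinary "c" words.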
The Dr. International article isn't wrong here, though. I will often speak of a compression by just naming the one form when I mean all three forms; it is just a convenient way to express what compressions exist for a language, or a particular sort within a language.
As to your final concern, I agree with you -- there ought to be an alternate sort used here. I actually even pointed this out in the past (described here). The truth is that alternate sorts did not exist then. They were added specifically in the postmortem over handling this issue with Spanish!
This post brought to you by "ש" (U+05e9, a.k.a. HEBREW LETTER SHIN)
referenced by
2010/03/09 Coloring outside the lines in the a-ness of the Hungarian Technical Sort
2008/01/25 On reversing the irreversible (grabbing the data, part I)
2007/09/16 A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts)
2007/06/18 If you don't always preserve case, you don't always preserve meaning
2007/01/06 Sorting The Old New Thing All Out
2006/05/26 Custom Case Mappings?
2006/04/27 The disunification of Norwegian and Danish sorting
2005/11/26 Technically it *is* a hungarian sort