A Microsoft convention for compressions in sorting

by Michael S. Kaplan, published on 2005/07/17 11:25 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/17/439742.aspx


The other day, a developer named Stephanie sent me an email about compressions (these are used in collation when two or more characters are given a single sort weight -- the Unicode Collation Algorithm calls its analogous construction a contraction, in part to avoid confusion with other meanings of the term compression that are described in Unicode). She had just read Dr. International's description of the difference between Traditional and Modern Spanish here, and asked:

I did some experimentation and found that I saw the described results for CH, Ch, and ch, although the article only mentions CH. In any case, cH is not included. Can you explain these two discrepancies?

Also, why wouldn't one of these be an alternate sort?

Stephanie, you are right -- for every compression we define for a cased script, we handle the UU, UL, and LL forms (CH, Ch, and ch in this case), but we skip the LU form (cH).

This was originally a point of confusion for me as well, but Cathy Wissink set me straight back in the early days when she pointed out to me that words may be ALL CAPS, or all lowercase, or Initial caps, but there is in most languages not a pattern that has capital letters in the middle of text that is not capitalized. The convention we use for compressions is designed to take this reality into account and handle the expected cases while discarding the one that is unexpected.
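For readers who like to see this kind of thing spelled out, here is a rough Python sketch of the convention. It is only a toy model and not how Windows actually builds sort keys: the names case_forms, CH_FORMS, and toy_sort_key are made up for illustration, only primary weights are modeled, and the weights themselves are just code points with a tie-breaker that places the compression after a plain "c".

```python
# Toy model only: NOT the real Windows sort key algorithm.
# case_forms, CH_FORMS and toy_sort_key are hypothetical names.


def case_forms(pair):
    """Return the UU, UL and LL forms of a two-letter compression.

    The LU form (for example "cH") is deliberately left out.
    """
    first, second = pair[0], pair[1]
    return {
        first.upper() + second.upper(),  # UU, e.g. "CH"
        first.upper() + second.lower(),  # UL, e.g. "Ch"
        first.lower() + second.lower(),  # LL, e.g. "ch"
    }


CH_FORMS = case_forms("ch")


def toy_sort_key(word):
    """Build a primary-weight-only key where CH/Ch/ch sorts between c and d."""
    key = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in CH_FORMS:
            # One sort element: after every plain "c", before "d".
            key.append((ord("c"), 1))
            i += 2
        else:
            # Ordinary letter; case is ignored at this (primary) level.
            key.append((ord(word[i].lower()), 0))
            i += 1
    return key


# "chico" is treated as starting with the single element "ch".
print(sorted(["chico", "cuna", "dama", "casa"], key=toy_sort_key))
# → ['casa', 'cuna', 'chico', 'dama']

# "cHico" is the LU form, so it is just "c" followed by "h".
print(sorted(["cHico", "cuna", "casa"], key=toy_sort_key))
# → ['casa', 'cHico', 'cuna']
```

Note that in this sketch the three supported forms all produce the same key, while the LU form falls through to the ordinary per-letter path, which is the behavior Stephanie observed.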

The Dr. International article isn't wrong here, though. I will often speak of a compression by just naming the one form when I mean all three forms; it is just a convenient way to express what compressions exist for a language, or a particular sort within a language.

As to your final concern, I agree with you -- there ought to be an alternate sort used here. I actually even pointed this out in the past (described here). The truth is that alternate sorts did not exist then. They were added specifically in the postmortem over handling this issue with Spanish!

 

This post brought to you by "ש" (U+05e9, a.k.a. HEBREW LETTER SHIN)


# Maurits [MSFT] on 18 Jul 2005 3:08 PM:

"there is in most languages not a pattern that has capital letters in the middle of text that is not capitalized"

I can think of two counter-examples off-the-cuff...

1. Ronald McDonald <-- cD
2. random-capitalization for password complexification

The first is not much of a problem as combining characters that happen to meet in this fashion should probably *not* be combined as they are from separate semantic objects.

The second is more of a problem but is mitigated somewhat by the "why would you need to sort passwords anyway" question. Unless you're writing a crypt() function.

# Michael S. Kaplan on 18 Jul 2005 3:15 PM:

ah, the #2 case is not meaningful for us, and the #1 case I would argue that the intent was not to treat the two chars as a sort element (if there were a CD compression, that is; no one has one now).

Now if some language wanted it, we could always add it for that language. We just don't have any right now.... :-)

# Maurits [MSFT] on 20 Jul 2005 12:40 PM:

Suppose I was a developer in Spain, tasked with creating a phone directory. In .NET, of course. I use the CompareString culture options to implement a Spanish-sensitive sort, in particular sorting ch as a unique element between c and d.

CA CB ... CE CF CG CI CJ CK ... CZ
CH
D

cA cB ... cE cF cG cH cI cJ cK ... cZ
d

Fine. I complete the work and go on my merry way.

After I'm long gone, a shipment of Scottish soldiers arrives at the UK's port of Gibraltar. This ship is full of people with last names like McHenry, McTavish, McDonald, ...

Some of these lads decide to leave Gibraltar and intermingle with the native señoritas, marrying and starting families. And getting phone numbers, and listings in the directory I created.

All very well and good. My algorithm even sorts McHenry correctly between McDonald and McTavish!

But then one day someone decides to uppercase all the data...

Pobre Sra. McHenry! She now sorts differently...

McDonald
McHenry
McTavish

vs.

MCDONALD
MCTAVISH
MCHENRY

¡Ay, dios mio!

# Michael S. Kaplan on 20 Jul 2005 2:20 PM:

Well, this was indeed a colorful spelling out of the underlying scenario. :-)


