A&P of Sort Keys, part 6 (aka Relax, be calm, and deCOMPRESS if you are feeling out of sorts)

by Michael S. Kaplan, published on 2007/09/16 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/16/4937196.aspx

Previous posts in this series:

Part 0: The empty string sorts the same in every language
Part 1: The law of the letter -- e.g. Latin < Greek < Cyrillic
Part 2: The string that won? Didn't have a mark on him!
Part 3: Should you let a string make it's case? If so, Y?
Part 4: It isn't a race but let's make an EXCEPTION and cross the Finnish line
Part 5: EXPANSIONing your horizons

In that most recently posted Part 5, the core idea was a feature that allowed one code point to be treated as two (or effectively more) sort elements -- what normal people think of as characters.

This post is going to take things in the opposite direction, and talk about the way to get two or more code points to be thought of as a single element.

At Microsoft we call this feature compressions, and we basically keep per unique sort COMPRESSION tables.

As names go, this one is kind of unfortunate; in the Unicode Collation Algorithm they are instead called contractions, which in my opinion is a better name -- since if there is anything one wants to be able to do with sort keys, it is usually to compress them so they take up less space!

On the other hand we don't provide any functionality or documentation or even hints on how one might compress sort keys, so perhaps the terminological overlap is not too much of a worry. :-)

Now as I mentioned yesterday, COMPRESSIONS and EXPANSIONS are mutually exclusive -- if you are in the code path of one, you won't be in the code path of another. This is an unfortunate side effect of the implementation that is being tracked as a bug since it would help a few interesting side cases get fixed if it was not true....

Also, as long as we are collecting COMPRESSION factoids, we have specific convention for how they work with casing, as I mentioned in A Microsoft convention for compressions in sorting -- they can be UU, LL, or UL, but not LU....

Oh, and there is also the fact that we have no "default table" compressions. Now the reason for this is performance -- the code path that uses compressions is slightly slower, so making people whose languages do not have any of these things work a little slower just so languages they don't use sort more intuitively for people who aren't them. :-)

Though as I think about, it would be cool to have an option for a "DEFAULT PLUS" table that included all of the language-specific compressions that did not conflict with each other. With that option most of the locale-specific compressions would go away!

But anyway, let's look at some examples, shall we?

Looking first at the default table:

en-US a 0e 02 01 01 01 01 00
en-US aa 0e 02 0e 02 01 01 01 01 00
en-US å 0e 02 01 1a 01 01 01 00
en-US b 0e 09 01 01 01 01 00
en-US z 0e a9 01 01 01 01 00

And then compare it to Danish (but not Norwegian, due to The disunification of Norwegian and Danish sorting!):

da-DK a 0e 02 01 01 01 01 00
da-DK aa 0e b1 01 03 01 01 01 00
da-DK å 0e b1 01 01 01 01 00
da-DK b 0e 09 01 01 01 01 00
da-DK z 0e a9 01 01 01 01 00

See how aa is being put over at the end of the alphabet, after z? And how it is almost identical to å but with a little extra diacritic weight?

Which I guess points out another factoid -- you can include all of the appropriate case and diacritic weights you want to as well, in the COMPRESSION table entries.

Sometimes, the compression table entries can get quite complicated -- such as some of the Indic scripts, which have at times 4-to-1 or even 5-to-1 compressions. Or Tibetan, which even has some 8-to-1 compressions in it. Though in most languages there are not too many of them.

Another interesting thing you can do is one that we use for languages like Tamil that have a Virama that they use to suppress the inherent vowel, when they consider the letter with the Virama to sort before the letter without it (as I discuss in And then there is the Virama....). You can treat the letter + Virama as one letter, and make it come before the letter without the virama. Like so (in Vista):

ta-IN க் U+0b95 U+0bcd 37 26 01 01 01 01 00
ta-IN க U+0b95 37 28 01 01 01 01 00

With the COMPRESSION feature, such support is easy to deliver (the opposite case where they combine is also easy, but there are other ways that could be made to work).

Which reminds me, I have to point out where we are on the size front:

As described in Part 0, every sort key has a fixed overhead of five bytes per sort key;
According to Part 1, most of Unicode will give you two bytes of primary weight per character;
As described in Part 2, no matter how many diacritics you pile on a character, it will never add more than one byte per character;
As also described in Part 2, the need to pad in order to have the diacritic weights line up can actually add one additional byte per character preceding one with a diacritic.
As described in Part 3, case can work the same way as the last point -- needing padding to let the case weights line up.
As described in Part 4, EXPANSIONS cause the size of a Unicode weight to be between 2 and 3 times bigger (thus 4 or 6 bytes instead of 2 per letter).
As described in this part, COMPRESSIONS can actually shrink the sort key from any of the previous discussions -- in the most extreme cases treating eight code point as if they were a single letter.

And there you have it. Stay tuned for the next exciting collation feature....

This post brought to you by 5 (U+0035, DIGIT FIVE)

# kevinowen on 16 Sep 2007 11:07 AM:

Is it intentional that this post isn’t sponsored by U+0036?

# Michael S. Kaplan on 16 Sep 2007 11:42 AM:

I wanted to see if anyone would notice. :-)

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2010/09/13 Olive, the other reindeer, gets to Sort it all Out too....

2009/09/14 What is impossible for Microsoft can be simply undesireable for Unicode

2008/08/21 A&P of Sort Keys, part 14: The Hangul is really getting OLD

2008/01/25 On reversing the irreversible (grabbing the data, part I)

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

2007/10/08 A&P of Sort Keys, part 12 (aka Han sorts first!)

2007/09/24 A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

2007/09/21 A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

2007/09/20 A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

2007/09/18 A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)

2007/09/17 A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you)

go to newer or older post, or back to index or month or day