A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line)

by Michael S. Kaplan, published on 2007/09/14 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/14/4905625.aspx


Previous posts in this series:

So far, I have been using the default table for every sort key.

And I don't want to knock the default table -- it is good enough for like 80 different locales on Vista.

But of course that does mean there are quite a few that it is not good enough for....

The way this works is simple -- every locale in Windows has a pointer to an EXCEPTION table, a pointer that for about 80 of them is NULL. For the rest, it is a list of exceptions with the new weights that should be used....
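To make the mechanism concrete, here is a minimal sketch of the lookup in Python. It is not the actual Windows data layout: the names DEFAULT_TABLE, EXCEPTION_TABLES, and primary_weight are hypothetical, the tables hold only the first two bytes of each sort key (real entries carry more than that), and en-US is assumed to be one of the locales whose exception pointer is NULL. The weights themselves are the ones from the dumps further down in this post.

```python
# Default table: character -> leading weight bytes (from the dumps in this post).
DEFAULT_TABLE = {
    "i": bytes.fromhex("0e32"),
    "j": bytes.fromhex("0e35"),
    "x": bytes.fromhex("0ea6"),
    "y": bytes.fromhex("0ea7"),
}

# Hypothetical per-locale "pointer" to an exception table.
# None plays the role of NULL: the locale uses the default table as-is.
EXCEPTION_TABLES = {
    "en-US": None,                          # assumption: no exceptions
    "lt-LT": {"y": bytes.fromhex("0e33")},  # y gets a new weight
}


def primary_weight(locale, ch):
    """Return the weight for ch, preferring the locale's exception entry."""
    exceptions = EXCEPTION_TABLES.get(locale)
    if exceptions is not None and ch in exceptions:
        return exceptions[ch]
    return DEFAULT_TABLE[ch]


print(primary_weight("en-US", "y").hex())  # weight from the default table
print(primary_weight("lt-LT", "y").hex())  # weight from the exception table
```

The point of the design is that a locale only has to list the characters it disagrees about; everything else falls through to the default table.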

Exceptions are very powerful -- just using the information we have so far:

The net effect is the ability to handle the fact that every person in the world intuitively understands the meaning of alphabetical order even though many of those intuitions conflict with each other....

So here are the weights of i, j, x, and y in English versus Lithuanian:

en-US  i  0e 32 01 01 01 01 00
en-US  j  0e 35 01 01 01 01 00
en-US  x  0e a6 01 01 01 01 00
en-US  y  0e a7 01 01 01 01 00

lt-LT  i  0e 32 01 01 01 01 00
lt-LT  j  0e 35 01 01 01 01 00
lt-LT  x  0e a6 01 01 01 01 00
lt-LT  y  0e 33 01 01 01 01 00

See how that y moved between the i and the j for Lithuanian? Just slid right into the space....
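If you want to see the slide for yourself, a small Python sketch can sort the four letters by the sort key bytes dumped above. The only assumption is that sort keys are compared byte by byte, which is how Python orders bytes objects; SORT_KEYS and letter_order are names made up for this illustration.

```python
# Sort key dumps copied from the tables above.
SORT_KEYS = {
    "en-US": {
        "i": "0e 32 01 01 01 01 00",
        "j": "0e 35 01 01 01 01 00",
        "x": "0e a6 01 01 01 01 00",
        "y": "0e a7 01 01 01 01 00",
    },
    "lt-LT": {
        "i": "0e 32 01 01 01 01 00",
        "j": "0e 35 01 01 01 01 00",
        "x": "0e a6 01 01 01 01 00",
        "y": "0e 33 01 01 01 01 00",
    },
}


def letter_order(locale):
    """Return the letters ordered by a bytewise comparison of their sort keys."""
    keys = SORT_KEYS[locale]
    return sorted(keys, key=lambda ch: bytes.fromhex(keys[ch]))


print(letter_order("en-US"))  # ['i', 'j', 'x', 'y']
print(letter_order("lt-LT"))  # ['i', 'y', 'j', 'x']
```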

Or (to take another case), how about the w and the v in Finnish vs. English?

en-US  v  00000409 0e a2 01 01 01 01 00
en-US  w  00000409 0e a4 01 01 01 01 00
fi-FI  v  0000041d 0e a2 01 01 01 01 00
fi-FI  w  0000041d 0e a2 01 03 01 01 01 00

See how the w in Finnish moves to just have a secondary distinction?
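One way to check where two keys first part ways is to split each key into its weight levels and compare level by level. The Python sketch below assumes that 01 acts only as the separator between levels and that 00 terminates the key, which matches the dumps above; split_levels and first_difference are hypothetical helper names, and the LCID column from the table is left out of the input.

```python
def split_levels(hex_dump):
    """Split a sort key dump into its weight levels (01 separates, 00 ends)."""
    key = bytes.fromhex(hex_dump)
    if key.endswith(b"\x00"):
        key = key[:-1]  # drop the terminator
    return key.split(b"\x01")


def first_difference(a, b):
    """Return the index of the first level where two keys differ, or None."""
    for level, (x, y) in enumerate(zip(split_levels(a), split_levels(b))):
        if x != y:
            return level
    return None


EN_V = "0e a2 01 01 01 01 00"
EN_W = "0e a4 01 01 01 01 00"
FI_V = "0e a2 01 01 01 01 00"
FI_W = "0e a2 01 03 01 01 01 00"

print(first_difference(EN_V, EN_W))  # 0: different at the first (primary) level
print(first_difference(FI_V, FI_W))  # 1: same primary, different second level
```

So in Finnish the two letters share the 0e a2 weight and only the 03 in the second level keeps them apart.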

The same thing could have been done in Swedish, but I figured since the Swedish Academy wants to see that change (as I pointed out here) I won't go out of my way to show the behavior they want to see changed. :-)

As for moving in the other direction, I have shown this in the past with Polish vs. the default table, in "You can't ignore diacritics when a language does not give them diacritic weight".

Microsoft has never yet made a character that had weight into something weightless by simply changing locales, but we have gone in the other direction on one occasion....

It was back in early 2005, and I posted about it in "Doing a little more in Sri Lanka". We were in a difficult place since those Sinhalese characters had no weight. Cathy Wissink and I were brainstorming, since we knew we had to have some weights so that lists would be able to sort, but we couldn't figure out how to do it without breaking everything. Then one of us had a great idea (I cannot remember who; we both had a lot of bad ideas prior to the good one I am about to mention): could we give these characters weight, but only in the exception table, even if they had no weight in the default table?

It was a very novel idea for us, and one that had honestly never been done before in Windows. But it was being done in SQL Server (then in beta, I believe) as you moved between the older collations and the newer ones, with the same potential problems (if you switched your settings to an older collation, then some characters would suddenly not be able to be sorted). So in the end we did some testing to look at the behavior and after some review by us and others and some native speakers, we did it.

There was no backcompat issue since no existing locale's sort was changed, and given the fact that the people the patch was really intended for would be unlikely to be changing their user locale settings while still needing to sort the data (a fact we probably couldn't assume generically), we decided to go for it. Very unique circumstances, and we made the trick less likely to be needed in the future by adding all of Unicode 5.0 in Vista, so those circumstances are unlikely to recur. But it is possible....

And now we kind of come to the end of the post, though there will be another tomorrow. Stay tuned!

 

This post brought to you by 4 (U+0034, DIGIT FOUR)

