by Michael S. Kaplan, published on 2007/09/14 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/14/4905625.aspx
So far, I have been using the default table for every sort key.
And I don't want to knock the default table -- it is good enough for like 80 different locales on Vista.
But of course that does mean there are quite a few that it is not good enough for....
The way this works is simple -- every locale in Windows has a pointer to an EXCEPTION table, a pointer that for about 80 of them is NULL. For the rest, it points to a list of exceptions: characters paired with the new weights that should be used in place of the default ones....
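To make the lookup concrete, here is a minimal sketch in Python. It is a toy model, not the actual Windows data structures: the names `DEFAULT_TABLE`, `EXCEPTION_TABLES`, and `primary_weight` are made up for illustration, `None` stands in for the NULL pointer, and I am assuming en-US is one of the locales with no exceptions. The weights are the ones listed further down in this post.

```python
# Toy model only: table names and layout are hypothetical, not Windows internals.
# Weights are the leading bytes of the sort keys shown later in this post.
DEFAULT_TABLE = {
    "i": bytes([0x0E, 0x32]),
    "j": bytes([0x0E, 0x35]),
    "x": bytes([0x0E, 0xA6]),
    "y": bytes([0x0E, 0xA7]),
}

# None plays the role of the NULL exception pointer (assumed for en-US).
EXCEPTION_TABLES = {
    "en-US": None,
    "lt-LT": {"y": bytes([0x0E, 0x33])},
}


def primary_weight(locale, ch):
    """Return the exception weight if the locale has one, else the default."""
    exceptions = EXCEPTION_TABLES.get(locale)
    if exceptions is not None and ch in exceptions:
        return exceptions[ch]
    return DEFAULT_TABLE[ch]
```

The point of the design is that a locale only has to carry the handful of characters it disagrees about; everything else falls through to the shared default table.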
Exceptions are very powerful -- just using the information we have so far:
The net effect is the ability to handle the fact that every person in the world intuitively understands the meaning of alphabetical order even though many of those intuitions conflict with each other....
So here are the weights of i, j, x, and y in English versus Lithuanian:
en-US i 0e 32 01 01 01 01 00
en-US j 0e 35 01 01 01 01 00
en-US x 0e a6 01 01 01 01 00
en-US y 0e a7 01 01 01 01 00
lt-LT i 0e 32 01 01 01 01 00
lt-LT j 0e 35 01 01 01 01 00
lt-LT x 0e a6 01 01 01 01 00
lt-LT y 0e 33 01 01 01 01 00
See how that y moved between the i and the j for Lithuanian? Just slid right into the space....
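If you want to see the effect on an actual sort, here is a small sketch that orders the four letters by the sort keys listed above, compared byte by byte. The `KEYS` dictionary and the `sort_letters` helper are just illustration; on Windows the keys would come from the system rather than from a hard-coded table.

```python
# Sort keys copied from the listing above; the helper name is hypothetical.
KEYS = {
    "en-US": {
        "i": "0e 32 01 01 01 01 00",
        "j": "0e 35 01 01 01 01 00",
        "x": "0e a6 01 01 01 01 00",
        "y": "0e a7 01 01 01 01 00",
    },
    "lt-LT": {
        "i": "0e 32 01 01 01 01 00",
        "j": "0e 35 01 01 01 01 00",
        "x": "0e a6 01 01 01 01 00",
        "y": "0e 33 01 01 01 01 00",
    },
}


def sort_letters(locale, letters):
    """Order letters by a plain binary comparison of their sort keys."""
    return sorted(letters, key=lambda ch: bytes.fromhex(KEYS[locale][ch]))


print(sort_letters("en-US", "xyji"))  # ['i', 'j', 'x', 'y']
print(sort_letters("lt-LT", "xyji"))  # ['i', 'y', 'j', 'x']
```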
Or (to take another case), how about the w and the v in Finnish vs. English?
en-US v 00000409 0e a2 01 01 01 01 00
en-US w 00000409 0e a4 01 01 01 01 00
fi-FI v 0000041d 0e a2 01 01 01 01 00
fi-FI w 0000041d 0e a2 01 03 01 01 01 00
See how the w in Finnish takes the same primary weight as the v (0e a2), so that the two differ only by a secondary weight (that 03)?
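One way to check at which level two keys first differ is to split them on the 01 separator bytes and compare field by field. This is a rough sketch that assumes 01 never appears inside a weight (true of the keys shown here); `fields` and `first_difference` are hypothetical helper names, and the keys are the ones from the listing above without the LCID column.

```python
def fields(hex_key):
    """Split a sort key into its levels at the 0x01 separator bytes."""
    return bytes.fromhex(hex_key).split(b"\x01")


def first_difference(key_a, key_b):
    """Return the index of the first level that differs, or None if none do."""
    for level, (a, b) in enumerate(zip(fields(key_a), fields(key_b))):
        if a != b:
            return level
    return None


EN_V, EN_W = "0e a2 01 01 01 01 00", "0e a4 01 01 01 01 00"
FI_V, FI_W = "0e a2 01 01 01 01 00", "0e a2 01 03 01 01 01 00"

print(first_difference(EN_V, EN_W))  # 0: a primary difference in English
print(first_difference(FI_V, FI_W))  # 1: only a secondary difference in Finnish
```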
The same thing could have been done in Swedish, but I figured since the Swedish Academy wants to see that change (as I pointed out here) I won't go out of my way to show the behavior they want to see changed. :-)
As for moving in the other direction, I have shown this in the past in Polish vs. the default table in "You can't ignore diacritics when a language does not give them diacritic weight".
Microsoft has never yet made a character that had weight into something weightless by simply changing locales, but we have gone in the other direction on one occasion....
It was back in early 2005, and I posted about it in "Doing a little more in Sri Lanka". We were in a difficult place since those Sinhalese characters had no weight. Cathy Wissink and I were brainstorming: we knew we had to have some weights so that lists would be able to sort, but we couldn't figure out how to do it without breaking everything. Then one of us had a great idea (I cannot remember who; we both had a lot of bad ideas prior to the good one I am about to mention): what if we gave these characters weight, but only in the exception table, even though they had no weight in the default table?
It was a very novel idea for us, and one that had honestly never been done before in Windows. But it was being done in SQL Server (then in beta, I believe) as you moved between the older collations and the newer ones, with the same potential problems (if you switched your settings to an older collation, then some characters would suddenly not be able to be sorted). So in the end we did some testing to look at the behavior and after some review by us and others and some native speakers, we did it.
There was no backcompat issue since no existing locale's sort was changed, and given the fact that the people the patch was really intended for would be unlikely to be changing their user locale settings while still needing to sort the data (a fact we probably couldn't assume generically), we decided to go for it. These were very unusual circumstances, and we made the trick less likely to be needed in the future by adding all of Unicode 5.0 in Vista, so the situation is unlikely to recur. But it is possible....
And now we kind of come to the end of the post, though there will be another tomorrow. Stay tuned!
This post brought to you by 4 (U+0034, DIGIT FOUR)