by Michael S. Kaplan, published on 2007/08/28 03:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/28/4605786.aspx
So I was chatting with Goldie the other day and, I think just after (or maybe it was just before) I made some ridiculous stretch of a joke about Anatevka (forgetting momentarily that she did not go by Golde; her nom de plume was Goldie), she asked me if there was a test case I knew off the top of my head where collation results changed between XP and Server 2003.
Interestingly, this is a question I have been waiting years for someone to ask, ever since I first pieced together the change that happened! :-)
You see, prior to Server 2003, there was no version support. You know, those functions I mentioned in posts like this one (IsNLSDefinedString and GetNLSVersion).
As a part of the Server 2003 update, a bunch of code points got removed from the table. I'll list a bunch of them and you tell me if you see a pattern:
0x1000 32 2 2 2 ;Tibetan Ka
0x1001 32 3 2 2 ;Tibetan Kha
0x1002 32 4 2 2 ;Tibetan Ga
0x1003 32 5 2 2 ;Tibetan Nga
0x1004 32 6 2 2 ;Tibetan Ca
0x1005 32 7 2 2 ;Tibetan Cha
0x1006 32 8 2 2 ;Tibetan Ja
0x1007 32 9 2 2 ;Tibetan Nya
0x1008 32 10 2 2 ;Tibetan Reversed Ta
0x1009 32 11 2 2 ;Tibetan Reversed Tha
0x100a 32 12 2 2 ;Tibetan Reversed Da
0x100b 32 13 2 2 ;Tibetan Reversed Na
0x100c 32 14 2 2 ;Tibetan Ta
0x100d 32 15 2 2 ;Tibetan Tha
0x100e 32 16 2 2 ;Tibetan Da
0x100f 32 17 2 2 ;Tibetan Na
0x1010 32 18 2 2 ;Tibetan Pa
0x1011 32 19 2 2 ;Tibetan Pha
0x1012 32 20 2 2 ;Tibetan Ba
0x1013 32 21 2 2 ;Tibetan Ma
0x1014 32 22 2 2 ;Tibetan Tsa
0x1015 32 23 2 2 ;Tibetan Tsha
0x1016 32 24 2 2 ;Tibetan Dza
0x1017 32 25 2 2 ;Tibetan Wa
0x1018 32 26 2 2 ;Tibetan Zha
0x1019 32 27 2 2 ;Tibetan Za
0x101a 32 28 2 2 ;Tibetan Aa
0x101b 32 29 2 2 ;Tibetan Ya
0x101c 32 30 2 2 ;Tibetan Ra
0x101d 32 31 2 2 ;Tibetan La
0x101e 32 32 2 2 ;Tibetan Sha
0x101f 32 33 2 2 ;Tibetan Reversed Sha
0x1020 32 34 2 2 ;Tibetan Sa
0x1021 32 35 2 2 ;Tibetan Ha
0x1022 32 36 2 2 ;Tibetan A
0x1026 1 0 3 0 ;Tibetan Vowel Sign I
0x1027 1 0 4 0 ;Tibetan Vowel Sign Short I
0x1028 1 0 5 0 ;Tibetan Vowel Sign U
0x1029 1 0 6 0 ;Tibetan Vowel Sign E
0x102a 1 0 7 0 ;Tibetan Vowel Sign O
0x102b 32 37 2 2 ;Tibetan Chuchenyige
0x102c 32 38 2 2 ;Tibetan Visarga
0x102e 1 0 8 0 ;Tibetan Anusvara
0x102f 32 39 2 2 ;Tibetan Right Brace
0x1030 1 0 9 0 ;Tibetan Under Ring
0x1031 32 40 2 2 ;Tibetan Ditto
0x1033 32 41 2 2 ;Tibetan Single Ornament
0x1034 32 42 2 2 ;Tibetan Shad
0x1035 32 43 2 2 ;Tibetan Tseg
0x1036 1 0 10 0 ;Tibetan Candrabindu
0x1037 1 0 11 0 ;Tibetan Candrabindu With Ornament
0x1038 32 44 2 2 ;Tibetan Comma
0x1039 32 45 2 2 ;Tibetan Rinchanphungshad
0x103a 32 46 2 2 ;Tibetan Rgyanshad
0x103b 1 0 12 0 ;Tibetan Honorific Under Ring
0x103c 32 47 2 2 ;Tibetan Left Brace
0x103d 1 0 13 2 ;Tibetan Vowel Sign Ai
0x103e 1 0 14 2 ;Tibetan Vowel Sign Au
0x1040 12 16 70 2 ;Tibetan Digit Zero
0x1041 12 47 70 2 ;Tibetan Digit One
0x1042 12 66 70 2 ;Tibetan Digit Two
0x1043 12 84 70 2 ;Tibetan Digit Three
0x1044 12 102 70 2 ;Tibetan Digit Four
0x1045 12 121 70 2 ;Tibetan Digit Five
0x1046 12 140 70 2 ;Tibetan Digit Six
0x1047 12 158 70 2 ;Tibetan Digit Seven
0x1048 12 176 70 2 ;Tibetan Digit Eight
0x1049 12 194 70 2 ;Tibetan Digit Nine
0x104a 32 48 2 2 ;Tibetan Double Shad
0x104b 1 0 15 0 ;Tibetan Virama
0x104c 1 0 16 0 ;Tibetan Lenition Mark
The problem here? The data is all wrong!
This version of Tibetan, first described in Unicode Technical Report #2, was removed in Unicode 1.1 when the ISO 10646 merger happened, and then Tibetan was added back in Unicode 2.0 in an entirely different place.
If you look at DerivedAge.txt, you will see that the new Tibetan was added in July 1996.
But Windows had been carrying data around from Unicode 1.0 since the very beginning of its 32-bit life, possibly as far back as NT 3.5 or even NT 3.1 (I am almost curious enough to go try and find out which, actually!).
In Server 2003, it was decided that this incredibly invalid data had to be removed.
For one thing, it is just really bad to start a formal versioning functionality with crap like that in there.
And for another, this space that was left empty after the 1.1 merge was actually filled as of Unicode 3.0 in 1999 -- with the Myanmar script. And even though Windows did not add weights for it yet (we did not do so until Vista), keeping known bad data seemed like a pretty bad idea...
So, all of the above code points had weight in Windows from the early 32-bit days until XP, and then again in Vista (and were essentially weightless in the years between).
And of course the snapshots in Jet 4.0, ACE (the version of Jet that ships with Access >= 2007), SQL Server 7.0, 2000, and 2005 all have these somewhat bogus code points as well....
Oops for them (plus we can be snotty and superior about it now that it is fixed in Windows!)
When you talk to old timers about the 1.1 merge between Unicode and ISO 10646, you have trouble getting a straight answer -- it is like that bit from The Number of the Beast:
I've given up trying to find out what happened in 1965: "The Year They Hanged the Lawyers." When I asked a librarian for a book on that year and decade, he wanted to know why I needed access to records in locked vaults. I left without giving my name. There is free speech -- but some subjects are not discussed....
So that is all I can say about the old U+1000 TIBETAN LETTER KA which died in Unicode in the early 1990s only to rise from its ashes in 1996 at U+0f40 with U+1000 being assigned to MYANMAR LETTER KA in 1999. The same character lived on at Microsoft until 2003, only to be reborn along with its Myanmar cousin in Vista....
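If you want to see for yourself who lives at those addresses these days, here is a minimal sketch in Python. It only asks the interpreter's own copy of the Unicode Character Database (through the standard unicodedata module) for the current names; it knows nothing about Windows sort weights, and the helper name current_name is just something made up for this illustration.

```python
import unicodedata


def current_name(code_point: int) -> str:
    """Return the name that Python's Unicode database has for a code point.

    current_name is a hypothetical helper for this illustration; it falls
    back to a placeholder string when the code point has no name.
    """
    return unicodedata.name(chr(code_point), "<unassigned>")


# U+0F40 is where Tibetan Ka lives now; U+1000 and U+1040 are two of the
# code points from the old table above.
for cp in (0x0F40, 0x1000, 0x1040):
    print(f"U+{cp:04X} {current_name(cp)}")
```

On any reasonably current Python, the two entries from the old table should come back with MYANMAR names rather than Tibetan ones, which is the whole problem in a nutshell.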
This post brought to you by ཀ and က (U+0f40 and U+1000, a.k.a. TIBETAN LETTER KA and MYANMAR LETTER KA)