Every character has a story #29: U+1000^H^H^H^H0f40 (TIBETAN or MYANMAR LETTER KA, depending on when you ask)

by Michael S. Kaplan, published on 2007/08/28 03:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/28/4605786.aspx


So I was chatting with Goldie the other day, and just after (or maybe it was just before) I made some ridiculous stretch of a joke about Anatevka (forgetting momentarily that she did not go by Golde; her nom de plume was Goldie), she asked me if there was a test case I knew off the top of my head where collation results changed between XP and Server 2003.

Interestingly, this is a question I have been waiting years for someone to ask, ever since I first pieced together the change that happened! :-)

You see, prior to Server 2003, there was no version support. You know, those functions I mentioned in posts like this one (IsNLSDefinedString and GetNLSVersion).

As a part of the Server 2003 update, a bunch of code points got removed from the table. I'll list a bunch of them and you tell me if you see a pattern:

0x1000  32    2   2  2  ;Tibetan Ka
0x1001  32    3   2  2  ;Tibetan Kha
0x1002  32    4   2  2  ;Tibetan Ga
0x1003  32    5   2  2  ;Tibetan Nga
0x1004  32    6   2  2  ;Tibetan Ca
0x1005  32    7   2  2  ;Tibetan Cha
0x1006  32    8   2  2  ;Tibetan Ja
0x1007  32    9   2  2  ;Tibetan Nya
0x1008  32   10   2  2  ;Tibetan Reversed Ta
0x1009  32   11   2  2  ;Tibetan Reversed Tha
0x100a  32   12   2  2  ;Tibetan Reversed Da
0x100b  32   13   2  2  ;Tibetan Reversed Na
0x100c  32   14   2  2  ;Tibetan Ta
0x100d  32   15   2  2  ;Tibetan Tha
0x100e  32   16   2  2  ;Tibetan Da
0x100f  32   17   2  2  ;Tibetan Na
0x1010  32   18   2  2  ;Tibetan Pa
0x1011  32   19   2  2  ;Tibetan Pha
0x1012  32   20   2  2  ;Tibetan Ba
0x1013  32   21   2  2  ;Tibetan Ma
0x1014  32   22   2  2  ;Tibetan Tsa
0x1015  32   23   2  2  ;Tibetan Tsha
0x1016  32   24   2  2  ;Tibetan Dza
0x1017  32   25   2  2  ;Tibetan Wa
0x1018  32   26   2  2  ;Tibetan Zha
0x1019  32   27   2  2  ;Tibetan Za
0x101a  32   28   2  2  ;Tibetan Aa
0x101b  32   29   2  2  ;Tibetan Ya
0x101c  32   30   2  2  ;Tibetan Ra
0x101d  32   31   2  2  ;Tibetan La
0x101e  32   32   2  2  ;Tibetan Sha
0x101f  32   33   2  2  ;Tibetan Reversed Sha
0x1020  32   34   2  2  ;Tibetan Sa
0x1021  32   35   2  2  ;Tibetan Ha
0x1022  32   36   2  2  ;Tibetan A
0x1026   1    0   3  0  ;Tibetan Vowel Sign I
0x1027   1    0   4  0  ;Tibetan Vowel Sign Short I
0x1028   1    0   5  0  ;Tibetan Vowel Sign U
0x1029   1    0   6  0  ;Tibetan Vowel Sign E
0x102a   1    0   7  0  ;Tibetan Vowel Sign O
0x102b  32   37   2  2  ;Tibetan Chuchenyige
0x102c  32   38   2  2  ;Tibetan Visarga
0x102e   1    0   8  0  ;Tibetan Anusvara
0x102f  32   39   2  2  ;Tibetan Right Brace
0x1030   1    0   9  0  ;Tibetan Under Ring
0x1031  32   40   2  2  ;Tibetan Ditto
0x1033  32   41   2  2  ;Tibetan Single Ornament
0x1034  32   42   2  2  ;Tibetan Shad
0x1035  32   43   2  2  ;Tibetan Tseg
0x1036   1    0  10  0  ;Tibetan Candrabindu
0x1037   1    0  11  0  ;Tibetan Candrabindu With Ornament
0x1038  32   44   2  2  ;Tibetan Comma
0x1039  32   45   2  2  ;Tibetan Rinchanphungshad
0x103a  32   46   2  2  ;Tibetan Rgyanshad
0x103b   1    0  12  0  ;Tibetan Honorific Under Ring
0x103c  32   47   2  2  ;Tibetan Left Brace
0x103d   1    0  13  2  ;Tibetan Vowel Sign Ai
0x103e   1    0  14  2  ;Tibetan Vowel Sign Au
0x1040  12   16  70  2  ;Tibetan Digit Zero
0x1041  12   47  70  2  ;Tibetan Digit One
0x1042  12   66  70  2  ;Tibetan Digit Two
0x1043  12   84  70  2  ;Tibetan Digit Three
0x1044  12  102  70  2  ;Tibetan Digit Four
0x1045  12  121  70  2  ;Tibetan Digit Five
0x1046  12  140  70  2  ;Tibetan Digit Six
0x1047  12  158  70  2  ;Tibetan Digit Seven
0x1048  12  176  70  2  ;Tibetan Digit Eight
0x1049  12  194  70  2  ;Tibetan Digit Nine
0x104a  32   48   2  2  ;Tibetan Double Shad
0x104b   1    0  15  0  ;Tibetan Virama
0x104c   1    0  16  0  ;Tibetan Lenition Mark

The problem here? The data is all wrong!

This version of Tibetan, first described in Unicode Technical Report #2, was removed in Unicode 1.1 when the ISO 10646 merger happened, and then Tibetan was added back in Unicode 2.0 in an entirely different place.

If you look at DerivedAge.txt, you will see that the new Tibetan was added in July 1996.

But Windows had been carrying data around from Unicode 1.0 since the very beginning of its 32-bit life, possibly as far back as NT 3.5 or even NT 3.1 (I am almost curious enough to go try and find out which, actually!).

In Server 2003, it was decided that this incredibly invalid data had to be removed.

For one thing, it is just really bad to start a formal versioning functionality with crap like that in there.

And for another, this space that was left empty after the 1.1 merge was actually filled as of Unicode 3.0 in 1999 -- with the Myanmar script. And even though Windows had not added weights for it yet (we did not do so until Vista), keeping known bad data seemed like a pretty bad idea...

So, all of the above code points had weight in Windows from the early 32-bit days until XP, and then again in Vista (and were essentially weightless in the years between).
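To make "had weight" versus "weightless" a bit more concrete, here is a minimal sketch in Python that parses a few of the rows listed earlier and builds a toy comparison key from them. Several things here are assumptions and not anything the listing itself states: that the four numeric columns are script member, alphabetic weight, diacritic weight, and case weight (the listing does not label them), and that a code point with no row simply contributes nothing to the key. The names Row, parse_row, build_table, and toy_key are hypothetical, and this is in no way the actual Windows comparison algorithm.

```python
from typing import Dict, List, NamedTuple, Tuple


class Row(NamedTuple):
    code_point: int
    script_member: int      # assumed meaning of the 2nd column
    alphabetic_weight: int  # assumed meaning of the 3rd column
    diacritic_weight: int   # assumed meaning of the 4th column
    case_weight: int        # assumed meaning of the 5th column
    comment: str


def parse_row(line: str) -> Row:
    """Parse one line such as '0x1000  32    2   2  2  ;Tibetan Ka'."""
    data, _, comment = line.partition(";")
    cp, sm, aw, dw, cw = data.split()
    return Row(int(cp, 16), int(sm), int(aw), int(dw), int(cw), comment.strip())


def build_table(text: str) -> Dict[int, Row]:
    """Map code point -> Row for every non-blank line in text."""
    rows = (parse_row(line) for line in text.splitlines() if line.strip())
    return {row.code_point: row for row in rows}


def toy_key(text: str, table: Dict[int, Row]) -> List[Tuple[int, int]]:
    """Toy primary key: code points missing from the table add nothing."""
    key = []
    for ch in text:
        row = table.get(ord(ch))
        if row is not None:
            key.append((row.script_member, row.alphabetic_weight))
    return key


# A few rows copied from the listing above.
SAMPLE = """\
0x1000  32    2   2  2  ;Tibetan Ka
0x1001  32    3   2  2  ;Tibetan Kha
0x1026   1    0   3  0  ;Tibetan Vowel Sign I
0x1040  12   16  70  2  ;Tibetan Digit Zero
"""

pre_2003 = build_table(SAMPLE)      # rows still present
server_2003: Dict[int, Row] = {}    # rows removed, so no weights at all

# With the rows present, U+1000 sorts before U+1001 in the toy model.
print(toy_key("\u1000", pre_2003) < toy_key("\u1001", pre_2003))
# With the rows removed, both keys are empty and the two compare as equal.
print(toy_key("\u1000", server_2003) == toy_key("\u1001", server_2003))
```

In this toy model, a pair like U+1000 and U+1001 is exactly the kind of test case that gives one answer with the rows present and a different one with them removed.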

And of course the snapshots in Jet 4.0, ACE (the version of Jet that ships with Access >= 2007), SQL Server 7.0, 2000, and 2005 all have these somewhat bogus code points as well....

Oops for them (plus we can be snotty and superior about it now that it is fixed in Windows!).

When you talk to old timers about the 1.1 merge between Unicode and ISO 10646, you have trouble getting a straight answer -- it is like that bit from The Number of the Beast:

I've given up trying to find out what happened in 1965: "The Year They Hanged the Lawyers." When I asked a librarian for a book on that year and decade, he wanted to know why I needed access to records in locked vaults. I left without giving my name. There is free speech -- but some subjects are not discussed....

So that is all I can say about the old U+1000 TIBETAN LETTER KA which died in Unicode in the early 1990s only to rise from its ashes in 1996 at U+0f40 with U+1000 being assigned to MYANMAR LETTER KA in 1999. The same character lived on at Microsoft until 2003, only to be reborn along with its Myanmar cousin in Vista....
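If you want to check who lives at those two addresses today, a quick sketch using Python's standard unicodedata module will do it. The helper name describe is hypothetical, and the answer reflects whatever version of the Unicode data your Python build ships with; it says nothing about what any version of Windows does.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' using the Unicode data bundled with Python."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


for cp in (0x0F40, 0x1000):
    print(describe(cp))
```

This prints TIBETAN LETTER KA for U+0F40 and MYANMAR LETTER KA for U+1000.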

 

This post brought to you by ཀ and က (U+0f40 and U+1000, a.k.a. TIBETAN LETTER KA and MYANMAR LETTER KA)

