Can you Vai for attention and get in a Tif[inagh] with [Marie] Osman[ya]? N'ko way!

by Michael S. Kaplan, published on 2010/05/25 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/05/25/10014052.aspx

I may have blown the stack on ridiculous puns with this blog. If your stack was blown then please accept my apologies (and be happy there were no other scripts in the font to continue the punning!).

So the other day Charles Riley, friend and colleague across the vast divide that is Unicode, contacted me over IM (I must admit Facebook has been a boon for some communications; the key is to be in front of the machine with the right tab showing in the browser and to know who to answer when their window pops up!).

Anyhow, Charles pointed out a small oops in Word, and in particular in its famous Insert Symbol... dialog.

It is something he found in looking at the dialog with his interest in African languages....

In particular, the Ebrima font in Windows 7 and its support of N'ko (U+07C0 - U+07FF):

It looks like a simple set of range errors in the code supporting the dialog. Presumably this would be easy to fix, though I couldn't say for sure without looking at the code....
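To make "range errors" concrete, here is a minimal sketch of the kind of lookup such a dialog needs: a table of block boundaries and a function that maps a code point to a block name. This is not Word's actual code (which I have not seen); the `BLOCKS` table and `block_name` function are hypothetical names. The N'ko and Tifinagh ranges are the ones quoted in this post, while the Vai and Osmanya ranges are taken from the Unicode block list.

```python
from bisect import bisect_right

# (first, last, name) for each block, sorted by first code point.
# Illustrative table only, not taken from any shipping product.
BLOCKS = sorted([
    (0x07C0, 0x07FF, "N'Ko"),
    (0x2D30, 0x2D7F, "Tifinagh"),
    (0xA500, 0xA63F, "Vai"),
    (0x10480, 0x104AF, "Osmanya"),
])
STARTS = [first for first, _last, _name in BLOCKS]


def block_name(cp):
    """Return the block name containing code point cp, or None."""
    # Find the last block whose start is <= cp, then check its upper bound.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _first, last, name = BLOCKS[i]
        if cp <= last:
            return name
    return None
```

An off-by-one or stale entry in a table like this is enough to put a character under the wrong subset name while leaving insertion itself untouched.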

Unfortunately this bug (which Charles found in Office 2007) also occurs in the recently released Office 2010.

The above screenshots were all from Office 2007, but I verified the same bug is in the new version as well.

Now Windows and its Character Map are running a slightly different record, with N'ko (U+07c0 - U+07ff) and Tifinagh (U+2d30 - U+2d7f) working:

That blog, incidentally, explains how no one knows how to update the names on the ranges in Character Map, which suggests that either they figured it out and didn't remember to add Vai, or that the support for N'ko and Tifinagh has been in Character Map since XP even though the fonts haven't been there.

I must admit I am not too curious to know which, since the fact that Tifinagh was added in Unicode 4.1, N'Ko was added in Unicode 5.0, and Vai was added in Unicode 5.1 suggests either psychic abilities or the incomplete update hypothesis. :-)

One of the last things Charles asked me was what I thought the odds for a service pack fix were (for either bug or both).

A fix for either one before the next major version seems unlikely, since the functionality was in no way impacted in either case -- just the cool feature of seeing the name in the user interface....

Even in Windows 7 there's no N'ko or Tifinagh subrange. Your screenshot doesn't show there being one - it shows character names, which come from getuname.dll [semi-relevant blog link: blogs.msdn.com/.../511920.aspx ]

As for the ranges themselves... I did this before, but the trail ran cold at the string ids; I never thought to open up charmap.exe in a resource editor, and it was an old blog anyway so I never posted the comment at the time:

16-byte header: 'UCEX' magic, DWORD offset of name, WORD codepage*, WORD count of ranges, two WORDs of unknown purpose.

*it's 1200 in subrange.uce, 932 in ShiftJis.uce, 936 in gb2312.uce, etc., but the character data is always in Unicode.

16-byte item per range: DWORD offset of name, DWORD offset of start of character data, DWORD count of characters, DWORD always zero.
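The two layouts above can be sketched with Python's `struct` module. This is only a sketch under assumptions the description does not confirm: that all fields are little-endian, and that the range items start immediately after the 16-byte header. The names `UceHeader`, `UceRange` and `parse_uce` are made up for illustration.

```python
import struct
from collections import namedtuple

UceHeader = namedtuple(
    "UceHeader", "name_offset codepage range_count unknown1 unknown2")
UceRange = namedtuple(
    "UceRange", "name_offset data_offset char_count reserved")

# 'UCEX' magic, DWORD, WORD, WORD, WORD, WORD = 16 bytes (assumed little-endian)
HEADER = struct.Struct("<4sIHHHH")
# Four DWORDs = 16 bytes per range item
RANGE = struct.Struct("<IIII")


def parse_uce(data):
    """Parse the header and range table from the raw bytes of a .uce file."""
    magic, *fields = HEADER.unpack_from(data, 0)
    if magic != b"UCEX":
        raise ValueError("missing UCEX magic")
    header = UceHeader(*fields)
    # Assumption: range items are packed back to back right after the header.
    ranges = [
        UceRange(*RANGE.unpack_from(data, HEADER.size + i * RANGE.size))
        for i in range(header.range_count)
    ]
    return header, ranges
```

Called as `parse_uce(open(path, "rb").read())`, it would return the header fields plus one `UceRange` per entry, leaving the two unknown WORDs and the always-zero DWORD available for inspection.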

Character data itself - couldn't be simpler: a newline character U+000A signifies the "line breaks" you see. Each range's character data begins immediately after the previous one.
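Reading one range's characters back out might look like the following. The sketch assumes the character data is UTF-16LE, that the offset is in bytes from the start of the file, and that the count is in 16-bit code units; none of that is stated above, and `range_rows` is a hypothetical helper name.

```python
def range_rows(data, data_offset, char_count):
    """Return the rows of one range, split at the U+000A row separators."""
    # Assumption: 2 bytes per character (UTF-16LE code units).
    raw = data[data_offset:data_offset + 2 * char_count]
    text = raw.decode("utf-16-le")
    return text.split("\n")
```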

Names - there's a list at the end of the file of strings of the form "010100" "010101" "010102" etc. They appear to be string IDs into the string table of charmap.exe[.mui]. If a string not made of digits appears in the file, it looks like it'll use that as the name [but it won't be localizable, obviously]
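The naming rule as I understand it can be written down in a few lines. This is a guess at the behaviour, not the real charmap logic, and `classify_name` is a hypothetical helper; the ID is kept as a string because I don't know how charmap converts it to a resource number.

```python
def classify_name(name):
    """Classify a name from a .uce file as a string-table ID or literal text."""
    # All-digit names such as "010100" appear to be string IDs into the
    # string table of charmap.exe[.mui]; anything else is used as-is.
    if name and name.isascii() and name.isdigit():
        return ("string_id", name)
    return ("literal", name)
```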

All that's left are those last two words in the header:

ShiftJis.uce: 0,1; SubRange.uce: 0,1; bopomofo.uce: 4,11; gb2312.uce: 3,9; ideograf.uce: 12,21; kanji_1.uce: 5,10; kanji_2.uce: 0,1; korean.uce: 2,7. Surely someone with access to the charmap source code could figure out what they're for [and whether or not that all-zero dword in the range structure is used for anything], and a tool could be written to build uce files from scratch.

Now, this is assuming that the tool they're built with has in fact been lost - if it's just that it exists but no-one who cares knows what it is - well, that's a different problem, one not as easily solved by an outsider.