Overheard in microsoft.public.win32.programmer.international

by Michael S. Kaplan, published on 2005/03/31 19:51 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/31/404370.aspx

It is an interesting question, one that people often ask for different languages when they start to understand the cost of going to Unicode. That cost can be the extra performance hit of strings that are twice as big, or the hit of conversion any time you need to move into or out of Unicode. It is entirely reasonable to ask the question Ernst did, to determine if there is a real benefit.
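To put a rough number on the "twice as big" cost, here is a minimal Python sketch. The helper name `encoded_sizes` is made up for illustration, and Python's `cp1252`, `cp950` and `utf-16-le` codecs stand in for the Windows code pages and the UTF-16 strings Windows uses internally.

```python
def encoded_sizes(text, legacy_codec):
    """Return (bytes in the legacy code page, bytes in UTF-16LE) for text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    # ASCII-only text is one byte per character in a code page,
    # two bytes per character in UTF-16.
    print(encoded_sizes("Hello, world", "cp1252"))
    # Double-byte Chinese text is already two bytes per character.
    print(encoded_sizes("\u4e2d\u6587", "cp950"))
```

So the doubling is real for ASCII-heavy text, while text that is already double-byte in its code page costs about the same in UTF-16.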

Starting with the question about Big5, it is an industrial standard out of Taiwan that roughly maps to the portion of the CNS-11643 standard that was picked by industry.

Looking at Unicode today (as of the just-released Unicode 4.1), there are 71,226 Han ideographs. The latest updates to CNS-11643 seem to suggest that over 56,000 of them are in one sense or another attested in Taiwan.

Looking at the Big5 code page on Windows, there are 20,321 characters in it (and not all of them are ideographs; that includes the ASCII stuff, some other single-byte stuff, and some of the Kana characters).

This seems to give the best possible answer to whether there are drawbacks other than being restricted to Chinese -- how about being restricted in how much Chinese you can use here, too?
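One way to see the restriction for yourself is to ask a Big5 codec which characters it can encode. The sketch below uses Python's `cp950` codec as a stand-in for the Windows Big5 code page. The helper names are hypothetical, and the count it prints reflects Python's mapping table, which may not match the Windows numbers above exactly.

```python
def in_code_page(ch, codec="cp950"):
    """Return True if the character can be encoded in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def count_encodable(first, last, codec="cp950"):
    """Count code points in [first, last] that the code page can encode."""
    return sum(in_code_page(chr(cp), codec) for cp in range(first, last + 1))


if __name__ == "__main__":
    print(in_code_page("\u4e2d"))       # U+4E2D, a common ideograph
    print(in_code_page("\U00020000"))   # U+20000, first ideograph of Extension B
    # How much of the basic CJK Unified Ideographs block does the codec cover?
    print(count_encodable(0x4E00, 0x9FFF), "of", 0x9FFF - 0x4E00 + 1)
```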

You can see similar problems with over 20,000 ideographs in Korean National standards, over 50,000 missing from code page 936 for China, and thousands missing from code page 932 for Japanese.

And to a somewhat lesser extent, the same is true of almost every code page. Vietnamese uses all of the following combining marks, only some of which actually have representation in the code page: grave, hook above, tilde, acute, dot below, breve, circumflex, and horn. This was probably less an oversight than a consequence of the fact that there is not enough room.
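The Vietnamese case can be checked with a short Python sketch, assuming Python's `cp1258` codec matches the Windows Vietnamese code page. The `MARKS` table and the `encodable` helper are names invented for this example.

```python
import unicodedata

# The eight combining marks listed above, as Unicode combining characters.
MARKS = {
    "grave": "\u0300",
    "hook above": "\u0309",
    "tilde": "\u0303",
    "acute": "\u0301",
    "dot below": "\u0323",
    "breve": "\u0306",
    "circumflex": "\u0302",
    "horn": "\u031b",
}


def encodable(text, codec="cp1258"):
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name, mark in MARKS.items():
        print(f"{name:12} U+{ord(mark):04X} {encodable(mark)}")
    # The fully precomposed (NFC) spelling uses U+1EC7, which the code page lacks.
    nfc = unicodedata.normalize("NFC", "Vi\u1ec7t")
    # The same word spelled as e-circumflex followed by combining dot below.
    mixed = "Vi\u00ea\u0323t"
    print(encodable(nfc), encodable(mixed))
```

In this mapping, breve, circumflex and horn are available only as part of precomposed base letters, so text has to arrive in a particular mixed form before it fits in the code page at all.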

Try asking Dr. International. As he pointed out in "Arabic: Script or Language?" there is certainly a lot of the Arabic script that could not be fit into code page 1256. There are many languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, Urdu, and others, that the code page simply has no room for. They need Unicode, too.

Of course it goes without saying that the many "Unicode only" languages that have been added to Windows, starting with Windows 2000 and continuing in Windows XP and XP SP2, clearly require Unicode, unless your users speak fluent question mark.
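For anyone who has not yet met "fluent question mark", here is a minimal sketch of how it happens. It uses Python's `cp1256` codec with `errors="replace"` as an approximation of a lossy conversion to the Arabic code page. The helper name `to_code_page` is hypothetical, and U+06CE is used because it is an Arabic script letter that code page 1256 does not contain.

```python
import unicodedata


def to_code_page(text, codec):
    """Convert to a code page, substituting '?' for anything it cannot hold."""
    return text.encode(codec, errors="replace")


if __name__ == "__main__":
    for ch in ("\u0628", "\u06ce"):
        converted = to_code_page(ch, "cp1256")
        # Decode again to show what the user actually gets back.
        print(unicodedata.name(ch), converted, converted.decode("cp1256"))
```

Once the character has become a question mark there is no way to recover the original from the converted bytes.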

And it happens with many other languages, too, and with the code pages that are nominally underneath them.

The cost of not moving to Unicode is getting higher and higher all the time. The time to move to Unicode is not even today, it is last week!

This post brought to you by "ێ" (U+06ce, a.k.a. ARABIC LETTER YEH WITH SMALL V)
One of the many Arabic script characters that exist in Unicode that are not in code page 1256.

"being restricted in how much Chinese you can use here, too"
Indeed.
Most programmers and web page designers in Taiwan still have no idea what Unicode is about, and whenever they encounter a character that's not encoded in Big5, they think Windows doesn't support it.
What a shame!

The sad thing is that there are standards like Shift_JIS-2004 (part of JIS X 0213) and Big5 with HKSCS that provide support for the missing chars while continuing to use DBCS, but they cannot be supported in Windows as the ACP, because at least some of the characters convert to surrogate pairs!
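A small sketch of what converting to surrogates looks like in practice. The helper `utf16_units` is a made-up name, and U+20B9F is picked only as an illustration: to my knowledge it is one of the JIS X 0213 characters that lives outside the Basic Multilingual Plane.

```python
def utf16_units(ch):
    """Return the UTF-16 code units for a single character as integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    for ch in ("\u4e2d", "\U00020B9F"):
        units = " ".join(f"{u:04X}" for u in utf16_units(ch))
        print(f"U+{ord(ch):04X} -> {units}")
```

A BMP character comes out as one 16-bit unit, while the supplementary character needs a high and a low surrogate, so one double-byte character in the legacy encoding becomes two UTF-16 code units.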
