Traditional to Simplified or vice-versa? According to Windows, you're on your own....

by Michael S. Kaplan, published on 2007/10/22 10:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/22/5588441.aspx

Stephen asks via the Contact link:

I'm making a program doing a Traditional/Simplified Chinese conversion in Delphi. However, all the web pages shown in Google search results are LCMapString. I did try using this LCMapString. LCMapString can do a mapping of 26xx characters. However, I find that it can't do a correct conversion mapping from Traditional to Simplified one.

I wrote a simple routine to convert a list of the 26xx Traditional Chinese Characters. Then compare the result to a list of the characters in Simplified Chinese. The LCMapString can only convert 22xx characters. Is there any document in the MSDN mentioned this bug?

Stephen
from Hong Kong

This is not a bug.

These are the two mappings that LCMapString provides via the LCMAP_SIMPLIFIED_CHINESE and LCMAP_TRADITIONAL_CHINESE flags, which I discussed in LCMapString's *other* job:

LCMAP_SIMPLIFIED_CHINESE -- Maps traditional Chinese characters to simplified Chinese, passing through other characters unchanged. Thus 樂 (U+6a02) becomes 乐 (U+4e50). The dictionary used for this mapping is small (only 2,620 ideographs) and has not been updated since the feature was added in NT 4.0 (it was originally added at the request of people in Office, who actually ended up going with their own more sophisticated dictionary solution in Word that does a better job with the sometimes complicated mapping). Now although casing, width, and Kana mappings can all be done in place, this is not allowed for traditional->simplified Chinese mappings, even though the same restrictions (always the same length, etc.) apply here -- if any NLS testers who are reading this want to put in a bug, someone could see about fixing that!

LCMAP_TRADITIONAL_CHINESE -- Maps simplified Chinese characters to traditional Chinese, passing through other characters unchanged. Thus 侩 (U+4fa9) becomes 儈 (U+5108). The dictionary used for this mapping is even smaller (only 2,191 ideographs), since several traditional Chinese ideographs will often map to one simplified ideograph (thus these two flags are not 100% reversible versions of each other). The table has not been updated since the LCMAP_SIMPLIFIED_CHINESE one was. Same problems with in-place update apply here -- if any NLS testers who are reading this want to put in a bug, it will be resolved as a duplicate of the other bug I was suggesting, above!

Notice those counts in there -- that is all that exists in the tables provided by Win32.
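The gap between the two counts follows from the many-to-one cases. Here is a minimal Python sketch of that effect, assuming a hand-picked three-entry table chosen purely for illustration (it is not the data Windows ships, and the names T2S, S2T, and map_chars are hypothetical):

```python
# Toy traditional -> simplified table (illustration only, not Windows data).
# 發 (U+767C) and 髮 (U+9AEE) both simplify to 发 (U+53D1).
T2S = {"發": "发", "髮": "发", "樂": "乐"}

# Inverting a many-to-one table has to drop entries:
# only one traditional form can survive per simplified form.
S2T = {}
for trad, simp in T2S.items():
    S2T.setdefault(simp, trad)  # keep the first traditional form seen


def map_chars(text, table):
    """Map each character through `table`, passing others through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


original = "髮樂"
simplified = map_chars(original, T2S)    # both characters are mapped
round_trip = map_chars(simplified, S2T)  # 发 comes back as 發, not 髮
```

The reverse table ends up with fewer entries than the forward one, and the round trip does not restore the original string, which is the same reason the two flags are not reversible versions of each other.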

As of Unicode 4.0.1 (the last time I looked into the matter), the Unihan.txt data file provided in the Unicode Character Database does not provide a much larger list: 2629 simplified mappings (including some from Extension B that Windows doesn't have) and 2554 traditional mappings (with analogous additional entries).
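For anyone who wants to build a table from the Unihan data instead, the relevant fields are kSimplifiedVariant and kTraditionalVariant. The following Python sketch shows one way to read them, assuming the tab-separated "code point, field, value" layout; the SAMPLE lines are written in that shape for illustration and should be checked against a real copy of the file, and load_variants is a hypothetical helper name:

```python
import io

# Sample lines in the tab-separated shape of the Unihan data (illustration only).
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+6A02\tkSimplifiedVariant\tU+4E50\n"
    "U+4E50\tkTraditionalVariant\tU+6A02\n"
    "U+53D1\tkTraditionalVariant\tU+767C U+9AEE\n"
)


def load_variants(lines, field):
    """Return {source_char: [target_chars]} for one Unihan variant field."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue  # some other field, or a malformed line
        source = chr(int(parts[0][2:], 16))  # "U+6A02" -> "樂"
        # A value may list several space-separated code points.
        targets = [chr(int(tok[2:], 16)) for tok in parts[2].split()]
        table[source] = targets
    return table


to_traditional = load_variants(io.StringIO(SAMPLE), "kTraditionalVariant")
to_simplified = load_variants(io.StringIO(SAMPLE), "kSimplifiedVariant")
```

Keeping a list of targets per source character makes the one-to-many entries visible, so the caller has to decide how to choose among them rather than having the choice made silently.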

Clearly Microsoft has better mappings, like the ones in Microsoft Office I mentioned, but I don't know of any source available to programmers from Microsoft. People like Raymond Chen have, however, mentioned other, external sources for doing the conversion in the past.

And if you look around on the web you can find lots of implementations in Perl and other languages, and some additional data tables to support the work.

When you consider the problems that face people in relation to IDN and traditional/simplified mappings, it really seems like this problem should be something that Windows does better.

A bit of trivia:

most of the simplified Han used by these two mappings are not in Windows code page 950;
all of the simplified Han used by these two mappings are in Windows code page 936;
most of the traditional Han used by these two mappings are not in Windows code page 936;
all of the traditional Han used by these two mappings are in Windows code page 950;

LCMapStringA converts to and from Unicode using the ACP of the locale you pass in, which can only amount to 936 or 950 and not both.

This makes LCMapStringA (rather than LCMapStringW) almost entirely useless for these two mappings, even though minimal effort to simply assume the code pages to use (936 or 950) based on the mapping direction would fix that. This problem has existed for every version of Windows since these flags have been supported, so I guess people aren't eager to make changes in this space....
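Calling the W entry point directly avoids the code page round trip altogether. What follows is a hedged Python sketch using ctypes, not a definitive implementation: the helper name lcmap_chinese is hypothetical, the use of LOCALE_INVARIANT rests on the assumption that the locale does not influence these two flags, and the function simply returns None when it is not running on Windows.

```python
import ctypes
import sys

LCMAP_SIMPLIFIED_CHINESE = 0x02000000   # traditional -> simplified
LCMAP_TRADITIONAL_CHINESE = 0x04000000  # simplified -> traditional
LOCALE_INVARIANT = 0x007F  # assumption: locale choice does not affect these flags


def lcmap_chinese(text, flag):
    """Map `text` with LCMapStringW; returns None when not on Windows."""
    if sys.platform != "win32":
        return None
    if not text:
        return text  # LCMapStringW rejects a zero-length source
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    lcmap = kernel32.LCMapStringW
    lcmap.argtypes = [ctypes.c_uint32, ctypes.c_uint32,
                      ctypes.c_wchar_p, ctypes.c_int,
                      ctypes.c_wchar_p, ctypes.c_int]
    lcmap.restype = ctypes.c_int
    # Count UTF-16 code units, since that is what the API measures.
    units = len(text.encode("utf-16-le")) // 2
    # Separate destination buffer: in-place mapping is not allowed for these flags.
    dest = ctypes.create_unicode_buffer(units + 1)
    written = lcmap(LOCALE_INVARIANT, flag, text, units, dest, units)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return dest.value


result = lcmap_chinese("樂", LCMAP_SIMPLIFIED_CHINESE)
```

On Windows, result should be 乐 per the example earlier in this post; the destination buffer is sized to match the source because these mappings always produce output of the same length.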

This post brought to you by 傧 and 儐 (U+50a7 and U+5110, two Han ideographs with a relationship you can probably discern from context)

# Scott on 24 Oct 2007 4:45 PM:

I think many people who are trying to find a simple function to map between simplified and traditional characters may be looking for the wrong tool for the job. The problem is that markets that use Traditional characters may have very different vocabulary than that used in markets that use Simplified characters. Even if you could deal with the many-to-one problem, in most real world cases, converting between traditional and simplified Chinese is more like translation than case mapping.

