Raymond's Chinese dictionary

by Michael S. Kaplan, published on 2005/05/10 12:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/10/416008.aspx


Raymond Chen is going to be developing a Chinese dictionary over the next while. This is a really cool project that I am definitely keeping an eye on, for a lot of reasons, one of which is related to my prior blather about how IMEs have it easy. Dictionaries have some of the same conundrums as we do in collation, depending on what they do with words that use the same ideograph but have different pronunciations....

Anyway, in today's post, Loading the dictionary, part 1: Starting point, there was one bit that jumped out at me:

Since it was the Big5 dictionary we downloaded, the Chinese characters are in Big5 format, known to Windows as code page 950. Our program will be Unicode, so we'll have to convert it as we load the dictionary. Yes, I could've used the Unicode version of the dictionary, but it so happens that when I set out to write this program, there was no Unicode version available. Fortunately, this oversight opened up the opportunity to illustrate some other programming decisions and techniques.

Now this is very interesting, but what it made me think about was all of the work that the government of the PRC has been doing to provide pronunciation data for ideographs (by some reports over 60,000 of the over 70,000 ideographs in Unicode/GB18030 -- many of them actually traditional rather than simplified!). That left me wondering how well the Big5 code page can really handle Chinese, since it contains mappings for fewer than 20,000 ideographs.

As I said in this post, almost every language needs Unicode these days for full representation. But that assumes data suitable for a dictionary is available. Is it? That is something I do not know.

Now a dictionary is perhaps in a special category since by its very nature it does not need to be limited to the commonly known ideographs -- in theory, it may have data well beyond what people may commonly know. But what does that mean in practice? If an English dictionary contained many Latin script characters I did not recognize, what would I do? I'd probably buy a different dictionary.

How does one draw the line when one has that many ideographs to deal with? One may end up with a set not too much bigger than the Big5 data provides, for everyday usage.

Can code pages in some cases be used as 'repertoire fences' that help keep us inside the list of typically used characters, even if we are using Unicode for the actual work? Such an architecture would allow the flexibility to add words outside of the code page when you need to, something that could be crucial depending on what you are doing....
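As a rough illustration of the idea, here is a minimal Python sketch that keeps all text in Unicode and uses a code page only as a membership test. The helper names `in_fence` and `outside_fence` are made up for this example, and it assumes Python's built-in `cp950` codec is a reasonable stand-in for the code page 950 repertoire; it is not how any particular dictionary program does it.

```python
def in_fence(text, codec="cp950"):
    """Return True if every character in text can be encoded in codec.

    The text itself stays Unicode; the codec is only used as a test of
    whether the characters fall inside that code page's repertoire.
    """
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def outside_fence(text, codec="cp950"):
    """Return the characters of text that fall outside the codec's repertoire."""
    return [ch for ch in text if not in_fence(ch, codec)]


if __name__ == "__main__":
    word = "\u5b57\u5178"      # the two ideographs from the sign-off below
    rare = "\U00020000"        # a CJK Extension B ideograph, beyond the BMP
    print(in_fence(word))      # inside the fence
    print(in_fence(rare))      # outside the fence
    print(outside_fence(word + rare))
```

Nothing is rejected here: a word that falls outside the fence is still valid Unicode and can be stored anyway, which is the flexibility described above. The fence only tells you when you have wandered outside the everyday set.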

 

This post brought to you by "字典" (U+5b57 U+5178, which I believe means 'Dictionary')


# CornedBee on 10 May 2005 4:29 PM:

> by some reports over 60,000 of the over 70,000 ideographs in Unicode/GB18030

Suddenly the 65000 characters in a Windows WCHAR or a VC++ wchar_t seem so little ... when will it be expanded to 32 bits? (Or is that HCHAR - for huge character?)

# Michael S. Kaplan on 10 May 2005 4:44 PM:

Very good question! I will talk about UTF-32 another time.... :-)

# Michael S. Kaplan on 10 May 2005 10:51 PM:

No problem, someone is trying to port it to managed code anyway, for performance tests!

# Alex on 10 May 2005 11:01 PM:

What does it have to do with UTF-32? UTF-8 or UTF-16 can represent all the currently defined planes of 10646. I think it is an encoding or surrogates issue, not a 32-bit issue.

# Michael S. Kaplan on 10 May 2005 11:31 PM:

Hey Alex -- many people consider UTF-8 to be much harder for operations involving code points, given its 1/2/3/4-byte nature. For ease of coding, performance, and size, it is much easier on Windows to deal with UTF-16 most of the time, and with UTF-32 in select circumstances....
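To make the trade-off concrete, here is a small Python sketch (the `code_units` helper is invented for this illustration) that counts how many code units a single code point takes in each encoding form. It uses one BMP ideograph and one ideograph from beyond the BMP.

```python
def code_units(ch):
    """Count the code units needed for one code point in each encoding form.

    UTF-8 code units are 1 byte, UTF-16 code units are 2 bytes and
    UTF-32 code units are 4 bytes, so the byte length is divided accordingly.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")) // 2,
        "utf-32": len(ch.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    bmp = "\u5b57"                 # U+5B57, inside the BMP
    supplementary = "\U00020000"   # U+20000, CJK Extension B

    print(code_units(bmp))            # {'utf-8': 3, 'utf-16': 1, 'utf-32': 1}
    print(code_units(supplementary))  # {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}

    # The two UTF-16 code units for U+20000 are a surrogate pair.
    print(supplementary.encode("utf-16-be").hex())  # d840dc00
```

All three forms can represent every code point. The difference is only in how often one code point spans more than one code unit: routinely in UTF-8, only beyond the BMP in UTF-16, and never in UTF-32.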


referenced by

2007/02/28 What do they mean when they say 'GB18030 Characters' ?

2007/02/24 Using a character proposal for a 'repertoire fence' extension

2005/12/07 Some sorts resist the future

2005/05/12 Thinking beyond the BMP of Unicode

2005/05/10 A better question -- what is the performance, Everett vs. Whidbey?

2005/05/10 More on 'repetoire fences'
