by Michael S. Kaplan, published on 2005/05/10 12:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/10/416008.aspx
Raymond Chen is going to be developing a Chinese dictionary over the next little while. This is a really cool project that I am definitely keeping an eye on, for a lot of reasons, one of which relates to my prior blather about how IMEs have it easy. Dictionaries face some of the same conundrums we do in collation, depending on what they do with words that use the same ideograph but have different pronunciations....
Anyway, in today's post, Loading the dictionary, part 1: Starting point, there was one bit that jumped out at me:
Since it was the Big5 dictionary we downloaded, the Chinese characters are in Big5 format, known to Windows as code page 950. Our program will be Unicode, so we'll have to convert it as we load the dictionary. Yes, I could've used the Unicode version of the dictionary, but it so happens that when I set out to write this program, there was no Unicode version available. Fortunately, this oversight opened up the opportunity to illustrate some other programming decisions and techniques.
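On Windows, the conversion Raymond describes would go through MultiByteToWideChar with code page 950. Here is a minimal sketch of the same idea in Python, with the built-in "big5" codec standing in for the Win32 call (an illustration only, not Raymond's actual loading code):

```python
# Decode Big5-encoded dictionary data into Unicode strings.
# Python's built-in "big5" codec plays the role of Windows
# code page 950 / MultiByteToWideChar in this sketch.

def load_big5_lines(raw: bytes) -> list[str]:
    """Decode raw Big5 bytes and split the result into lines."""
    text = raw.decode("big5")  # raises UnicodeDecodeError on invalid bytes
    return text.splitlines()

# "字典" (U+5B57 U+5178) survives the round trip through Big5:
sample = "字典 dictionary".encode("big5")
print(load_big5_lines(sample))  # ['字典 dictionary']
```

In real Win32 code you would of course call MultiByteToWideChar(950, ...) on the raw buffer instead, but the shape of the operation is the same: bytes in the legacy code page go in, Unicode comes out.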
Now this is very interesting, but it made me think about all of the work the government in the PRC has been doing to provide pronunciation data for ideographs (by some reports over 60,000 of the more than 70,000 ideographs in Unicode/GB18030 -- many of them actually traditional rather than simplified!). It left me wondering how well the Big5 code page can really handle Chinese, since it contains mappings for fewer than 20,000 ideographs.
As I said in this post, almost every language needs Unicode these days for full representation. But that assumes data suitable for a dictionary is available. Is it? That is something I do not know.
Now a dictionary is perhaps in a special category since by its very nature it does not need to be limited to the commonly known ideographs -- in theory, it may have data well beyond what people may commonly know. But what does that mean in practice? If an English dictionary contained many Latin script characters I did not recognize, what would I do? I'd probably buy a different dictionary.
How does one draw the line when there are that many ideographs to deal with? For everyday usage, one may well end up with a set not much bigger than the one the Big5 data provides.
Can code pages in some cases be used as 'repertoire fences' that help keep us inside the list of typically used characters, even if we are using Unicode for the actual work? Such an architecture would allow the flexibility to add words outside of the code page when you need to, something that could be crucial depending on what you are doing....
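The fence idea can be sketched quite simply: keep all of the data in Unicode, but use the legacy code page purely as a membership test for the "typically used" set. Again in Python, with the "big5" codec standing in for code page 950 (on Windows you might instead probe WideCharToMultiByte and check its default-character flag; this is just an illustration):

```python
# A 'repertoire fence' sketch: the data itself stays Unicode, but a
# legacy code page (Big5 here) acts as a filter for the set of
# commonly used characters. Anything outside the code page is still
# allowed -- it is just flagged as being outside the fence.

def outside_fence(text: str, codec: str = "big5") -> list[str]:
    """Return the characters of `text` the code page cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

print(outside_fence("字典"))           # [] -- both ideographs are in Big5
print(outside_fence("字\U00020000"))   # U+20000 lies beyond Big5's repertoire
```

An application could warn on, or specially mark, entries containing fenced-out characters while still storing and sorting everything as Unicode.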
This post brought to you by "字典" (U+5b57 U+5178, which I believe means 'Dictionary')
referenced by
2007/02/28 What do they mean when they say 'GB18030 Characters' ?
2007/02/24 Using a character proposal for a 'repertoire fence' extension
2005/12/07 Some sorts resist the future
2005/05/12 Thinking beyond the BMP of Unicode
2005/05/10 A better question -- what is the performance, Everett vs. Whidbey?
2005/05/10 More on 'repetoire fences'