The Cantonese IME (not for input of characters from Canton, Ohio)

by Michael S. Kaplan, published on 2006/07/27 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/07/27/679538.aspx


Last month I was talking about how Feature ideas don't always turn out to be good ones. And I mentioned how I'd probably talk about other cases in the future.

What can I say besides welcome to the future. :-)

In Vista, from the time when it was just Longhorn, there has been enhanced collation support for all of the CJK locales. The stroke count sorts and Mandarin pronunciation (both Pinyin and Bopomofo) sorts all covered more characters, the Korean Hangul pronunciation sort was enhanced too, and the Japanese locale got a new alternate sort to cover everything in JIS X 0213. Basically a lot of work was done.

But there was one area that was not covered that was really bothering me -- there was no support for a Cantonese sort of any kind.

"But isn't Cantonese," you might ask, "a spoken dialect, not a written one?"

The Wikipedia article Written Cantonese gives a good answer to this question in its introduction:

Written Cantonese refers to the written language used to write colloquial standard Cantonese using Chinese characters.

Cantonese is usually referred to as a spoken variant, and not as a written variant. Spoken vernacular Cantonese is different from standard written Chinese, which is essentially formal Standard Mandarin in written form. Written Chinese spoken word for word in Cantonese sounds overly formal and distant. As a result, the necessity of having a written script which matched the spoken language increased over time. This resulted in the formation of additional Chinese characters to complement the existing characters. Many of these represent phonological sounds not present in Mandarin. A good source for well documented written Cantonese words can be found in the scripts for Cantonese drama and Cantonese opera.

With the advent of the computer and standardization of character sets specifically for Cantonese, many printed materials in predominantly Cantonese spoken areas of the world are written to cater to their population with these written Cantonese characters. As a result, mainstream media such as newspapers and magazines have become progressively less conservative and more colloquial in their dissemination of ideas. Generally speaking, some of the older generation of Cantonese speakers regard this trend as a step "backwards" and away from tradition. This tension between the "old" and "new" is a reflection of a transition that is taking place in the Cantonese speaking population.

And if you look at the major population centers with people who use Cantonese, there are clear efforts to support this development among many of the native speakers (and writers) of Cantonese.

There are some cultural issues that even I was faced with when doing research here that I will discuss further in a follow-up post....

Of course one of the big problems has been that there are multiple romanizations used to represent the pronunciations, and unfortunately they are often used in the same lists (like phonebooks in Macau and elsewhere that allow people to simply enter the pronunciation). How can you hope to sort the phone book consistently if the people providing the pronunciations have different ideas of how even identical pronunciations are to be represented?

But lots of work has been done to try to help with this issue, for example the Jyutping system produced by the Linguistic Society of Hong Kong (LSHK). And many people have been trying to use it -- for example the government of the Hong Kong SAR's Chinese Language Interface Advisory Committee (CLIAC) has produced the Cantonese Pronunciation List of the Characters for Computers, a huge set of data providing Cantonese "Pinyin-esque" style pronunciations for much of the Hong Kong Supplemental Character Set (HKSCS).

When I first saw that we would have a list of over 30,000 ideographs and their pronunciations, I was excited -- perhaps this data could be used to provide a Cantonese sort for the people in Hong Kong and elsewhere who wanted it?

But unfortunately, while there is much that is good about Jyutping, it has one liability at present, one that it shares with Yale and other romanization systems: it is only one of several romanization systems in use, and there is not yet one that is ubiquitous.

Another problem is that for the 30,764 unique ideographs given pronunciations in the CLIAC-provided doc, there are fewer than 2,000 unique pronunciations (fewer than 700 if you do not include the tone values).
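To make the ratio concrete, here is a minimal sketch of how one might count unique readings with and without tones. The nine-character sample and the names `SAMPLE`, `strip_tone` and `count_unique` are illustrative assumptions; this is not the CLIAC data and not its file format.

```python
import re

# Tiny illustrative sample (NOT the CLIAC data): character -> Jyutping reading.
SAMPLE = {
    "識": "sik1",
    "色": "sik1",
    "食": "sik6",
    "詩": "si1",
    "史": "si2",
    "試": "si3",
    "時": "si4",
    "市": "si5",
    "事": "si6",
}


def strip_tone(reading: str) -> str:
    """Drop the trailing tone digit(s) from a reading such as 'sik6'."""
    return re.sub(r"\d+$", "", reading)


def count_unique(readings):
    """Return (unique readings with tone, unique readings without tone)."""
    with_tone = set(readings)
    without_tone = {strip_tone(r) for r in with_tone}
    return len(with_tone), len(without_tone)


if __name__ == "__main__":
    toned, toneless = count_unique(SAMPLE.values())
    print(f"{len(SAMPLE)} characters, {toned} toned readings, {toneless} toneless")
```

Even in this toy sample nine characters collapse to two toneless syllables, which is the same effect (on a much smaller scale) that leaves so many ideographs tied under any pronunciation-based sort.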

And yet another problem is in the decision about tones -- some number the tones in Cantonese at nine, while others claim that three of these are unimportant distinctions and that there are only six to worry about. So it is not just different romanization systems, which vary enough with place names like Canton and Guangzhou coming from the same word, but even if people agree on the romanization they may differ in their opinion of the tones (with some believing that tones 7, 8, and 9 actually fold into 1, 3, and 6 respectively).
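The nine-versus-six question can be sketched as a small normalization step. This assumes readings are written as a lowercase syllable followed by a single tone digit, and that tones 7, 8 and 9 fold into 1, 3 and 6 as described above; `FOLD` and `fold_tone` are hypothetical names, not part of any existing tool.

```python
# Assumed folding from the nine-tone numbering to the six-tone numbering:
# 7 -> 1, 8 -> 3, 9 -> 6. Tones 1-6 are left untouched.
FOLD = {"7": "1", "8": "3", "9": "6"}


def fold_tone(reading: str) -> str:
    """Rewrite a nine-tone reading (e.g. 'sap9') in six-tone form ('sap6')."""
    if reading and reading[-1] in FOLD:
        return reading[:-1] + FOLD[reading[-1]]
    return reading


if __name__ == "__main__":
    for r in ["jat7", "baat8", "sap9", "si4"]:
        print(r, "->", fold_tone(r))
```

Two data sources that disagree only on this point would compare equal after folding, though whether folding is the right choice is exactly the kind of policy question that has no consensus yet.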

And the final problem: there is not yet a clear and established standard on how to break ties. Once you decide which Han have the same pronunciation, how do you decide which one comes first?
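To show where the tie-break enters, here is a sketch of a pronunciation sort key. The tie-break used (Unicode code point) is purely an assumption chosen because it is easy to compute; it is not an established standard and not what any Windows sort does. `split_reading` and `sort_key` are hypothetical names.

```python
import re


def split_reading(reading: str):
    """Split 'sik6' into ('sik', 6); reject anything not shaped like that."""
    match = re.fullmatch(r"([a-z]+)([1-9])", reading)
    if match is None:
        raise ValueError(f"unexpected reading: {reading!r}")
    return match.group(1), int(match.group(2))


def sort_key(entry):
    """Key for a (character, reading) pair: syllable, then tone, then tie-break."""
    char, reading = entry
    syllable, tone = split_reading(reading)
    # ASSUMPTION: ties are broken by code point. Stroke count or frequency
    # would be equally defensible choices; none is standardized.
    return (syllable, tone, ord(char))


if __name__ == "__main__":
    entries = [("食", "sik6"), ("識", "sik1"), ("色", "sik1"), ("詩", "si1")]
    for char, reading in sorted(entries, key=sort_key):
        print(char, reading)
```

Swapping the third element of the key is all it takes to change the tie-break, which is also why two implementations can easily disagree on the order of homophones.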

There was just not enough of a consensus yet to try to push ahead in Windows with providing such a sort. Microsoft has no interest in dictating language policy; we just want to identify it so that we can represent things the way customers would like them.

But this now brings us to input methods.

Like I said way back in December of 2004, IMEs have it easy. In this case that is because (if for no other reason) when you identify a rich new source of pronunciations, you can simply add them to the IME if you like them. Or you can provide different IMEs using the different systems, too (assuming you have enough data!).
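Adding a new source of pronunciations amounts to merging lookup tables. The sketch below uses plain in-memory dictionaries (reading to list of candidate characters); it makes no claim about the actual TableTextService file format, and `merge_tables` is a hypothetical helper. It also assumes both sources already use the same romanization.

```python
def merge_tables(*tables):
    """Merge reading -> [candidates] tables, keeping first-seen order, no duplicates."""
    merged = {}
    for table in tables:
        for reading, chars in table.items():
            bucket = merged.setdefault(reading, [])
            for ch in chars:
                if ch not in bucket:
                    bucket.append(ch)
    return merged


if __name__ == "__main__":
    base = {"sik1": ["識"], "hou2": ["好"]}
    extra = {"sik1": ["色", "識"], "gong2": ["港"]}
    print(merge_tables(base, extra))
```

An input method can live with the extra candidates this produces because the user picks from the list; a sort cannot, since it has to commit to one order.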

Anyway, enough of the backstory, right? Let's get to the IME, like I said I would!

The steps are the same as they were with the Unicode IME. Just grab the file from here (871 kb) or you can grab the zipped version here (144 kb).

1) Copy the text file to \Program Files\Windows NT\TableTextService on your Vista machine (if the "Program Files" on your machine is another language, use that directory, do not create a new one!).

2) Open an elevated command prompt and navigate to that directory.

3) Run the following from that command prompt:

rundll32 TableTextService.dll RegisterProfile TableTextServiceCantonese.txt

4) Say OK to the dialog that comes up verifying you want to install it.

You can now add the Chinese Hong Kong Cantonese IME to the Chinese (Hong Kong S.A.R.) locale by going through the following steps that are illustrated here.
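As a rough sketch of steps 2 and 3, the helper below only builds the directory and command line to run; it does not execute anything, since the real command needs an elevated prompt on Vista. The `C:` drive and the English "Program Files" folder name are assumptions, and the function names are hypothetical. The second "Program Files (x86)" pass on 64-bit systems, the case-sensitive `RegisterProfile` keyword and the `UnregisterProfile` variant are taken from the comment thread on this post.

```python
from pathlib import PureWindowsPath

TABLE_FILE = "TableTextServiceCantonese.txt"


def tts_directories(is_64bit_windows: bool, system_drive: str = "C:"):
    """Directories where the table file must be copied and registered."""
    roots = ["Program Files"]  # ASSUMPTION: English folder name
    if is_64bit_windows:
        roots.append("Program Files (x86)")
    return [
        PureWindowsPath(system_drive + "\\", root, "Windows NT", "TableTextService")
        for root in roots
    ]


def build_commands(is_64bit_windows=False, unregister=False, system_drive="C:"):
    """Return (working_directory, argv) pairs; nothing is executed here."""
    # The keyword must keep this exact capitalization.
    verb = "UnregisterProfile" if unregister else "RegisterProfile"
    argv = ["rundll32", "TableTextService.dll", verb, TABLE_FILE]
    return [
        (str(directory), list(argv))
        for directory in tts_directories(is_64bit_windows, system_drive)
    ]


if __name__ == "__main__":
    for directory, argv in build_commands(is_64bit_windows=True):
        print(f'cd /d "{directory}"')
        print(" ".join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try on any machine; the output can then be pasted into an elevated command prompt.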

Now like the Unicode IME this is a sample, and further this is a work in progress. There are lots of things I would like to do to tweak settings here, such as how (or whether) the list should be sorted.

(And if I find other huge caches of Cantonese pronunciations in other romanizations I might even see whether they could be productively combined.)

And like I said, in an upcoming post I will talk about many of the cultural issues I ran across while doing the research here -- they are fascinating!

 

This post brought to you by 䕫 (U+2f9b2, an Extension B ideograph in HKSCS with a Jyutping pronunciation of kwai4)


# SDiZ on Thursday, July 27, 2006 5:08 AM:

> Canton and Guangzhou

Canton is "廣東", while "Guangzhou" is "廣州".
They are not the same.

# b6s on Saturday, July 29, 2006 8:40 PM:

OpenVanilla (http://openvanilla.org/) and other input methods that support .cin table already have a Jyutping input method. OpenVanilla even has a Win32 beta that supports intelligent Jyutping to select proper homophones automatically.

# Michael S. Kaplan on Sunday, July 30, 2006 11:12 AM:

I always suspected that there are indeed methods to grab the proper ideograph here -- so it is very cool to know that there is progress here. :-)

# b6s on Friday, August 04, 2006 10:38 AM:

Dear Mr. Kaplan,
Thank you. :)
Although we are still using old IMM32 API, not new TSM one, there's always a chance getting better.
For development notes, documents, and the most important, examples, we currently have only Traditional Chinese version here: http://svn.openfoundry.org/openvanilla/trunk/ , so it's a pleasure to read your blog entry about Cantonese input method development in English.

# jenbsookjen@yahoo.com on Thursday, August 17, 2006 10:45 AM:

Hi, I am looking for a Cantonese IME that I can use and came across this article. How does it work? And also, is this only work for Windows NT since I have windows XP

Su

# Michael S. Kaplan on Thursday, August 17, 2006 1:10 PM:

It only works with Vista, at the moment....

# Andy 美國土子 on Thursday, October 05, 2006 4:40 PM:

The wikipedia bit your pulled is correct, but not properly applied.

Mandarin is also a dialect too, as well Cantonese. However, Mandarin is the now the national dialect. what does differ is the grammar between the two dialects. You can use Cantonese to read, speak, and write colloquial vernacular cantonese, Standard Chinese and standard Formal Cantonese.

but just because Written Cantonese refers to the vernacular Cantonese, it doesn't mean that one would not use a Cantonese input system to write in Standard Chinese. Chinese characters are chinese characters regardless of dialect.

There are even differences between Standard Chinese and vernacular Mandarin, but the differences are not as dramatic as in Cantonese depending on which type of cantonese yo are using at the moment.

# Ksec on Wednesday, October 18, 2006 1:05 PM:

I dont know if there is a bug or not. But your method described to register a new IME does not work on my Vista RC2.

And i am wondering if there are anymore info on those variable inside TableTextService. A search on google does not return anything useful.

Thanks in advance

# David Oftedal on Friday, January 11, 2008 9:47 AM:

Ksec, are you absolutely sure that you're using an elevated command prompt (That is one with admin privileges) and not a regular one? If not, right-click on the command prompt and choose to run it with admin privileges. That solved it for me.

# John Cowan on Monday, January 21, 2008 2:34 PM:

The (simplified) story with the tones is this:

Middle Chinese had four tones, conventionally named ping, shang, qu, ru.  Ru tone was used only for syllables that ended in a stop: p, t, or k.

In each of the modern dialects, one or more of these tones split into two tones, conventionally called the yang and the yin varieties, mostly on the basis of whether the original syllable began with a voiced stop, r, or l (yin) or not (yang).  Voiced stops have been lost in the Chinese languages (other than Shanghainese, which is radically different and may not be a tone language any more).

In Mandarin, ping tone split into modern tones 1 (yang) and 2 (yin), shang tone became modern tone 3, and qu tone became modern tone 4.  Ru tone disappeared when Mandarin lost all final stops, and the syllables were redistributed among the other tones.

In Cantonese, the story is way more complicated.  All four of the old tones split, and what's more, ru tone split *twice*.  Consequently, ping tone became modern tones 1 (yang) and 4 (yin), shang tone became modern tones 2 (yang) and 5 (yin), qu tone became modern tones 3 (yang) and 6 (yin), and ru tone became modern tones 7 and 8 (yang) and 9 (yin).  That's your nine tones, which represent a full structural analysis.

However, except for the final stop making the syllable shorter, tones 7, 8, and 9 are pronounced exactly like tones 1, 3, and 6 (some older speakers still pronounce 1 falling and 7 level, but for most they are both level now).   This is why on the phonetic level the nine structural tones are reduced to just six.

In addition, there are two "changed tones" which have no counterparts in the other dialects, and which signal a variant meaning of the basic word (unlike other tones, which have no individual meanings any more than a vowel or a consonant has).  So structurally there are really 11 tones, but the changed tones are phonetically just lengthened versions of 1 (or 7) and 2, leaving us once more with six.

# Mui on Tuesday, January 22, 2008 12:12 AM:

I tried it but I got an DLL error "Error loading TableTextService.dll The specific module could not be found."  How do I get the TableTextService.dll?

# Michael S. Kaplan on Tuesday, January 22, 2008 3:03 PM:

In Vista? It is built-in. Pre-Vista this solution will not work.

# Mui on Monday, January 28, 2008 12:17 AM:

I am already using Vista but still getting the "Error loading..." message.  Is there something missing in my Vista?

# Michael S. Kaplan on Monday, January 28, 2008 9:38 AM:

It should be in the same directory where the instructions ask you to place the text file on your machine. More likely you have placed the file in the wrong place....

# Chien on Wednesday, February 06, 2008 1:12 PM:

I have the same problem as the above.

# Michael S. Kaplan on Wednesday, February 06, 2008 2:40 PM:

See above -- be in an elevated command prompt, move to the directory indicated which has the DLL.

# Andy on Monday, February 11, 2008 2:06 AM:

I was able to register but then when I go down to add the keyboard, I can't find the Chinese HK SAR under the Chinese HK SAR selection, it only give me the traditonal and simplify layout, any ideas?  

Thanks.

# Ricki on Thursday, March 06, 2008 4:23 PM:

it only work in notepad and nths else anythings Does it only work like that or i did somethings missing  

i'm using vista 64bit

# Michael S. Kaplan on Thursday, March 06, 2008 4:57 PM:

I don't know where it is failing for you, so I probably cannot comment meaningfully. But you could try one of the other text-based TIPs to see if these apps fail to accept any like input?

# Ricki on Friday, March 07, 2008 12:08 PM:

Oh I got it all sort out now  I didn't put it in Program Files (x86) folder I have to manual install it two time one is 64bit and one is 86    

It works in IE now and notepad

Thanks for this  blogs none of any canton input work on 64bit

# Michael S. Kaplan on Friday, March 07, 2008 1:07 PM:

Aha, that explains it!

Probably worthy of its own blog post to mention this explicitly....

# Kitty on Friday, March 14, 2008 9:34 PM:

is the phonetic cantonese ime available for xp?

# Michael S. Kaplan on Friday, March 14, 2008 10:22 PM:

The technology I used is Vista and above, only....

# Steph on Sunday, April 13, 2008 8:55 PM:

Hey, I tried this on my Vista, but it doesn't seem to work, do I have to restart the computer?

Thanks,

# Gabriel on Wednesday, April 23, 2008 8:55 AM:

Hi, thank you for your method. but unfortunately it doesn't work with my Windows Vista Ultimate. Try install and run 2 times, restarted.....still no good to show up in the list. Please advice a suggestion to fix!. great thanks

# Michael S. Kaplan on Wednesday, April 23, 2008 10:57 AM:

You didn't say anything about what happened, so there is no way to advise just yet.

The most common sources of failure are wrong directory (not with the others) and not being in an elevated command prompt.

# Gabriel on Thursday, April 24, 2008 6:10 AM:

All the above steps has been followed and nothing show up incorrect. but once I open the "Add Input Language" box, and went to Hong Kong SAR, there are only US comes up. Already try to restart computer or even re-try the whole installation process, still no good for 2 times. Any idea why would this happen? Greatly appreciate for your help.

OS: Windows Vista Ultimate

# Michael S. Kaplan on Thursday, April 24, 2008 8:25 AM:

You are 100% positive that you are running in an elevated command prompt?

Are you running on 64-bit? If so you have to register it twice -- once for 32-bit and once for 64-bit (look up a few comments further).

# xx on Wednesday, July 02, 2008 1:25 AM:

I was getting the error message too, despite having done (almost) everything correct.

Turns out my error was from entering the rundll.. command in lower case letters. I don't know how to copy and paste into the command prompt, so I manually typed in "rundll32 tabletextservice.dll registerprofile tabletextservicecantonese.txt" instead of "rundll32 TableTextService.dll RegisterProfile TableTextServiceCantonese.txt". Command prompt is case sensitive (never knew that..)

# Michael S. Kaplan on Wednesday, July 02, 2008 1:43 AM:

Just the "RegisterProfile" keyword -- it is case sensitive in the DLL's registration code.

# Poshi on Sunday, July 20, 2008 8:31 PM:

I followed your instruction and inserted this entry into the registry. However, I don't see this entry under Chinese Traditional (HK).  Do you have any recommendations? Thanks

# Michael S. Kaplan on Sunday, July 20, 2008 9:05 PM:

What registry entry are you talking about? I did not mention any registry entries....

# Leroy Vargas on Sunday, August 17, 2008 8:02 PM:

This worked for me:

Right-click on Command Prompt and choose to run as Administrator.

On the command prompt, first type:

cd \"Program Files\Windows NT\TableTextService"

then you can run that rundll32 command to register the IME.

If using Vista x64, repeat the same but with "Program Files (x86)" instead of "Program Files".

# Kei on Tuesday, September 02, 2008 6:05 PM:

Does it work in all versions of Vista? I got home premium addition and i've successfully installed it, but it doesnt show up on my add input language. Any Suggestions?

# Michael S. Kaplan on Tuesday, September 02, 2008 7:06 PM:

Yes, it works in all versions.

Are you running on x64/IA64? For those, you have to run it for both 32-bit and 64-bit, as Leroy Vargas indicated.

If not then other common mistakes include not running from an elevated command prompt...

# Lenny Li on Monday, November 03, 2008 6:57 AM:

i appreciate the people who put this website together to offer hope of cantonese input in vista since the previous working ones that i used are broken in vista.

however, i followed the installation instructions and tested inputting a few common chinese characters, and run into problem

for instance, i cannot input the character lee by typing lee (whereas it used to work with the xp version of input or if you test it on cantoneseinput.com)

so is this vista version still work in progress? i cannot count on something that works partially......i want to have the original cantonese input that used to work in xp to have it on vista.

# Michael S. Kaplan on Monday, November 03, 2008 10:05 AM:

One good thing about the format of TableTextServiceCantonese.txt being a text file is that you can see the whole thing and all of the entries it has. This will give you insight into what pronunciation scheme was used for the data, which can obviously be incomplete.

The IME as it stands is incomplete because I don't currently have another source of data I can easily use to add information to it, but if and when I do get more, I do plan to update it (plus someone can update it themselves if they have a source of data, whether or not they choose to offer the data to me as well).

# Kelvin on Sunday, March 15, 2009 7:32 PM:

How do I remove this IME? because I no longer need this

# Michael S. Kaplan on Monday, March 16, 2009 7:57 PM:

Kelvin,

I can't be an email support line, sorry (referring to your other message).

You can unregister the TIP with an almost identical command line, just replace RegisterProfile with UnregisterProfile.


referenced by

2013/11/13 Mandarin vs. Cantonese

2011/03/15 Making TableTextService work for both 32-bit and 64-bit on a 64-bit Windows...

2008/06/21 Back to Sri Lanka (conceptually)

2008/03/17 If we sorted Bopomofo like we do Pinyin, would it still be considered "Traditional" Chinese?

2008/02/23 The triage process gives me hives

2008/01/21 Behold the Table Driven Text Service, Part 0 (You have to start somewhere!)

2007/02/04 So how does that Naqittaut keyboard work, exactly?

2006/09/17 And we are the knights who say நீ (NII)

2006/08/18 Creation of transliterating input methods
