Extending the MS Transliteration Utility

by Michael S. Kaplan, published on 2006/08/19 06:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/19/707013.aspx


Regular reader KJK:Hyperion asked in the Suggestion Box:

...when will Transliteration Utility support Romaji and Hiragana transliteration for Japanese? That's basically the only one I need. At the moment I use http://www.j-talk.com/nihongo/ but I'd prefer an off-line tool.

The tool that he is referring to is the Microsoft Transliteration Utility v1.0, which Thierry Fontenelle talks about in English here (and in French here).

I happened to be in an email thread with one of the authors of the tool (Nick Cipollone) and Thierry, and I figured I'd ask them this question. :-)

And Nick gave me the scoop:

Our basic strategy with Transliteration Utility was just to get the thing out the door with a few representative types of modules that people could use as models to create their own.  The only modules that were specifically requested by anyone were the Inuktitut Syllabary <-> Romanization modules (requested by the Canadian sub), the rest were basically things we had lying around.

We had intended to put out “module expansion packs” every now and then, once we had enough new modules to justify it.  We haven’t developed any new ones for public consumption since Transliteration Utility shipped in January, though.  We also hoped as a stretch goal that individuals or companies other than Microsoft might eventually provide module expansion packs, although this hasn’t happened to our knowledge yet either.

Well, that sounds like a call to arms for me, what do all of you think? :-)

The tool itself is a pretty cool thing, and it may be worth looking into building a new transliteration module in its Module Development Console.

The text in the Module Development Console lays out what is involved, and it looks pretty straightforward (all you would need is good knowledge of the languages and the transliteration in question to fill it in!):

[Input]
// Insert a several-word description of the module's input.
// For example:
//     Romanization

[Output]
// Insert a several-word description of the module's output.
// For example:
//     Cyrillic

[Description]
// Give a several-sentence description of the module.

[Preprocess]
// If you need to preprocess your input before applying
// rules specify the procedure here. 
// For example:
//     ToLower
//     ToUpper(tr-TR)

[States]
// If you need any states other than the two predefined ones
// (START and DEFAULT) then declare their names here. 
// For example:
//     CONSONANT
//     VOWEL

[FollowingContextMacros]
// Insert any following context macro definitions here.
// For example:
//     Cons        b c d f g h j k l m n p q r s t v w x y z
//     ConsOrEnd   <END> :Cons:
//     Vowel       a e i o u
//     VowelAtEnd  a<END> e<END> i<END> o<END> u<END>

[EscapeSpanDelimiters]
// If you need to be able to prevent spans of the input
// from being processed you can specify one pair of strings
// to indicate the beginning and end of such escaped spans.  
// For example:
//     {   }
//     /*  */

[Rules]
// List your rules here.  For example:
//             a          --> x
//             a(<END>)   --> y
//     [START] fa         --> z [VOWEL]
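
To get a feel for what rules like the ones in the [Rules] section do, here is a toy sketch in Python. It is not how Transliteration Utility works internally and it does not read the module format; the rule table, the `transliterate` function, and the tie-breaking choice (longest match wins, and a rule with a following context beats one without) are all assumptions made for illustration. Reading `a(<END>)` as "a at the end of the input" is likewise a guess from the template text, and states such as [START] are left out entirely.

```python
END = "<END>"

# Hypothetical rule table: (match, following_context or None, output).
# These entries are invented for the example, not taken from a real module.
RULES = [
    ("a", END, "y"),     # "a" only when it is the last thing in the input
    ("a", None, "x"),    # "a" anywhere else
    ("sh", None, "š"),   # a two-character match
]


def transliterate(text, rules=RULES):
    """Rewrite text left to right using the best-matching rule at each position."""
    out = []
    i = 0
    while i < len(text):
        best = None
        for match, context, output in rules:
            if not text.startswith(match, i):
                continue
            after = i + len(match)
            if context == END:
                if after != len(text):
                    continue
            elif context is not None and not text.startswith(context, after):
                continue
            # Assumed priority: longer match first, then rules that have a context.
            key = (len(match), context is not None)
            if best is None or key > best[0]:
                best = (key, match, output)
        if best is None:
            out.append(text[i])  # no rule applies: copy the character through
            i += 1
        else:
            out.append(best[2])
            i += len(best[1])
    return "".join(out)


print(transliterate("banana"))
```

With this made-up table, the final "a" of the input is handled by the `<END>` rule and the other two by the context-free one.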

Anyone want to give it a shot? :-)

 

This post brought to you by ぱ (U+3071, a.k.a. HIRAGANA LETTER PA)


# Nektar on 19 Aug 2006 1:23 PM:

Can it be used to transform Greek to Greenglish (Greek written with the Latin alphabet) and vice versa? The problem with Greek is that many Greek vowels or vowel combinations might be written in Greenglish with only a single representation, and thus when Greenglish is transformed back to Greek the spelling might be wrong.
Can Transliteration Utility help?

# Michael S. Kaplan on 19 Aug 2006 1:40 PM:

Well, most likely yes. It may be worth trying the definition here, I think. :-)

The key (like in similar cases between Traditional and Simplified Chinese) is that although you can map multiple forms to a single form when going in one direction, when going in the other direction you can only choose one (unless there is a way to define more complex rules with surrounding text).
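
A small Python sketch of that point, under loudly stated assumptions: the five-letter table below is a simplified, hypothetical phonetic Greek-to-Latin mapping (not a real scheme and not anything shipped with the tool), and `to_latin` and `invert_candidates` are names invented here. It only shows that inverting a many-to-one table yields several candidates per Latin letter, so a context-free reverse module has to pick one.

```python
# Hypothetical, deliberately tiny phonetic mapping: several Greek vowels
# collapse onto the same Latin letter.
GREEK_TO_LATIN = {"η": "i", "ι": "i", "υ": "i", "ο": "o", "ω": "o"}


def to_latin(text):
    """Forward direction: every Greek letter has exactly one output."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)


def invert_candidates(mapping):
    """Reverse direction: collect every source letter that produced each output."""
    inverse = {}
    for src, dst in mapping.items():
        inverse.setdefault(dst, []).append(src)
    return inverse


candidates = invert_candidates(GREEK_TO_LATIN)
print(candidates["i"])  # three possible Greek spellings for one Latin letter
```

Choosing among the candidates would need surrounding context or a dictionary, which is exactly the part a simple one-to-one table cannot supply.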

# dennispg on 20 Aug 2006 7:52 AM:

I posted a comment over in the original post about this utility too.

Where are the modules that are built in to the utility located? It would sure be nice to be able to use one of those as an example to work from...

# Michael S. Kaplan on 20 Aug 2006 9:45 AM:

I believe they are built in. But I'll see if I can get one of them or something else that could act as a sample....

# Nick Cipollone on 20 Aug 2006 2:04 PM:

The modules that ship with the tool live in \Program Files\Common Files\Transliteration\Modules\Microsoft\*.tms (with help files having matching names with extension *.htm).  3rd parties can create their own system-level expansion packs by putting *.tms (and, optionally, *.htm) files in a sister directory.  Transliteration Utility will pick them up on the next launch; you don't need to do anything to register them.  The name of the directory is considered the publisher and is displayed in parentheses after the module name in Transliteration Utility.  (E.g., "Hiragana to Romaji Scheme IV (Translitcorp)".)
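
For anyone who wants to check what a Modules folder would contribute before launching the tool, here is a rough Python sketch of the layout described above. It is only an illustration: `list_modules` is a made-up helper, and using the file name (minus the extension) as the module name is an assumption, since the displayed name may well come from somewhere else.

```python
from pathlib import Path


def list_modules(modules_root):
    """Return (label, has_help_file) pairs for every *.tms file found.

    Each subdirectory of modules_root is treated as a publisher, following
    the layout described in the comment above. The label format
    "name (publisher)" mirrors the example given there; taking the name
    from the file stem is an assumption of this sketch.
    """
    results = []
    root = Path(modules_root)
    for publisher_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for tms in sorted(publisher_dir.glob("*.tms")):
            help_file = tms.with_suffix(".htm")  # optional matching help file
            label = f"{tms.stem} ({publisher_dir.name})"
            results.append((label, help_file.exists()))
    return results
```

Pointing it at the Modules directory mentioned above would list one entry per module, grouped by publisher directory.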

When you import a single module from the Options | Manage Modules... dialog it goes into your private module library, which lives in Documents and Settings\<USER>\Application Data\Transliteration.  

Among the shipping modules, the Serbian & Bosnian modules are the simplest and most nearly one-to-one/context-free.  The Inuktitut modules are also simple but take context into account in a pretty trivial way.  The Malayalam modules are much more complex and illustrate how to use more sophisticated contextual constraints.  The last shipping module, the Hangul --> Romanization module, just demonstrates the built-in "HangulLinearization" feature that allows you to bypass the authoring of individual rules when deconstructing precomposed Hangul syllables.
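
To show why per-syllable rules are unnecessary for precomposed Hangul, here is a Python sketch of the standard Unicode arithmetic decomposition of a syllable into its conjoining jamo. Whether "HangulLinearization" works this way internally is not stated anywhere above, so treat this as an illustration of the general technique only; the function name `decompose_hangul` is invented for the example.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total


def decompose_hangul(text):
    """Replace each precomposed Hangul syllable with its sequence of jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if not 0 <= s_index < S_COUNT:
            out.append(ch)  # not a precomposed syllable: leave untouched
            continue
        out.append(chr(L_BASE + s_index // N_COUNT))              # leading consonant
        out.append(chr(V_BASE + (s_index % N_COUNT) // T_COUNT))  # vowel
        t_index = s_index % T_COUNT
        if t_index:                                               # optional trailing consonant
            out.append(chr(T_BASE + t_index))
    return "".join(out)


print([hex(ord(c)) for c in decompose_hangul("한")])
```

Once the syllables are broken into jamo like this, a romanization only needs rules for a few dozen jamo rather than for thousands of syllables.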

All aspects of the rule system are described in detail in the Module Development Console help file.  (This is a separate help file from the main Transliteration Utility help file, which just describes how to use modules, not how to create them.)

# Michael S. Kaplan on 20 Aug 2006 9:02 PM:

I stand corrected -- there they are!

(And a lot of good info from the tool's creator, too)

# Patrick Hall on 22 Aug 2006 2:37 PM:

This looks really cool. I've been working on a similar thing in Javascript (for situations where the user can't install keyboards, etc).

The idea of a transliteration "rule language" is very appealing to me. I think it would be really awesome if there were some way to help standardize such a language in such a way that the rules could be shared between applications. (XML, maybe?)

The only other tool I know of that has extensive transliteration support is (I think) IBM's ICU stuff, but I don't know if there's a language definition in there or not.

# Michael S. Kaplan on 22 Aug 2006 2:46 PM:

I can't speak with knowledge about ICU's efforts here, as all I know about it is what has been presented from time to time at Unicode conferences. Perhaps others can do more along the lines of comparing the two as frameworks for doing transliteration work?

# Jonathan T. Capes on 23 Oct 2006 2:53 PM:

I happened upon the transliteration tool and after finding Nick's post above, I made my own module to convert Uzbek Cyrillic to the Roman alphabet Uzbekistan adopted in 1995, which should have become the official standard in 2005.

I haven't yet looked at creating a module for Roman --> Cyrillic.  I know that there would be some fairly major issues going in that direction as Cyrillic --> Roman was not a lossless process.

I would be more than happy to provide the module to anyone who could use it.  I also created a much more friendly Uzbek keyboard layout, using QWERTY as a basis, making it much more intuitive for QWERTY users to input Uzbek in Cyrillic.

I'll check back here or you can email me at

capes at u dot washington dot edu


