Machine translation is not EASY

by Michael S. Kaplan, published on 2004/12/04 03:10 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/04/274926.aspx


The title is right -- machine translation is not easy.

There are a lot of reasons for this, some of which you may already know....

There is the obvious problem -- how do you make a machine understand language? Most of us cannot really describe language ourselves, even though we usually understand it. Being a linguist does not necessarily make the problem easier; it just means that you probably have a better handle on its scope.

This description seems a little too simplistic, though. Why couldn't we just have big companies or governments throw money at the problem? Well, we are and they do. And although a lot of progress has been made, there are certain core concepts that have still not been understood.

A key distinction here is between semantics and pragmatics. Semantics basically refers to the explicit information, like the grammar, the definitions of words, and sentence structure; pragmatics refers to the implicit knowledge shared by the two people on opposite sides of an attempt at communication. (These are vastly oversimplified definitions which would probably make a real linguist shudder, but they are good enough for our purposes.) Obviously the line between these two concepts is often blurry; the easy cases are obvious, and the borderline cases have good arguments in both directions as to whether they are explicit or implicit. Different experts draw the line differently, to the point where sometimes people do not understand why anyone bothered trying to make the distinction in the first place.

I'll make a suggestion that seems slightly novel to me but probably is not really novel at all: when it comes to machine translation, the semantics are the pieces that the machine can get a handle on given time and resources, while the pragmatics for the most part are not. If you have enough information programmed in about the grammar of a language and its word definitions, then you may be able to tackle the semantic problem, such that a machine can take a sentence, parse it into a tree, and so forth. The part of the problem covered by pragmatics, however -- the implicit knowledge that the speaker and the listener must share in order to assure understanding -- is a lot harder to get done. Our ability to make a machine understand the context well enough to communicate an issue effectively in another language is pretty poor.

Look at babelfish.com and other auto-translating services. They show the concept in action, at times showing a rather brilliant ability to parse content in one language and present it in another. Yet look at what happens if I keep sending a sentence back and forth through one of them:

1  English:      Nobody goes to that club any more because it is way too crowded.
2  French:      Personne ne va désormais à ce club parce que c'est manière trop serrée.
3  English:      Nobody goes from now on to this club because c'est too tight manner.
4  French:      Personne ne va dorénavant à ce club parce que façon trop serrée de c'est.
5  English:      Nobody goes henceforth to this club because too tight way of c'est.
6  French:      Personne ne va dorénavant à ce club parce que manière trop serrée de c'est.
7  English:      Nobody goes henceforth to this club because too tight manner of c'est.
8  French:      Personne ne va dorénavant à ce club parce que façon trop serrée de c'est.
9  English:      Nobody goes henceforth to this club because too tight way of c'est.

We do eventually reach a steady state where what is being communicated is either identical back and forth or a regularly repeating cycle (in this case, once line 8 duplicated line 4 it became obvious that the repeat would keep happening forever). But it did not take long to lose a whole lot of meaning, did it?
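The steady state described above can be checked mechanically. What follows is only a sketch: it does not call any translation service, it just replays the nine outputs recorded in this post through a lookup table. The names `iterate_until_repeat`, `lines`, and `recorded` are made up for this illustration.

```python
def iterate_until_repeat(start, step, max_steps=50):
    """Apply `step` repeatedly until an output repeats.

    Returns (history, cycle_start), where cycle_start is the index in
    history of the text that came back, or None if nothing repeated
    within max_steps.
    """
    history = [start]
    seen = {start: 0}
    for _ in range(max_steps):
        nxt = step(history[-1])
        if nxt in seen:
            return history, seen[nxt]
        seen[nxt] = len(history)
        history.append(nxt)
    return history, None


# The nine numbered outputs from the experiment, in order.
lines = [
    "Nobody goes to that club any more because it is way too crowded.",
    "Personne ne va désormais à ce club parce que c'est manière trop serrée.",
    "Nobody goes from now on to this club because c'est too tight manner.",
    "Personne ne va dorénavant à ce club parce que façon trop serrée de c'est.",
    "Nobody goes henceforth to this club because too tight way of c'est.",
    "Personne ne va dorénavant à ce club parce que manière trop serrée de c'est.",
    "Nobody goes henceforth to this club because too tight manner of c'est.",
    "Personne ne va dorénavant à ce club parce que façon trop serrée de c'est.",
    "Nobody goes henceforth to this club because too tight way of c'est.",
]

# Stand-in for the translator: each recorded line maps to the line after it.
recorded = dict(zip(lines, lines[1:]))

if __name__ == "__main__":
    history, start = iterate_until_repeat(lines[0], recorded.__getitem__)
    print("cycle starts at line", start + 1)
    print("cycle length:", len(history) - start)
```

Run against the recorded data, the cycle starts at line 4 and has length 4, which matches the observation that line 8 duplicates line 4. Swapping `recorded.__getitem__` for a real round-trip function would be the obvious next step, but that part is left as an assumption here.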

This is obviously much more than just the problem of a copy of a copy of a copy over at the Xerox machine -- there is a sophisticated machine that is parsing quite a bit of the content and yet still missing the point of the sentence that any human would have picked up on quickly.

But even we humans have some doubts here about meanings. Which did I mean with the above sentence?

A) Nobody interesting goes to that club any more because it is way too crowded.

B) Nobody important goes to that club any more because it is way too crowded.

Or are both of these variations entirely off base, and maybe I was hinting at the fact that people still would go, but they cannot get in because the fire code does not allow more people to be let in? Those subtle shades of meaning, which require more context to understand, belong to the pragmatic side. They are the bits that machine translation mostly does not have a handle on. Even people can have trouble with this sort of issue!

(Of course if the club in question is Largo in Los Angeles then people still try to go so they can see Jon Brion or Aimee Mann or Michael Penn. They just watch the calendar and call ahead for reservations... but that is another story!)

Are there ways around this problem? Of course there are!

For example, if the machine has a huge sample of prior translations and can find matches there, then (provided the original human translator had a good handle on both the semantics and the pragmatics) it can do a more impressive job. But since we believe that language for us is more than just having a lot of old sentences to draw on, we generally have to look at this type of solution as a clever stopgap that is not really adding intelligence to the machine doing the translating.
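To make the "sample of prior translations" idea concrete, here is a minimal sketch of a lookup against a tiny translation memory. It is an illustration under loud assumptions, not how any real service works: the two sentence pairs in `MEMORY` are hand-written for this example, `lookup` is a hypothetical helper, and fuzzy matching is done with `difflib.get_close_matches` from the Python standard library purely for convenience.

```python
import difflib

# Hypothetical translation memory: source sentence -> prior human translation.
# These pairs are invented for illustration only.
MEMORY = {
    "Nobody goes to that club any more because it is way too crowded.":
        "Plus personne ne va dans ce club parce qu'il y a beaucoup trop de monde.",
    "Where is the train station?":
        "Où est la gare ?",
}


def lookup(sentence, memory=MEMORY, cutoff=0.8):
    """Return the stored translation of the closest known sentence.

    Returns None when nothing in the memory is similar enough, which is
    exactly the case where this approach has nothing to offer.
    """
    matches = difflib.get_close_matches(sentence, list(memory), n=1, cutoff=cutoff)
    return memory[matches[0]] if matches else None


if __name__ == "__main__":
    # A near match ("anymore" instead of "any more") still finds the old translation.
    print(lookup("Nobody goes to that club anymore because it is way too crowded."))
    # An unseen sentence gets nothing back.
    print(lookup("Machine translation is not easy."))
```

The sketch shows both sides of the stopgap: a sentence close to something a person already translated comes back with the pragmatics intact, while anything new gets no help at all.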

In the meantime, machine translation is hard. And it probably will be for quite some time....


# Eusebio Rufian-Zilbermann on 4 Dec 2004 11:14 PM:

Good translation is not easy (whether done by a machine or a human being).

Something that can be done to make it easier is reducing the complexity of the language itself. This is feasible with something like manuals: it is possible to write reasonably good manuals using a limited subset of the language (e.g., a limited set of verb tenses, simple sentence constructs only, no ambiguous words), which makes things easier for machine translators, for human translators, and for foreigners who read the original version of the manual.

It would be great to see the grammar checker in Microsoft Word extended with a new writing style called "easy to understand", and maybe then it will be possible to create a machine translator that can handle the text that passes the check.

# michkap on 4 Dec 2004 11:31 PM:

Very good point, although it assumes one is looking at the problem as a closed one where one has control over the language of the source material being translated....

Since it's not specifically something I am working on (it just interests me a lot!), I usually tend to look at the problem as an open one -- how good can we make it with any arbitrary text to translate.

Plus I am worried about the impact on a software product if such an effort were successful....
