In a much better position to handle inserts

by Michael S. Kaplan, published on 2007/07/22 09:51 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/22/3998892.aspx

One of the common complaints that localizers have when it comes to localizability of software projects comes into play when developers have strings with inserts that they run through the C runtime.

(I'll assume that Unicode is being used but that is often a bad assumption; a topic for another day, I'm sure!)

If you have a string with more than one "fill in at runtime" insert in it, and that string has to be localized, then there is almost certainly going to be at least one language (and usually several) where the word order of the target language is different enough that there is no way to produce the properly localized string without sounding like one is retarded or does not know the language.

But luckily I got to point out to that localizer that there was a fix for this limitation available as of the VS 2005 version of the C Runtime -- positional insert parameters!

Which means that most of the developer's work in replacing one set of functions with the other requires minimal thought, and since the old functions had an express input parameter ordering, updating the format strings is a trivial (and entirely automatable!) operation.

As a fun bonus, you are allowed to re-use parameters by just repeating their insert numbers (previously you'd have to pass the same parameter more than once), which is also quite trivial and automatable. This last part is entirely optional but is kind of a cool extra feature that one gets for free when you are allowed to specify the order of parameters....

The one surprise for me was the 1-based indexes, but that is akin to the Win32 FormatMessage function (previously the only way to do this sort of thing). The syntax of the inserts is also slightly different from that of FormatMessage, but I think I can live with that since so much effort was going into staying compatible with the existing functions. :-)
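To make the idea concrete, here is a minimal sketch of positional inserts, assuming the `%n$` syntax accepted by the `_p` variants of the CRT formatting functions (`_sprintf_p` here). The helper name `format_two` is made up for illustration. On POSIX C libraries the same syntax is accepted by plain `snprintf`, so the sketch falls back to that when not building with the Microsoft compiler.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of any library): formats two string
// arguments through a format string that uses 1-based positional inserts
// such as %1$s and %2$s.
std::string format_two(const char* fmt, const char* first, const char* second) {
    char buf[256];
#ifdef _MSC_VER
    // Microsoft CRT: positional inserts require the _p variant.
    int n = _sprintf_p(buf, sizeof buf, fmt, first, second);
#else
    // POSIX C libraries accept %n$ in the regular printf family.
    int n = std::snprintf(buf, sizeof buf, fmt, first, second);
#endif
    return n < 0 ? std::string() : std::string(buf);
}
```

With this helper, a localizer who needs the arguments in the other order only changes the format string, for example from `"%1$s %2$s"` to `"%2$s %1$s"`, and can repeat an insert number (`"%2$s %1$s %2$s"`) instead of asking the developer to pass the same argument twice.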

So, let's shout it from the rooftops so all of the developers using the CRT can hear -- they can now be in the position to have localizable string inserts!

Now the next step might be to suggest this change to the C standard folks so that this code can be more portable (which is always important), but for now I'll just be very happy with the fix to this long-standing limitation in Microsoft's CRT....

In any language where verbs or adjectives are inflected for gender, number and/or tense, the translated phrase will sound or seem retarded.

Example: English has two grammatical numbers, singular and plural. The first, quick-and-dirty prototype version sidesteps this by displaying “42 file(s)”. Then, as the core functionality is complete, the well-intentioned but grossly misguided developer says, “This ‘(s)’ is retarded. No reason I wouldn’t be able to distinguish singular from plural.”

He then writes some code, finds out that singular does not mean exactly one but also any number 10*k+1 except for 100*m+11, for all non-negative integer k and m; then, that not all plurals go with just -s but some go with -es and a few rare ones have completely unrelated words for plural (sock/soxen, mouse/meeces, comma/commata, mongoose/polygoose etc.); and then that zero is a special case that looks better as “No files found” rather than “0 files found”.

All that should alert the more seasoned developer that he is missing something, but “well, it’s no rocket science”, right?

Then someone proposes to translate the UI into Japanese. Japanese has one grammatical number. “This two localizable string, what for is?”* — “You see, English has two forms, one used when referring to a single object, another when talking of multiple objects.” — “Ah, understood. Same text let’s put.”* Problem solved — but in a way that introduces a duplication which should also alert one that something is wrong.

Afterwards, a Russian localization is started. Russian pretends to have two grammatical numbers, singular and plural, but actually also has dual, which is used for 10*k+{2, 3, 4} except for 100*m+{12, 13, 14}. Now, if the localizer is a native speaker, he/she asks, “And why only two strings for numbers, and not three? How we will express 2, 3 and 4?”* Problem solved, by adding more linguistic knowledge into the code.

Or, if the localizer is lazy and ignorant, users get the plural where the dual should be, as in “42 fajlov” instead of “42 fajla”. Or if the localizer is not so ignorant, we get “42 fajla(ov)”. Problem solved, by reintroducing the original problem. Same balls, flank view.

Further complications arise when the correct phrasing is gender-specific and no gender knowledge is available in the application context. Let’s take for example the attribution line in an email reply template:

> On (date/time), John Smith** wrote:

> On (date/time), Jane Johnson** wrote:

English has no gender inflection for verbs. No problem, let’s hardcode the template, just leave three inserts for the date, two for the time and one for AM/PM, and maybe one for the day of the week, and two for their first and last name.

Japanese has no gender inflection at all. No problem. Right?

Bzzzt! Wrong on two counts. First, the Japanese use the logical order for date components, YYYY-MM-DD, as opposed to the American MM/DD/YYYY and European DD.MM.YYYY. Second, they write the last name first, again putting the primary sort key before the secondary one. So, let’s switch to positional inserts.
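The date reordering described here might look roughly like the sketch below. The three template strings and the helper `format_date` are illustrative assumptions, not anything from a real product, and a real application would normally ask the operating system for locale-aware date formatting instead. The point is only that the code passes year, month and day in one fixed order while each locale's template decides where they land.

```cpp
#include <cstdio>
#include <string>

// Hypothetical per-locale templates. In every template,
// argument 1 = year, argument 2 = month, argument 3 = day.
const char* const kDateUS = "%2$02d/%3$02d/%1$04d";  // MM/DD/YYYY
const char* const kDateJP = "%1$04d-%2$02d-%3$02d";  // YYYY-MM-DD
const char* const kDateEU = "%3$02d.%2$02d.%1$04d";  // DD.MM.YYYY

// Hypothetical helper: the call site never changes, only the template does.
std::string format_date(const char* fmt, int year, int month, int day) {
    char buf[64];
#ifdef _MSC_VER
    // Microsoft CRT: positional inserts require the _p variant.
    int n = _sprintf_p(buf, sizeof buf, fmt, year, month, day);
#else
    // POSIX C libraries accept %n$ in the regular printf family.
    int n = std::snprintf(buf, sizeof buf, fmt, year, month, day);
#endif
    return n < 0 ? std::string() : std::string(buf);
}
```

Calling `format_date(kDateJP, 2007, 7, 22)` and `format_date(kDateUS, 2007, 7, 22)` uses the same argument list; only the localized string differs.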

Now here come the real problems. Russian has gender for verbs (but only in the past tense) and adjectives and most everything else. So, “(date/time) Ivan Kuznetsov** pisal”, but “(date/time) Marija Petrova** pisala”. Now we have to add a gender field to our address book, and face the problem of what gender to use when the person being quoted is not in it. Also, the months, when written as part of the date, are in the genitive case. Luckily, Windows has a special flag that allows us to get month names in genitive. And, while we are at it, in formal speech, it is customary to address those above you by name and patronymic, and refer to them by name, patronymic and surname. Good luck explaining to your American- or European-centered client what the hell a patronymic is, and when and where to use it. (The Japanese have it easy: in mail, address everyone as -sama, even if personally you are on “you bastard” terms.)

All in all, localization of strings that are outside the scope of menus and dialog controls is a problem which requires more linguistic knowledge to be added to the code, wants more input from the user (is someone by the nick of Angel whom you’ve never met a male or a female (if any)?), looks retarded (“Alexander Sidorov** pisal(a)”), or any non-empty combination of the above.

___

* Localizers’ phrases are intentionally distorted to mimic the grammar of their respective languages.

** Names are chosen to look like typical names in their respective cultures, but are otherwise fictional. No relation to any people, living or dead, is intended.

@Kemp

Okay, so in English singular is only used for exactly 1. This demonstrates my point even better.

First, you hardcode "wcout << n << (n == 1 ? singular : plural)" into your numbered-objects-output function.

Then, when it’s time to localize for Russia, your language experts say that, despite all the common sense, in Russian all of {21, 31, …, 91, 101, 121, … } go with singular. Do you rewrite the function to "wcout << n << ((n == 1 || (language == LANG_RUSSIAN && n % 10 == 1 && n % 100 != 11)) ? singular : plural)"?

Then there is this strange dual number that goes with numbers ending in 2, 3 or 4. Except for any numbers in ranges 100*k+11 through 100*k+14, which are always plural no matter what. What do you do? Do you rewrite the function to read "wcout << n << ((n % 100 >= 11 && n % 100 <= 14) ? plural : n % 10 == 1 ? singular : (n % 10 >= 2 && n % 10 <= 4) ? dual : plural)", adding a new parameter to the function? What if later you have another language that also has trial number (used for quantities of exactly 3 or maybe b*k+3, where b is the number base primarily used in that language)?

I hereby posit that, in order to allow for non-retarded localization, the localization module must include an interpreter of a scripting language such as ECMAScript or Common LISP, that would be constrained enough that malicious localizers couldn’t take over your computer, but expressive enough to implement the declension rules for any given language. This way, all the language-specific knowledge can be isolated in the localization add-on for that language.
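As a much simpler stand-in for that interpreter, the sketch below isolates each language's number rule behind a callback supplied by a hypothetical "language pack". All the names here (`LanguagePack`, `pick_english`, `pick_russian`, `count_phrase`) are invented for illustration. A native callback is not sandboxed the way a constrained scripting language would be, so this shows only the separation of knowledge, not the safety property.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical "language pack": the word forms plus a rule that maps a
// count to an index into those forms. The application never looks inside
// the rule; it only calls it.
struct LanguagePack {
    std::vector<std::string> forms;
    std::size_t (*pick)(unsigned long n);
};

// English: singular for exactly 1, plural otherwise.
std::size_t pick_english(unsigned long n) { return n == 1 ? 0 : 1; }

// Russian, following the rules described in the comments above:
// 11..14 (mod 100) take form 2; otherwise a last digit of 1 takes form 0,
// a last digit of 2..4 takes form 1, and everything else takes form 2.
std::size_t pick_russian(unsigned long n) {
    unsigned long d = n % 10, h = n % 100;
    if (h >= 11 && h <= 14) return 2;
    if (d == 1) return 0;
    if (d >= 2 && d <= 4) return 1;
    return 2;
}

const LanguagePack kEnglish = {{"file", "files"}, pick_english};
const LanguagePack kRussian = {{"fajl", "fajla", "fajlov"}, pick_russian};

// Language-neutral application code: no per-language branches here.
std::string count_phrase(const LanguagePack& lang, unsigned long n) {
    return std::to_string(n) + " " + lang.forms[lang.pick(n)];
}
```

Adding a language with a single form, or one with four, then means shipping a new pack rather than adding another branch to the output function.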