Putting your ducks in a row

by Michael S. Kaplan, published on 2004/12/18 02:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/18/323941.aspx

Sort order has fascinated me for most of the last eight years, ever since I first saw Appendix D of the first edition of Developing International Software for Windows 95 and Windows NT. Like most people, I knew that some languages had different letters, but it had never occurred to me that letters could be ordered differently.

What I find most fascinating is that, by and large, people understand the order well enough to use it to retrieve information or to recognize when something is out of order. Yet they usually cannot clearly articulate the rules, even when they have a clear subconscious knowledge of them.

As with all good internationalization features, people really only notice sort orders when they do not meet expectations. This is something else that I find fascinating, because I have never felt like it was all about me, even if I am doing cool stuff. It is about the cool stuff.

The reason that languages have ordering is obvious -- people need to be able to find information. How else can they find words in a dictionary or names in a phone book, if there is not a deterministic ordering that they are expecting?

The principles used to create the orderings vary, depending on the language and to some extent the script. I will give some examples here....

Other languages simply stick additional letters at the end or next to ones that look similar. They may use points of articulation, or phonemic values.

More often than not, people know neither the whys nor the wherefores. But they know the right order when they see it....

In future posts I'll look into some of the more interesting trends I have noticed in some of the languages of East Asia, Southeast Asia, and South Asia.

There is another issue related to sorting that may cause some surprises. It is not how the order of certain letters in a certain context is 'the wrong way around', but how pairs of 'distinct' letters are treated as equal: one thing that struck me as odd when I moved to Finland five years ago was that the 'V' and 'W' sort as equivalent letters here (at least in dictionaries and phone books). For instance, in the phone book you will see people named 'Vahlberg' and 'Wahlberg' listed intermixed, sorted by their first name. If I understand correctly, the same is true in Sweden.
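
A minimal Python sketch of that behavior, assuming the rule is simply "treat W as V when comparing surnames, then break ties by first name". The names and the `phone_book_key` helper are made up for illustration; real Finnish or Swedish collation tables involve more than this one rule.

```python
# Hypothetical phone-book entries: (surname, first name).
entries = [
    ("Wahlberg", "Anna"),
    ("Vahlberg", "Mikko"),
    ("Vahlberg", "Aino"),
    ("Wahlberg", "Kalle"),
]

def phone_book_key(entry):
    """Build a sort key in which V and W compare as the same letter."""
    surname, first_name = entry
    # Assumption: folding W into V is enough to make the two letters equal.
    folded = surname.casefold().replace("w", "v")
    # Equal surnames fall through to the first name.
    return (folded, first_name.casefold())

ordered = sorted(entries, key=phone_book_key)
for surname, first_name in ordered:
    print(surname, first_name)
```

With this key the two surnames end up interleaved by first name, rather than all the Vahlbergs appearing before all the Wahlbergs.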

On a totally unrelated matter, I was surprised that the Armenian and Georgian characters in your post are rendered correctly in my browser. I looked at the HTML source of the page and didn't find any explicit mention of the font to use (and no embedded font either). The only font list mentioned (the classic Verdana,... set) doesn't support Armenian and Georgian characters.
I assume IE (and Firefox) are smart enough to look for a suitable font when the ones mentioned can't display a character (in my case, it found Sylfaen).
Now that I write it, I seem to remember that (as far as IE is concerned) this is MLang's job? Is that right, or did I miss the boat?

> For any one ideograph there can be only one
> count although the number of strokes counted
> may vary

The number of strokes counted varies with the number of strokes written, and there can be more than one, even in a single language. If you mean that, in a sort ordering in which one of the keys is the number of strokes, one of the numbers has to be chosen in setting the key, that is correct. But in doing lookups (both in printed dictionaries and in computerized systems) there have to be cross-references from each of the other locations where the character would be found with other accepted stroke counts.

> Thus in Japanese the pattern A (あ) I (い)
> U (う) E (え) O (お) is repeated with
> successive "rows" of the alphabet like
> Ka (か) Ki (き) Ku (く) Ke (け) Ko (こ)

Yes. (There are varying styles on a few rows.)

> and Ga (が) Gi (ぎ) Gu (ぐ) Ge (げ) Go (ご).

No, those don't get rows by themselves. They are part of the Ka row.
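
One way to sketch "Ga belongs to the Ka row" in Python is to decompose each kana with Unicode NFD, which splits が into か plus a combining voiced sound mark, and compare on the base kana first. The `kana_key` helper and the sample words are illustrative only. Using the voicing mark as a tie-breaker (unvoiced before voiced) is an assumption here, and small kana, katakana, and the long vowel mark are ignored.

```python
import unicodedata

# Combining voiced (dakuten) and semi-voiced (handakuten) sound marks.
SOUND_MARKS = {"\u3099", "\u309A"}

def kana_key(word):
    """Compare on base kana first; voicing only breaks ties."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if ch not in SOUND_MARKS)
    return (base, decomposed)

words = ["がく", "かさ", "こま", "かく", "ごま"]

by_code_point = sorted(words)            # treats が as a separate letter after か
by_row = sorted(words, key=kana_key)     # treats が as a variant of か

print(by_code_point)
print(by_row)
```

In the plain code point sort, がく lands after かさ; with the row-based key it sits right next to かく.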

> When one asks a native speaker about the
> AIUEO order the same blank stare can often
> be seen.

If one asks why the column order is AIUEO instead of some other column order, then sure. (A likely reason is that some early orthographers based the ordering on Sanskrit but so what.) If one asks why this ordering has columns and rows instead of something less, um, orderly, then an answer is more likely. This is the logical ordering.

There exists a completely different ordering based on a poem by a famous poet, using exactly once each of the syllables that existed in the Japanese language of the time. The i-ro-ha ordering is not ordinarily used in computer systems but can be used in various human systems, including everything from designating varieties in a mail-order catalog to labeling points made in a written submission in a court case. (I just tried inputting a couple of paragraphs to Word 2000 and it even supports that ordering automatically. I'd be surprised if Excel would sort in that order though.)
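
As a rough illustration, the i-ro-ha ordering can be modeled in Python as a lookup table built from the poem's 47 kana. The `IROHA` string and `iroha_rank` helper below are names invented for this sketch, not part of any library, and voiced kana are not handled.

```python
# The 47 kana of the i-ro-ha poem, in poem order.
IROHA = (
    "いろはにほへと" "ちりぬるを" "わかよたれそ" "つねならむ"
    "うゐのおくやま" "けふこえて" "あさきゆめみし" "ゑひもせす"
)

# Position of each kana in the poem, used as its sort rank.
RANK = {kana: index for index, kana in enumerate(IROHA)}

def iroha_rank(label):
    """Sort key for a single-kana label (raises KeyError for other input)."""
    return RANK[label]

labels = ["は", "い", "ろ"]
print(sorted(labels))                  # code point order: い は ろ
print(sorted(labels, key=iroha_rank))  # i-ro-ha order:    い ろ は

# Labeling the first three points of a list, i-ro-ha style.
points = ["first point", "second point", "third point"]
for kana, text in zip(IROHA, points):
    print(f"{kana}) {text}")
```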

> Other languages simply stick additional
> letters at the end or next to ones that look
> similar.

Or both. After Latin stuck X and Z on the end instead of putting them near where Greek put them, English inherited that. But then English put J near I instead of near the end, and English put Y near the end instead of near I. Consistency is the last refuge of an uncreative programmer.