Putting your ducks in a row

by Michael S. Kaplan, published on 2004/12/18 02:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/18/323941.aspx


Collation fascinates me.

It has fascinated me for most of the last eight years, since the first time I saw Appendix D of the first edition of Developing International Software for Windows 95 and Windows NT. Like most people, I knew that some languages had different letters, but it had just never occurred to me that letters would ever be ordered differently.

What I find most fascinating is that, by and large, people understand the order well enough to use it to retrieve information or to recognize when something is out of order. Yet they usually cannot clearly articulate the rules, even when they clearly know them subconsciously.

As with all good internationalization features, people really only notice collation when it does not meet expectations. This is something else that I find fascinating because I have never felt like it was all about me, even if I am doing cool stuff. It is about the cool stuff.

The reason that languages have ordering is obvious -- people need to be able to find information. How else can they find words in a dictionary or names in a phone book, if there is not a deterministic ordering that they are expecting?

The principles used to create the orderings vary, depending on the language and to some extent the script. I will give some examples here....

Other languages simply stick additional letters at the end or next to ones that look similar. They may use points of articulation, or phonemic values.

More often than not, people know neither the whys nor the wherefores. But they know the right order when they see it....

In future posts I'll look into some of the more interesting trends I have noticed in some of the languages of East Asia, Southeast Asia, and South Asia.


# Luc Cluitmans on 18 Dec 2004 3:57 AM:

There is another issue related to sorting that may cause some surprises. It is not how the order of certain letters in a certain context is 'the wrong way around', but how pairs of 'distinct' letters are treated as equal: one thing that struck me as odd when I moved to Finland five years ago was that the 'V' and 'W' sort as equivalent letters here (at least in dictionaries and phone books). For instance, in the phone book you will see people named 'Vahlberg' and 'Wahlberg' listed intermixed, sorted by their first name. If I understand correctly, the same is true in Sweden.
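A rough sketch of how that phone-book behaviour could be modelled: fold 'W' into 'V' for the main comparison and use the original spelling only to break ties. The names `fold_vw` and `phonebook_key` and the sample first names are made up for illustration, and real Finnish/Swedish collation involves far more than this one rule.

```python
def fold_vw(text: str) -> str:
    """Treat W as V for comparison purposes (illustrative rule only)."""
    return text.casefold().replace("w", "v")


def phonebook_key(entry: tuple[str, str]) -> tuple[str, str, str]:
    surname, first_name = entry
    # Primary: surname with W folded into V; secondary: first name;
    # tertiary: the original surname, so the result stays deterministic.
    return (fold_vw(surname), fold_vw(first_name), surname)


# Hypothetical sample data: surnames from the comment, invented first names.
entries = [
    ("Wahlberg", "Anna"),
    ("Vahlberg", "Mika"),
    ("Wahlberg", "Ville"),
    ("Vahlberg", "Bertil"),
]
sorted_entries = sorted(entries, key=phonebook_key)
```

With this key the two spellings end up intermixed and the first name decides the order, which is the effect described above.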

# Michael Kaplan on 18 Dec 2004 4:20 AM:

Yes, there are many instances like this -- if you click on that link I put in for Appendix D, it's one of the most visible differences in the Swedish/Finnish sort.

I also find that sort fascinating given how totally different Finnish and Swedish are as languages -- yet they still sort things the same way!

# Serge Wautier on 18 Dec 2004 6:03 AM:

On a totally unrelated matter, I was surprised that the Armenian and Georgian characters of your post are rendered correctly in my browser. I looked into the source of the HTML page and didn't find any explicit mention of the font to use (no embedded font either). The only mentioned font (the classical Verdana,... set) doesn't support Armenian and Georgian chars.
I assume IE (and Firefox) are smart enough to look for a suitable font when the ones mentioned can't display a character (in my case, it found Sylfaen).
Now that I write it, I seem to remember that (as far as IE is concerned) this is MLang's job? Is that right, or did I miss some boat?

# Pavel Šrubař on 18 Dec 2004 6:22 AM:

Many odd rules were standardized in the pre-computer age, which drives us, the poor programmers, nuts.
Guess how our kings ought to be collated in a phone book according to the Czech standard ČSN 01 0181?
Charles III.
Charles IV.
Charles V.
Wrong. Roman numerals sort by the phonetic equivalent:
Charles V. (Charles the <b>fi</b>fth)
Charles IV. (Charles the <b>fo</b>urth)
Charles III. (Charles the <b>t</b>hird)
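Taking the description above at face value, the rule can be sketched as a sort key that swaps each Roman numeral for its spoken form before comparing. The `SPOKEN` table and `spoken_key` are hypothetical, and the English ordinals are stand-ins, just as in the comment. This is not the text of ČSN 01 0181.

```python
# Hypothetical lookup table: Roman numeral token -> spoken ordinal
# (English stand-ins, as in the comment above).
SPOKEN = {"III.": "third", "IV.": "fourth", "V.": "fifth"}


def spoken_key(name: str) -> tuple[str, ...]:
    # Replace every token found in the table with its spoken form,
    # then compare the tokens case-insensitively.
    return tuple(SPOKEN.get(token, token).casefold() for token in name.split())


kings = ["Charles III.", "Charles IV.", "Charles V."]
by_sound = sorted(kings, key=spoken_key)
```

Here `by_sound` comes out as Charles V., Charles IV., Charles III., because "fifth" < "fourth" < "third".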

# Michael Kaplan on 18 Dec 2004 10:51 AM:

MLang is definitely the biggest aid in that work, though the OS also has font linking that sometimes gets used when it knows something that MLang does not (which is admittedly not too often).

It is very cool though. :-)

# Michael Kaplan on 18 Dec 2004 10:55 AM:

I have never read a single attempt at a national standard for sorting that captured the real-world usage of a language as expected by users who do not work in standards.

You may have just proved it again with ČSN 01 0181. Wow, that is very odd!

(FYI -- Microsoft does not support it for the Czech locale's sort, though I guess we have the excuse that there is no Czech Phonebook sort there. :-)

# Norman Diamond on 27 Dec 2004 4:27 PM:

> For any one ideograph there can be only one
> count although the number of strokes counted
> may vary

The number of strokes counted varies with the number of strokes written, and there can be more than one, even in a single language. If you mean that, in a sort ordering in which one of the keys is the number of strokes, one of the numbers has to be chosen in setting the key, that is correct. But in doing lookups (both in printed dictionaries and in computerized systems) there have to be cross-references from each of the other locations where the character would be found with other accepted stroke counts.

> Thus in Japanese the pattern A (あ) I (い)
> U (う) E (え) O (お) is repeated with
> successive "rows" of the alphabet like
> Ka (か) Ki (き) Ku (く) Ke (け) Ko (こ)

Yes. (There are varying styles on a few rows.)

> and Ga (が) Gi (ぎ) Gu (ぐ) Ge (げ) Go (ご).

No, those don't get rows by themselves. They are part of the Ka row.
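One way to picture "part of the Ka row" in code: compare words by their unvoiced base kana first and look at the voicing marks only to break ties. The sketch below relies on two facts: Unicode NFD decomposition splits が into か plus a combining voicing mark, and the hiragana code points broadly follow the AIUEO row order. The name `gojuon_key` is invented here, and real Japanese collation has more rules than this (katakana, small kana, long-vowel marks).

```python
import unicodedata


def gojuon_key(word: str) -> tuple[str, str]:
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (the voicing marks U+3099 / U+309A) to get base kana.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Base kana decide first; the decomposed form breaks ties, unvoiced first.
    return (base, decomposed)


words = ["かき", "がい", "がく", "かく"]
ordered = sorted(words, key=gojuon_key)
```

A plain code point sort would put every か word ahead of every が word. With this key, がい sorts before かき because い precedes き, and voicing only matters between かく and がく.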

> When one asks a native speaker about the
> AIUEO order the same blank stare can often
> be seen.

If one asks why the column order is AIUEO instead of some other column order, then sure. (A likely reason is that some early orthographers based the ordering on Sanskrit but so what.) If one asks why this ordering has columns and rows instead of something less, um, orderly, then an answer is more likely. This is the logical ordering.

There exists a completely different ordering based on a poem by a famous poet, using exactly once each of the syllables that existed in the Japanese language of the time. The i-ro-ha ordering is not ordinarily used in computer systems but can be used in various human systems, including everything from designating varieties in a mail-order catalog to labeling points made in a written submission in a court case. (I just tried inputting a couple of paragraphs to Word 2000 and it even supports that ordering automatically. I'd be surprised if Excel would sort in that order though.)
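For labeling purposes, the i-ro-ha ordering amounts to ranking each kana by its position in the poem. A minimal sketch, with `IROHA`, `RANK` and `iroha_key` as illustrative names. Sending kana that are not in the poem to the end is an arbitrary choice made here, not an established rule.

```python
# The 47 kana of the iroha poem, in poem order.
IROHA = (
    "いろはにほへと" "ちりぬるを" "わかよたれそ" "つねならむ"
    "うゐのおくやま" "けふこえて" "あさきゆめみし" "ゑひもせす"
)
RANK = {kana: position for position, kana in enumerate(IROHA)}


def iroha_key(word: str) -> list[int]:
    # Kana outside the poem (ん, voiced forms) sort after everything else here.
    return [RANK.get(ch, len(IROHA)) for ch in word]


labels = ["は", "に", "い", "ろ"]
in_iroha_order = sorted(labels, key=iroha_key)
```

The labels come back as い, ろ, は, に, the same sequence one sees on lettered items that use this scheme.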

> Other languages simply stick additional
> letters at the end or next ones that look
> similar.

Or both. After Latin stuck X and Z on the end instead of putting them near where Greek put them, English inherited that. But then English put J near I instead of near the end, and English put Y near the end instead of near I. Consistency is the last refuge of an uncreative programmer.

referenced by

2006/01/03 'Acceptable' Japanese sort order?

2005/11/18 Some sort of order to collation

2004/12/20 IMEs? They have it easy....
