Putting your ducks in a row
by Michael S. Kaplan, published on 2004/12/18 02:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/18/323941.aspx
Collation fascinates me.
It has fascinated me for most of the last eight years, since the first time I saw Appendix D of the first edition of Developing International Software for Windows 95 and Windows NT. Like most people, I knew that some languages had different letters but it had just never occurred to me that letters would ever be ordered differently.
What I find most fascinating is that by and large people understand the order well enough to make use of it to retrieve information or to recognize when it is out of order. Yet they usually cannot clearly articulate the rules even when they have a clear subconscious knowledge of it.
As with all good internationalization features, people really only notice collation when it does not meet their expectations. This is something else that I find fascinating because I have never felt like it was all about me, even if I am doing cool stuff. It is about the cool stuff.
The reason that languages have ordering is obvious -- people need to be able to find information. How else can they find words in a dictionary or names in a phone book, if there is not a deterministic ordering that they are expecting?
The principles used to create the orderings vary, depending on the language and to some extent the script. I will give some examples here....
- Some ideographic languages (such as Chinese and Japanese) use a stroke based ordering. This can either be a simple count of the total number of strokes used in the ideograph or it can be a radical/stroke based ordering where the basic radicals all have a specific order and then all of the strokes outside of the radical are then counted. For any one ideograph there can be only one count although the number of strokes counted may vary depending on whether the unified ideograph is Hanja (Korean), Kanji (Japanese), or Han (Chinese). Since there are generally under 50 strokes on even the most complex ideographs yet there are at minimum tens of thousands of them, there are many duplicates for each count and thus there must be a secondary method describing how to break those ties. Examples of tie breaking methods include order in a National standard, order in Unicode, or the order in which the strokes are ordered according to ideal construction.
- Some ideographic languages (such as Chinese and Japanese and Korean) use a pronunciation based ordering. Like with the stroke based orderings there will be many duplicates, but unlike the stroke based ordering a single ideograph can have multiple pronunciations. This problem is eased in some cases by the fact that a dictionary has a clear interest in placing a word in multiple positions if there are multiple pronunciations with different meanings and the order of the dictionary is based on pronunciation. Beyond that, computers generally solve the problem either by allowing the user to store the pronunciation so they can easily find an entry later or by using the most popular/common/generic pronunciation.
- Some languages base the ordering on an older language (upon which the current language was based). The Greek Alpha (Α) Beta (Β) Gamma (Γ) Delta (Δ) compared to the Cyrillic A (А) Be (Б) V (В) Ghe (Г) De (Д) compared to the Hebrew Alef (א) Bet (בּ) Vet (ב) Gimel (ג) Daled (ד) compared to the Armenian Ayb (Ա) Ben (Բ) Gim (Գ) Da (Դ) compared to the Georgian An (ა) Ban (ბ) Gan (გ) Don (დ) obviously have some sort of shared heritage of ordering. At this point we just do it this way because they did it that way before, and any attempt to ask about the order will usually elicit a blank stare.
- Some languages which use the same letters as another change the ordering based on phonemic principles (thus in Lithuanian the letter Y sorts after the letter I, rather than nearer to the end of the alphabet).
- Some languages which use similar letters as another put the letters after the letters to which they are similar. Thus A Macron (Ā) comes after A in most languages that use it, as does A Ring (Å).
- Some languages consider those similar letters to nevertheless be entirely separate letters which are put at the end of the Alphabet. Thus in Swedish A Ring (Å) is a unique letter that comes after Z, rather than a variation of the letter A.
- Some languages use a regular, repeated order combining consonant and vowel sounds. Thus in Japanese the pattern A (あ) I (い) U (う) E (え) O (お) is repeated with successive "rows" of the alphabet like Ka (か) Ki (き) Ku (く) Ke (け) Ko (こ) and Ga (が) Gi (ぎ) Gu (ぐ) Ge (げ) Go (ご). When one asks a native speaker about the AIUEO order the same blank stare can often be seen.
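
To make the stroke based ordering from the first bullet concrete, here is a minimal Python sketch, not how any particular system does it. It sorts by total stroke count and breaks ties by order in Unicode, one of the tie breaking methods mentioned above. The `STROKES` table and the `stroke_key` name are made up for illustration, and the table covers only six very simple ideographs.

```python
# Illustrative stroke counts for a few simple ideographs. A real table
# would come from a dictionary or a national standard, not from this sketch.
STROKES = {"一": 1, "二": 2, "人": 2, "三": 3, "大": 3, "山": 3}


def stroke_key(ch):
    """Primary key: total stroke count. Tie-break: Unicode code point."""
    return (STROKES[ch], ord(ch))


chars = ["山", "人", "三", "一", "大", "二"]

# Stroke count first, code point only to break ties.
by_stroke = sorted(chars, key=stroke_key)

# Plain code point order, for comparison.
by_codepoint = sorted(chars)

print(by_stroke)     # ['一', '二', '人', '三', '大', '山']
print(by_codepoint)  # ['一', '三', '二', '人', '大', '山']
```

Even with six characters the two orders disagree, since 三 (three strokes) has a lower code point than 二 (two strokes).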
Other languages simply stick additional letters at the end or next to ones that look similar. They may use points of articulation, or phonemic values.
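
The same letters can land in different places depending on which of these principles a language follows. The Python sketch below is only an illustration under simplifying assumptions. The alphabets are cut down to the basic Latin letters plus the few letters discussed above (the real Lithuanian and Swedish alphabets have more going on), the text is assumed to use precomposed characters, and `make_key` and the alphabet names are hypothetical helpers rather than anything from a real collation library.

```python
def make_key(alphabet):
    """Build a sort key function that ranks letters by their position in `alphabet`."""
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word):
        # Letters missing from the alphabet sort after all known letters.
        return [rank.get(ch, len(rank)) for ch in word.lower()]

    return key


# Simplified alphabets, for illustration only.
ENGLISH = "abcdefghijklmnopqrstuvwxyz"
LITHUANIAN_LIKE = "abcdefghiyjklmnopqrstuvwxz"    # Y placed right after I
SWEDISH_LIKE = "abcdefghijklmnopqrstuvwxyzåäö"    # Å is a separate letter after Z
A_RING_NEAR_A = "aåbcdefghijklmnopqrstuvwxyz"     # Å placed right after A

lt_words = ["jis", "yra", "ir"]
print(sorted(lt_words, key=make_key(ENGLISH)))          # ['ir', 'jis', 'yra']
print(sorted(lt_words, key=make_key(LITHUANIAN_LIKE)))  # ['ir', 'yra', 'jis']

sv_words = ["zebra", "åka", "apa"]
print(sorted(sv_words, key=make_key(SWEDISH_LIKE)))     # ['apa', 'zebra', 'åka']
print(sorted(sv_words, key=make_key(A_RING_NEAR_A)))    # ['apa', 'åka', 'zebra']
```

A real collation implementation does considerably more than this (multiple comparison levels, contractions, normalization), so treat the sketch as a picture of the idea and nothing else.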
More often than not, people know neither the whys nor the wherefores. But they know the right order when they see it....
In future posts I'll look into some of the more interesting trends I have noticed in some of the languages of East Asia, Southeast Asia, and South Asia.
# Luc Cluitmans on 18 Dec 2004 3:57 AM:
There is another issue related to sorting that may cause some surprises. It is not how the order of certain letters in a certain context is 'the wrong way around', but how pairs of 'distinct' letters are treated as equal: one thing that struck me as odd when I moved to Finland five years ago was that the 'V' and 'W' sort as equivalent letters here (at least in dictionaries and phone books). For instance, in the phone book you will see people named 'Vahlberg' and 'Wahlberg' listed intermixed, sorted by their first name. If I understand correctly, the same is true in Sweden.
# Michael Kaplan on 18 Dec 2004 4:20 AM:
Yes, there are many instances like this -- if you click on that link I put in for Appendix D it's one of the most visible differences in the Swedish/Finnish sort.
I also find that sort fascinating given how totally different Finnish and Swedish are as languages -- yet they still sort things the same way!
# Serge Wautier on 18 Dec 2004 6:03 AM:
On a totally unrelated matter, I was surprised that the Armenian and Georgian characters of your post are rendered correctly in my browser. I looked into the source of the HTML page and didn't find any explicit mention of the font to use (no embedded font either). The only mentioned font (the classical Verdana,... set) doesn't support Armenian and Georgian chars.
I assume IE (and Firefox) are smart enough to look for a suitable font when the ones mentioned can't display a character (In my case, it found sylfaen).
Now that I write it, I seem to remember that (as far as IE is concerned) this is MLang's job? Is that right or did I miss some boat?
# Pavel Šrubař on 18 Dec 2004 6:22 AM:
Many odd rules were standardized in the pre-computer age, which drives us, the poor programmers, nuts.
Guess how our kings ought to be collated in a phone book according to the Czech standard ČSN 01 0181?
Wrong. Roman numerals sort by the phonetic equivalent:
Charles V. (Charles the <b>fi</b>fth)
Charles IV. (Charles the <b>fo</b>urth)
Charles III. (Charles the <b>t</b>hird)
# Michael Kaplan on 18 Dec 2004 10:51 AM:
MLang is definitely the biggest aid in that work, though the OS also has font linking that sometimes gets used when it knows something that MLang does not (which is admittedly not too often).
It is very cool though. :-)
# Michael Kaplan on 18 Dec 2004 10:55 AM:
I have never read a single attempt at a National standard for sorting that captured a real world usage of a language that users who did not work in standards expected.
You may have just proved it again with ČSN 01 0181. Wow, that is very odd!
(FYI -- Microsoft does not support it for the Czech locale's sort, though I guess we have the excuse that there is no Czech Phonebook sort there. :-)
# Norman Diamond on 27 Dec 2004 4:27 PM:
> For any one ideograph there can be only one
> count although the number of strokes counted
> may vary
The number of strokes counted varies with the number of strokes written, and there can be more than one, even in a single language. If you mean that, in a sort ordering in which one of the keys is the number of strokes, one of the numbers has to be chosen in setting the key, that is correct. But in doing lookups (both in printed dictionaries and in computerized systems) there have to be cross-references from each of the other locations where the character would be found with other accepted stroke counts.
> Thus in Japanese the pattern A (あ) I (い)
> U (う) E (え) O (お) is repeated with
> successive "rows" of the alphabet like
> Ka (か) Ki (き) Ku (く) Ke (け) Ko (こ)
Yes. (There are varying styles on a few rows.)
> and Ga (が) Gi (ぎ) Gu (ぐ) Ge (げ) Go (ご).
No, those don't get rows by themselves. They are part of the Ka row.
> When one asks a native speaker about the
> AIUEO order the same blank stare can often
> be seen.
If one asks why the column order is AIUEO instead of some other column order, then sure. (A likely reason is that some early orthographers based the ordering on Sanskrit but so what.) If one asks why this ordering has columns and rows instead of something less, um, orderly, then an answer is more likely. This is the logical ordering.
There exists a completely different ordering based on a poem by a famous poet, using exactly once each of the syllables that existed in the Japanese language of the time. The i-ro-ha ordering is not ordinarily used in computer systems but can be used in various human systems, including everything from designating varieties in a mail-order catalog to labeling points made in a written submission in a court case. (I just tried inputting a couple of paragraphs to Word 2000 and it even supports that ordering automatically. I'd be surprised if Excel would sort in that order though.)
> Other languages simply stick additional
> letters at the end or next ones that look
Or both. After Latin stuck X and Z on the end instead of putting them near where Greek put them, English inherited that. But then English put J near I instead of near the end, and English put Y near the end instead of near I. Consistency is the last refuge of an uncreative programmer.