Some sort of order to collation

by Michael S. Kaplan, published on 2005/11/18 04:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/18/492339.aspx


Last weekend, Dean Harding commented when I was talking about preferring uppercase before lowercase or vice-versa:

The whole idea of sorting (at least for latin-based scripts) is just convention anyway... I mean, you may well ask "why should 'A' come before 'a'?" but then why not ask "why should 'a' come before 'z'?" there doesn't seem to be any actual reasoning behind the 'sort order' of our alphabet at all!

Unless, I'm missing something here...

Dean wasn't missing anything here, no. But I thought that might be worth a little discussion. :-)

I remember Cathy and I doing a presentation a while back for the group about collation (similar to the ones we did at IUC22 and IUC23), and we talked about how, to someone speaking English or German, Å comes after A, while in Swedish it comes after Z. After the presentation, a colleague (and native Swede), Anna, came up to gently correct us: although people speaking English might make that 'incorrect' choice, people speaking German would not. We stuck to our guns, and it was only after a bit of investigation that she realized that Germans 'got it wrong', too!
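To make the Å example concrete, here is a minimal sketch in Python of the two conventions. It is only an illustration under simplifying assumptions, not how any real collation implementation works: the names `SWEDISH_RANK`, `swedish_key`, and `base_letter_key` are hypothetical, the Swedish alphabet is reduced to a plain list of 29 letters, and the English/German convention is approximated by stripping diacritics and breaking ties by code point.

```python
import unicodedata

# Hypothetical, simplified alphabet: in Swedish, å, ä and ö are
# separate letters that follow z.
SWEDISH_RANK = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyzåäö")}


def swedish_key(text):
    """Rank each letter by its position in the (simplified) Swedish alphabet."""
    folded = unicodedata.normalize("NFC", text).lower()
    # Assumption: characters outside the alphabet sort after it, by code point.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK) + ord(ch)) for ch in folded]


def base_letter_key(text):
    """Approximate the English/German convention: Å is an A with a mark on it."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks so that only the base letters are compared first.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original text breaks ties, so plain A lands before Å.
    return (base, text)


letters = ["Z", "Å", "B", "A"]
print(sorted(letters, key=swedish_key))      # ['A', 'B', 'Z', 'Å']
print(sorted(letters, key=base_letter_key))  # ['A', 'Å', 'B', 'Z']
```

The same four letters come out in two different orders, and both are 'right' for the people who expect them.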

Then last year, Kieran suggested that I check out The Language Instinct: How the Mind Creates Language by Steven Pinker. It is one of those great tomes that can speak to people who are not necessarily linguists, and help anyone understand more about language than they ever did previously. I knew I was learning something when over half of the linguistic factoids that were being shared with me each day were things that I had also read in the book. :-)

Anyway, given how fascinated I was by collation, I admit I was at first disappointed that so little was said about the subject in the book, which was really blowing me away otherwise. There is only a brief mention, in Chapter 8:

...for the same reason that alphabetical order is similar across the Hebrew, Greek, Roman, and Cyrillic alphabets. There is nothing special about alphabetical order; it was just the order that the Canaanites invented, and all Western alphabets came from theirs.

I admit that I felt a little odd having such a passion about something that apparently had very little meaning in a 'linguistic' sense. It was probably about that time that I coined the term 'delusions of linguistic aptitude' to describe myself. :-)

Now I got over it pretty quickly, because I realized that this one sentence did not invalidate my interest, and it certainly didn't invalidate the importance of it given all of the places that collation is used. It was shortly after that time that I posted Putting Your Ducks in a Row about different 'alphabetical orders' -- the large degree of variation between them and some of the many different principles behind them. And this does not even get into the fact that people are used to them and are confused any time things are not in the order that they expect (which is usually alphabetical order).

I am still fascinated by collation, probably at least in some part now because, as cool as language as an instinct is, the fact that collation is so ingrained in people that they do not even realize it's not an instinct (hell, in most cases they do not even realize it varies between languages in a single script!)....

 

This post brought to you by "Ա" (U+0531, a.k.a. ARMENIAN CAPITAL LETTER AYB)


# orcmid on 18 Nov 2005 12:55 PM:

Thanks for the recommendation on "The Language Instinct." To achieve the free shipping from amazon.com, I also ordered "How the Mind Works" and "Blank Slate." I'd avoided Pinker because I probably concluded too readily that he was leaning too heavily toward biological determinism. My loss, I think.

Do you have other works to recommend? I'm interested in language as related to information processing and also the business of Unicode and all of the complex ways that the codes are used in expressing texts of different languages. I have the Unicode 3.0 and 4.0 books but would like some other guide to practice.

[I have this vague recollection that code-point sorting reverses the "a" (lower-case) versus "A" (upper-case) choice in EBCDIC versus ASCII, and that might still influence some views on the matter for geeks in Roman togas. I suspect a fading generational difference in what is thought to be the "right" answer among techies.]

# Petr Kadlec on 19 Nov 2005 8:50 AM:

Re: "I felt a little odd having such a passion about something that apparently had very little meaning in a 'linguistic' sense."

There are languages that 'do' have some meaning in their alphabets: ;-)

This script was not in origin an 'alphabet', that is, a haphazard series of letters, each with an independent value of its own, recited in a traditional order that has no reference either to their shapes or to their functions. It was, rather, a system of consonantal signs, of similar shapes and style, which could be adapted at choice or convenience to represent the consonants of languages observed (or devised) by the Eldar. None of the letters had in itself a fixed value; but certain relations between them were gradually recognized.
(J. R. R. Tolkien: The Lord of the Rings, Return of the King, Appendix E)

# Michael S. Kaplan on 19 Nov 2005 9:12 AM:

Hi Petr -- that is much easier to do with constructed languages than with ones that exist in language communities, though. Isn't it? :-)

Orcmid -- let me give that some thought. I actually did try to tackle some harder to approach linguistic work after reading several Pinker books and found it more challenging without formal training. But there are some folks I can ask here about other recommendations. And I will try to post soon on Unicode and related references. :-)

