Collation can actually be linguistic

by Michael S. Kaplan, published on 2006/02/12 18:05 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/12/530610.aspx

I pointed out in the post Some sort of order to collation that it is easy to dismiss linguistic issues when one is thinking about collation. As Steven Pinker pointed out in The Language Instinct:

But it is not as simple as that. Looking at the beginnings of the alphabets for Hebrew:

and so on, we are as struck by the similarities (e.g. in Hebrew there is actually both a בּ (bet) and a ב (vet) that show up after the Alef, just as there is both a Бб (be) and a Вв (ve) after the А in Russian) as by the differences (e.g. there are two 'v' sounds in Hebrew, neither of which is anywhere near the 'v' in English -- or that in Hebrew א (alef) is silent while in most other languages it is not).

Obviously there is a commonality here that is not accidental, but just as obviously the actual letters present and the order of those letters have changed over time in different languages.

There are many possible reasons for change here, and looking at the differences between the original order that the Canaanites gave us and the order used by any language today, many of the reasons for changes in order have either an orthographic or a phonemic basis.

This is especially true as we look at languages that pick up the use of a script such as Latin, Cyrillic, or Arabic and find the need to add letters. They obviously need a place to put those additional letters within their alphabetic order, and there are obvious reasons to choose a linguistic basis for that ordering.

Now this ordering may conflict with the notion that a user of the script, but not of the language, may have of a letter -- thus ڇ (tcheheh) will seem to many Arabic language readers like a ح (hah) with four funny dots in it, similar to how I (as a speaker of English) might look at ṻ (u with macron and diaeresis) as a u with some funny smudges on top of it.

But in the context of both the English and Arabic languages, we are both 100% correct.

And while my decision about where to place them in an ordered list will likely be after ح and u, on the arbitrary basis that they look a bit like them, it is not really going to be the same for languages that might make use of the characters.

Where they might be placed in the alphabetical order of a language that makes use of ڇ or ṻ is likely to be very different. Since our answer was based on ignorance of what the letter is, it only makes sense that their knowledge of the letter and what it does will guide their notion of where it belongs alphabetically.
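To make the contrast concrete, here is a minimal Python sketch of the two viewpoints for ṻ. It is an illustration only, not how any real collation library works: the helper names base_letter_key and make_alphabet_key are made up for this example, and HYPOTHETICAL_ALPHABET is an invented alphabet that treats ṻ as a separate letter filed after i, not the order of any actual language. The first key files ṻ right after u because it looks like one; the second asks the (invented) alphabet where the letter belongs.

```python
import unicodedata

U_MACRON_DIAERESIS = "\u1E7B"  # ṻ, LATIN SMALL LETTER U WITH MACRON AND DIAERESIS


def base_letter_key(word):
    """'Ignorant' view: drop combining marks so that ṻ files next to plain u."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word breaks ties, so plain "ua" comes before "ṻa".
    return (stripped, word)


def make_alphabet_key(alphabet):
    """'Knowledgeable' view: rank each letter by its position in a given alphabet."""
    rank = {letter: position for position, letter in enumerate(alphabet)}

    def key(word):
        composed = unicodedata.normalize("NFC", word)
        # Letters outside the alphabet sort after every known letter.
        return [rank.get(ch, len(rank)) for ch in composed]

    return key


# HYPOTHETICAL alphabet, invented for this sketch: ṻ is its own letter after i.
HYPOTHETICAL_ALPHABET = (
    list("abcdefghi") + [U_MACRON_DIAERESIS] + list("jklmnopqrstuvwxyz")
)

words = ["va", U_MACRON_DIAERESIS + "a", "ka", "ua", "ia"]

by_code_point = sorted(words)                        # ia, ka, ua, va, ṻa
by_base_letter = sorted(words, key=base_letter_key)  # ia, ka, ua, ṻa, va
by_alphabet = sorted(words, key=make_alphabet_key(HYPOTHETICAL_ALPHABET))  # ia, ṻa, ka, ua, va

print(by_code_point)
print(by_base_letter)
print(by_alphabet)
```

The same list comes out in three different orders, and only the last one reflects any knowledge of what the letter is. A real implementation would more likely rely on a collation library with per-language tailorings than on a hand-built table like this one.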

This is an issue that I will be posting about in the future, with some more specific examples, giving both the "ignorant" and "knowledgeable" viewpoints....

This is an interesting topic. For one thing, these scripts (Hebrew, Latin, Greek, Cyrillic, and Arabic) were all derived from the old Phoenician, so they all inherited the same order. (In fact, it is my belief that the Phoenicians invented the concept of "alphabetical order".) Are there any other languages that had a defined collation order before standardized orders were required by governments?

The thing is, letters that were not used sometimes were left out and new letters were sometimes added. This is interesting because it raises the issue of where new letters are to be added.

Some people added new letters only where old ones were deleted. For example, we have 'G' in the current position because that's where 'Z' was. Since the 'Z' wasn't being used, the 'G' (derived from 'C') was put in its place in order to maintain the alphabetical order.

In other cases new letters were added to the end of the alphabet. The 'Z' is last in English because it was later borrowed back from Greek (along with 'Y').

And the last option is to put new letters after what they are derived from, which is why 'J' (the consonant form of 'I') is right after 'I', and 'V' & 'W' are after 'U'.

See http://www.evertype.com/standards/wynnyogh/thorn.html for an interesting discussion of where to put the Latin letter thorn.