by Michael S. Kaplan, published on 2006/10/12 03:56 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/12/818619.aspx
You may remember my post last March, Traditional versus modern sorts, which talked about two different practices used for the sorting of Jamo -- one used in North Korea and the other in South Korea -- and how both had a reasonable basis for their choices (even though they conflicted with each other).
And you may also remember just yesterday when I posted You got your latins in my hebrew! No, you got your hebrew in my latins!, a post that was showing the Hebrew alphabet over and over again using different fonts.
Let's see if I can juxtapose some of the concepts together and perhaps add a new dimension at the same time. :-)
First we will look at the order of the letters again, as I posted them (remember it is right to left!):
אבגדהוזחטיכךלמםנןסעפףצץקרשת
Now some of these letters are actually the 'final forms' and are only used at the end of a word. The five final forms are ך, ם, ן, ף, and ץ; their non-final forms are כ, מ, נ, פ, and צ.
This is the way that Hebrew is often taught (it is certainly the way I learned it). But there are other times it is shown in a slightly different way, with the final forms at the end:
אבגדהוזחטיכלמנסעפצקרשתךםןףץ
Now as far as I know, text would never be expected to sort this way, but in the system where Hebrew letters are given numerical values (more info here), the final forms are not given unique numbers, so this ordering can be useful since it means just one number per element, from א to ת.
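For the curious, here is a minimal Python sketch of that one-number-per-letter idea. It assumes the usual values (1 to 9, then 10 to 90, then 100 to 400) and it assumes a final form simply borrows the value of its non-final form, as described above. The names letter_value and word_value are made up for illustration.

```python
import unicodedata

# All 27 code points in the Hebrew letter range, alef (U+05D0) to tav (U+05EA).
ALL_LETTERS = [chr(cp) for cp in range(0x05D0, 0x05EB)]

# The 22 letters that are not final forms; filtering keeps them in alphabetical order.
BASE_LETTERS = [ch for ch in ALL_LETTERS if "FINAL" not in unicodedata.name(ch)]

# Assumed values: units, then tens, then hundreds up to 400.
VALUES = list(range(1, 10)) + list(range(10, 100, 10)) + list(range(100, 500, 100))
LETTER_VALUE = dict(zip(BASE_LETTERS, VALUES))

# Map each final form to its non-final form by dropping "FINAL " from the Unicode name.
FINAL_TO_BASE = {
    ch: unicodedata.lookup(unicodedata.name(ch).replace("FINAL ", ""))
    for ch in ALL_LETTERS
    if "FINAL" in unicodedata.name(ch)
}

def letter_value(ch):
    """Value of one letter; a final form gets the value of its non-final form."""
    return LETTER_VALUE[FINAL_TO_BASE.get(ch, ch)]

def word_value(word):
    """Sum of letter values; raises KeyError on anything that is not a Hebrew letter."""
    return sum(letter_value(ch) for ch in word)

# shin + lamed + vav + final mem = 300 + 30 + 6 + 40
print(word_value("\u05e9\u05dc\u05d5\u05dd"))  # 376
```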
Now if one were trying to guess which order was being used by Unicode, one might think that the first order would be preferred.
Then after I told you this is not what Unicode does, you might ask (bemused at this point, and assuming some numerological fiend was behind the original proposal) whether the second order was chosen.
Then I'd tell you this was not the order they used either. In Unicode, the order is as follows (from U+05d0 for Alef to U+05ea for Tav):
אבגדהוזחטיךכלםמןנסעףפץצקרשת
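If you would rather not take my word for it, a short Python sketch using the standard unicodedata module can list the range and pick out each final form together with the letter that follows it. The function names hebrew_letters and final_pairs are just illustrative.

```python
import unicodedata

def hebrew_letters():
    """Return (code point, Unicode name) for U+05D0 through U+05EA."""
    return [(cp, unicodedata.name(chr(cp))) for cp in range(0x05D0, 0x05EB)]

def final_pairs():
    """Pair the name of each final form with the name of the next code point."""
    letters = hebrew_letters()
    pairs = []
    for (cp, name), (next_cp, next_name) in zip(letters, letters[1:]):
        if " FINAL " in name:
            pairs.append((name, next_name))
    return pairs

for cp, name in hebrew_letters():
    print(f"U+{cp:04X} {name}")

for final_name, next_name in final_pairs():
    print(f"{final_name} is followed by {next_name}")
```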
Huh? Why the hell are the final forms listed before the non-final ones?
Well, it is not only that way in Unicode; you can see the same thing in Windows code page 1255, OEM code page 862, and ISO code page 8859-8.
Which is of course not an answer to the question, at all, now is it? :-)
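The code page claim is easy to check as well. This is a rough sketch that leans on Python's built-in codecs (cp1255, cp862, and iso8859_8 are Python's names for them, not anything from Windows itself): it encodes the letters in Unicode order and looks at whether the resulting byte values are already in ascending order. The helper names are hypothetical.

```python
# The 27 Hebrew letters in Unicode code point order, alef to tav.
LETTERS = "".join(chr(cp) for cp in range(0x05D0, 0x05EB))

CODECS = ("cp1255", "cp862", "iso8859_8")

def byte_values(codec):
    """Encode the letters with one legacy codec and return the byte values."""
    return list(LETTERS.encode(codec))

def same_order_as_unicode(codec):
    """True if the code page lists the letters in the same order Unicode does."""
    values = byte_values(codec)
    return values == sorted(values)

for codec in CODECS:
    values = byte_values(codec)
    print(codec, hex(values[0]), hex(values[-1]), same_order_as_unicode(codec))
```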
The answer is pretty much a purely technical one with no real linguistic basis. You see, from a collation standpoint, abcd always comes before abcd_ (where the blank is filled with any letter). In a language like Hebrew, the d that ends the shorter word is a final form while the d in the longer word is not, so encoding the final form first means it already compares as less than the non-final one. And therefore a string comparison determining which string comes first will come to its result faster in this case: it can stop at that letter.
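To make that concrete, here is a small Python sketch. Python compares strings by code point, so it stands in for a plain binary comparison. The word pair and the helper fold_finals are my own choices for illustration; the point is only that the binary order of the pair matches the order you would get if the final form were folded to its non-final form and the shorter word were treated as a prefix.

```python
# Code points of the five final forms; each non-final form is the next code point.
FINALS = {0x05DA, 0x05DD, 0x05DF, 0x05E3, 0x05E5}

def fold_finals(text):
    """Replace each final form with its non-final form (code point + 1)."""
    return "".join(chr(ord(ch) + 1) if ord(ch) in FINALS else ch for ch in text)

shorter = "\u05e9\u05dc\u05d5\u05dd"        # shin, lamed, vav, final mem
longer = "\u05e9\u05dc\u05d5\u05de\u05d9"   # shin, lamed, vav, mem, yod

# Binary comparison: decided at the fourth letter, final mem < mem.
print(shorter < longer)

# Folded comparison: the shorter word is a strict prefix of the longer one.
print(fold_finals(longer).startswith(fold_finals(shorter)))
print(fold_finals(shorter) < fold_finals(longer))
```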
Now as a technical solution I find this to be entirely unsatisfying and more than a little bit lame. Why couldn't things have been set up so that if a word ends with a מ (mem) then the system automatically replaces it with a ם (mem sofit, a.k.a. final mem)?
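A deliberately naive Python sketch of that kind of automatic replacement might look like the following. It assumes unpointed text, treats any run of characters in U+05D0..U+05EA as a word, and only touches the last letter of each run. The name apply_final_forms is hypothetical, and nothing here reflects how any real input method or renderer works.

```python
import re

# Map each non-final letter that has a final form to that final form.
# In Unicode the final form is the code point just before the non-final one.
BASE_TO_FINAL = {chr(cp + 1): chr(cp) for cp in (0x05DA, 0x05DD, 0x05DF, 0x05E3, 0x05E5)}

# A "word" here is just a run of Hebrew letters (no points, no punctuation).
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")

def apply_final_forms(text):
    """Naively swap the last letter of every Hebrew word for its final form."""
    def fix_word(match):
        word = match.group(0)
        return word[:-1] + BASE_TO_FINAL.get(word[-1], word[-1])
    return HEBREW_WORD.sub(fix_word, text)

# shin, lamed, vav, mem becomes shin, lamed, vav, final mem
print(apply_final_forms("\u05e9\u05dc\u05d5\u05de"))
```

Being this naive, the sketch has no notion of exceptions: it will happily put a final form on a loanword or an abbreviation that is conventionally written without one.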
It is really (in my opinion) just as bad in its own way as the issue with Korean Jamo where the consonants are encoded twice (once for the initial form, once for the final form), a technique that really has nothing to do with language and is really grounded in a desire for a technical solution that does not take into account what any normal native language speaker recognizes as the truth.
In the Hebrew case in particular, I feel like we have let those native Hebrew speakers down for the encoding itself, in a small way. A way that makes anyone who looks in Character Map have to see this back-asswards ordering:
This is an ordering that even a six year old knows is wrong!
On the other hand, making the six year old look at the grid from left to right is probably just as annoying; I wonder if the Hebrew localized version of Character Map mirrors the grid?
Now there is one bit of happy news in all this, though (in my opinion), and that is in collation. The final form in Hebrew is given an identical primary weight to the non-final form; there is only a tertiary difference (so passing NORM_IGNORECASE will treat them as being equal which, in a wider sense, they are).
And it turns out the UCA does something analogous here with a tertiary weight, too. :-)
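Here is a toy Python sketch of what "same primary weight, tertiary difference" means for a sort key. The weights are invented for illustration (the code point of the non-final form as the primary weight, 0 or 1 as the tertiary weight) and are not the values used by Windows or by the UCA; putting the non-final form first at the tertiary level is likewise an assumption. The ignore_tertiary switch only stands in for the effect of a flag like NORM_IGNORECASE; it does not call any Windows API.

```python
# Code points of the five final forms; each non-final form is the next code point.
FINALS = {0x05DA, 0x05DD, 0x05DF, 0x05E3, 0x05E5}

def weights(ch):
    """Return made-up (primary, tertiary) weights for one character."""
    cp = ord(ch)
    if cp in FINALS:
        return (cp + 1, 1)   # same primary as the non-final form, tertiary differs
    return (cp, 0)

def sort_key(text, ignore_tertiary=False):
    """Build a two-level key: all primary weights first, then all tertiary weights."""
    primary = tuple(weights(ch)[0] for ch in text)
    if ignore_tertiary:
        return (primary,)
    tertiary = tuple(weights(ch)[1] for ch in text)
    return (primary, tertiary)

mem, final_mem = "\u05de", "\u05dd"
print(sort_key(mem, ignore_tertiary=True) == sort_key(final_mem, ignore_tertiary=True))
print(sort_key(mem) < sort_key(final_mem))
```

Because the primary level is compared in full before the tertiary level is consulted, the final versus non-final distinction only ever acts as a tie-breaker between otherwise identical strings.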
Because in the end any attempt to support collation in the order in which the letters are encoded is going to be lame and/or stupid and/or wrong. They would have been better off making the language look better in the list than trying to get clever with the technical solutions....
This post brought to you by ץ (U+05e5, a.k.a. HEBREW LETTER FINAL TSADI)
# Simon Montagu on 12 Oct 2006 8:23 AM:
The Hebrew localized Character Map does mirror the grid (which of course makes the left-to-right alphabets and numerical ranges look pretty odd -- sometimes there are no right answers to Bidi UI questions).
The trouble with automatic replacing of final letters is that there are exceptions. There's a final mem in the middle of a word in Isaiah 9:6, and many words from foreign languages are spelled with a medial כ or פ at the end of the word, e.g. "ג׳יפ", "a jeep", or the name of the Egyptian president חוסני מובארכ. The rule here is simple if you know a little Hebrew grammar: when the last letter of the word is a בג״ד כפ״ת letter with a dagesh and with no following vowel (which can never occur in native Hebrew words), don't use the final form even if one exists.
If Hebrew were being encoded in Unicode today as a new script these problems could all be overcome by using zero-width joiners and non-joiners, but I'm not sure that those had been thought of back when the first Hebrew encodings were defined (I think by IBM in the 1950s).
# Michael S. Kaplan on 12 Oct 2006 10:48 AM:
Fair enough -- though how many native Hebrew speakers think that order looks right? :-)
# Joel Spolsky on 15 Oct 2006 5:15 AM:
A more common place you'll see non-final forms at the end of a word is in abbreviations... יבמ is IBM, for example, and they did the first encoding :)
referenced by
2007/09/02 Acronyms vs. initialisms, across languages