It's LIFO (last-in, first-out) in Hebrew

by Michael S. Kaplan, published on 2006/10/12 03:56 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/12/818619.aspx

You may remember the post from last March about Traditional versus modern sorts, which talked about two different practices used for the sorting of Jamo: one used in North Korea and the other in South Korea. Both had a reasonable basis for their choices (even though they conflicted with each other).

Let's see if I can juxtapose some of the concepts together and perhaps add a new dimension at the same time. :-)

First we will look at the order of the letters again, as I posted them (remember it is right to left!):

Now some of these letters are actually the 'final forms' and are only used at the end of a word. Marking them in red here, with their non-final forms marked in green.

This is the way that Hebrew is often taught (it is certainly the way I learned it). But there are other times it is shown in a slightly different way, with the final forms at the end:

Now as far as I know the text would never expect to be sorted this way, but in the system where Hebrew letters are given numerical values (more info here), the final forms are not given unique numbers, so this ordering can be useful since it means just one number per element, from א to ת.
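To make the "one number per element" point concrete, here is a rough sketch. It assumes the commonly taught standard values (1 to 9, then tens, then hundreds up to 400), which are not spelled out in this post, and the names `FINAL_TO_BASE`, `VALUE`, and `word_value` are made up for illustration:

```python
# Map each final form to its non-final form (code points from the Hebrew block).
FINAL_TO_BASE = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}

# The 22 non-final letters, alef through tav, in alphabet order.
base_letters = [chr(cp) for cp in range(0x05D0, 0x05EB)
                if chr(cp) not in FINAL_TO_BASE]

# Assumed standard values: 1-9, 10-90 by tens, then 100-400 by hundreds.
values = list(range(1, 10)) + list(range(10, 100, 10)) + [100, 200, 300, 400]
VALUE = dict(zip(base_letters, values))

# Final forms get no number of their own; they reuse the non-final value.
for final, base in FINAL_TO_BASE.items():
    VALUE[final] = VALUE[base]

def word_value(word: str) -> int:
    """Sum the letter values, ignoring anything that is not a Hebrew letter."""
    return sum(VALUE.get(ch, 0) for ch in word)

print(word_value("\u05D7\u05D9"))  # het + yod
```

With the finals folded in like this, the table has 22 distinct slots from א to ת, which is all that ordering needs.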

Now if one were trying to guess which order was being used by Unicode, one might think that the first order would be preferred.

Then after I told you this is not what Unicode does, you might ask (bemused at this point, and assuming some numerological fiend was behind the original proposal) whether the second order was chosen.

Then I'd tell you this was not the order they used either. In Unicode, the order is as follows (from U+05d0 for Alef to U+05ea for Tav):
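If you want to see that order for yourself, a small sketch along these lines will list it. It relies only on Python's standard `unicodedata` module; the helper name `is_final` is just something made up here:

```python
import unicodedata

# Walk the Hebrew letters in code point order, U+05D0 (alef) to U+05EA (tav).
letters = [chr(cp) for cp in range(0x05D0, 0x05EA + 1)]

def is_final(ch: str) -> bool:
    # The Unicode character names mark the final forms with the word "FINAL".
    return "FINAL" in unicodedata.name(ch)

for ch in letters:
    marker = "final" if is_final(ch) else ""
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):28}  {marker}")

# Collect the final forms so they can be compared with the letter that follows.
finals = [ch for ch in letters if is_final(ch)]
```

Each letter in `finals` sits at the code point immediately before its non-final counterpart, which is the last-in, first-out ordering of the title.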

The reason is pretty much a purely technical one with no real linguistic basis: in Unicode, each final form is encoded immediately before its non-final form (ך at U+05DA before כ at U+05DB, and so on). You see, from a collation standpoint, abcd always comes before abcd_ (where the blank is filled with any letter). So in a language like Hebrew, when that letter d is a final form, giving it the lower code point means it always comes before the d that is not final. A string comparison determining which string comes first will therefore come to its result faster in this case.
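Here is a minimal sketch of that argument, using Python's built-in string comparison (which compares code point by code point). The two words and the helper `first_difference` are chosen purely for illustration; this is not how any real collation engine is written:

```python
# Example words, written with escapes so the code points are explicit.
shalom = "\u05E9\u05DC\u05D5\u05DD"        # shin, lamed, vav, FINAL mem
shlomi = "\u05E9\u05DC\u05D5\u05DE\u05D9"  # shin, lamed, vav, mem, yod

def first_difference(a: str, b: str):
    """Index of the first differing position, or None if one string is a prefix."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

# With a separately encoded final mem, the comparison is settled at the mem.
print(shalom < shlomi)                   # final mem (U+05DD) < mem (U+05DE)
print(first_difference(shalom, shlomi))

# If the shorter word ended in a plain mem instead, it would be a pure prefix:
# the comparison has to run off the end and fall back on length.
stem = "\u05E9\u05DC\u05D5\u05DE"
print(first_difference(stem, shlomi))
```

Either way the shorter word sorts first; the encoding trick only changes where the comparison gets to stop.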

Now as a technical solution I find this to be entirely unsatisfying and more than a little bit lame. Why couldn't things have been set up so that if the word ends with a מ (mem) then it automatically replaces it with a ם (mem sofit, a.k.a. final mem)?

It is really (in my opinion) just as bad in its own way as the issue with Korean Jamo where the consonants are encoded twice (once for the initial form, once for the final form), a technique that really has nothing to do with language and is really grounded in a desire for a technical solution that does not take into account what any normal native language speaker recognizes as the truth.

In the Hebrew case in particular, I feel like we have let those native Hebrew speakers down for the encoding itself, in a small way. A way that makes anyone who looks in Character Map have to see this back-asswards ordering:

On the other hand, making the six year old look at the grid from left to right is probably just as annoying; I wonder if the Hebrew localized version of Character Map mirrors the grid?

Now there is one bit of happy news in all this, though (in my opinion), and that is in collation. The final form in Hebrew is given an identical primary weight to the non-final form; there is only a tertiary difference (so passing NORM_IGNORECASE will treat them as being equal, which, in a wider sense, they are).

And it turns out the UCA does something analogous here with a tertiary weight, too. :-)
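To picture what "same primary weight, tertiary difference" means, here is a toy sketch. It is emphatically not the Windows collation code or the UCA algorithm, the names `sort_key` and `equal_at_primary` are invented, and which form sorts first at the tertiary level is an arbitrary choice made here:

```python
# Final forms and the non-final letters they should be treated as equal to.
FINAL_TO_BASE = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}
_FOLD = str.maketrans(FINAL_TO_BASE)

def sort_key(word: str):
    """Two-level key: folded letters first, final-versus-not as a tie-breaker."""
    primary = word.translate(_FOLD)                        # finals folded away
    tertiary = tuple(ch in FINAL_TO_BASE for ch in word)   # only breaks ties
    return (primary, tertiary)

def equal_at_primary(a: str, b: str) -> bool:
    """Compare while ignoring the tertiary (final form) distinction."""
    return sort_key(a)[0] == sort_key(b)[0]

final_mem, mem = "\u05DD", "\u05DE"
print(equal_at_primary(final_mem, mem))        # same letter at the primary level
print(sort_key(final_mem) == sort_key(mem))    # still distinguishable overall
```

The point of the two levels is that the encoding order never enters into it: the letters sort as the same letter, and the final form only matters when everything else is equal.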

Because in the end any attempt to support collation in the order of the way the letters are encoded is going to be lame and/or stupid and/or wrong. They would have been better off making the language look better in the list than trying to get clever with the technical solutions....

The Hebrew localized Character Map does mirror the grid (which of course makes the left-to-right alphabets and numerical ranges look pretty odd -- sometimes there are no right answers to Bidi UI questions).

The trouble with automatic replacing of final letters is that there are exceptions. There's a final mem in the middle of a word in Isaiah 9:6, and many words from foreign languages are spelled with a medial כ or פ at the end of the word, e.g. "ג׳יפ", "a jeep", or the name of the Egyptian president חוסני מובארכ. The rule here is simple if you know a little Hebrew grammar: when the last letter of the word is a בג״ד כפ״ת letter with a dagesh and with no following vowel (which can never occur in native Hebrew words), don't use the final form even if one exists.
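A small sketch shows how a purely positional rule trips over the jeep example. The function `naive_finalize` is hypothetical and deliberately knows nothing about dagesh or loanwords:

```python
# Non-final letters that have a final form, and the final form of each.
BASE_TO_FINAL = {
    "\u05DB": "\u05DA",  # kaf   -> final kaf
    "\u05DE": "\u05DD",  # mem   -> final mem
    "\u05E0": "\u05DF",  # nun   -> final nun
    "\u05E4": "\u05E3",  # pe    -> final pe
    "\u05E6": "\u05E5",  # tsadi -> final tsadi
}

def naive_finalize(word: str) -> str:
    """Swap the last letter for its final form whenever one exists."""
    if word and word[-1] in BASE_TO_FINAL:
        return word[:-1] + BASE_TO_FINAL[word[-1]]
    return word

# A native word ending in mem is handled as hoped.
shalom = "\u05E9\u05DC\u05D5\u05DE"  # shin, lamed, vav, mem
print(naive_finalize(shalom)[-1] == "\u05DD")

# "jeep" is spelled with a non-final pe at the end: gimel, geresh, yod, pe.
jeep = "\u05D2\u05F3\u05D9\u05E4"
print(naive_finalize(jeep) == jeep)  # the naive rule alters the spelling
```

Handling this properly would need the information described above (whether the last letter carries a dagesh), which plain unpointed text does not include.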

If Hebrew were being encoded in Unicode today as a new script these problems could all be overcome by using zero-width joiners and non-joiners, but I'm not sure that those had been thought of back when the first Hebrew encodings were defined (I think by IBM in the 1950s).