The sad sad tale of the BARREE YEH

by Michael S. Kaplan, published on 2010/04/19 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/04/19/9997912.aspx


Warning: it will take me some time to get where I am trying to go here. If you lack patience you may want to skip it today....

Before you read this blog, you may want to look at a couple other blogs I have written:

That second blog in particular is important, where it describes some interesting issues of representation in Pashto.

I have been promising my old colleague Irfan (I'm not saying he is old, I am referring to how we go a long way back) since the end of 2007 that I would talk about an issue related to Urdu, and I am finally getting to it now.

His original mail:

As you probably know, unlike Arabic, Urdu uses yeh barree (ے). But what is not known by many non-Urdu users is that yeh barree can be used in the middle of a word, and when it is, it looks like the Arabic yeh, like this:

           سیب 

Currently, our shaping engine doesn’t do this and will show this word as

           سےب

Would it be possible to change this in the next version of Uniscribe? Also, do you know what other dependencies there are for it if it is decided to fix this issue in the future?

Ah, gotta love when script rules block language rules. Not!
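The broken rendering in the mail falls out of the joining classes alone. Here is a minimal sketch, assuming a toy table of just the four letters involved; the names `join_type` and `form_at` are made up for illustration, and a real shaping engine reads these classes from the Unicode Character Database rather than a hard-coded switch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Joining classes for a handful of letters only (illustrative table). */
typedef enum { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL } JoinType;
typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } Form;

static JoinType join_type(uint32_t cp) {
    switch (cp) {
    case 0x0628: /* BEH */
    case 0x0633: /* SEEN */
    case 0x06CC: /* FARSI YEH (choti yeh) */
        return JOIN_DUAL;
    case 0x06D2: /* YEH BARREE */
        return JOIN_RIGHT;
    default:
        return JOIN_NONE; /* not covered by this sketch */
    }
}

/* Positional form of text[i], with text in logical (typing) order. */
static Form form_at(const uint32_t *text, size_t len, size_t i) {
    assert(text != NULL && i < len);
    JoinType cur = join_type(text[i]);
    /* Connects to the previous letter only if that letter is dual-joining
       and this one can join at all (right- or dual-joining). */
    int joins_prev = i > 0 && join_type(text[i - 1]) == JOIN_DUAL &&
                     cur != JOIN_NONE;
    /* Connects to the next letter only if this one is dual-joining. */
    int joins_next = i + 1 < len && cur == JOIN_DUAL &&
                     join_type(text[i + 1]) != JOIN_NONE;
    if (joins_prev && joins_next) return FORM_MEDIAL;
    if (joins_prev) return FORM_FINAL;
    if (joins_next) return FORM_INITIAL;
    return FORM_ISOLATED;
}
```

For U+0633 U+06CC U+0628 the sketch gives initial, medial, final. Swap in U+06D2 for the middle letter and it gives initial, final, isolated: the yeh barree ends the joined run and the following letter stands alone, which is the second rendering shown in the mail.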

Peter contributed the Unicode side of the puzzle here, explaining the nature of the blockage to a "fix" here:

In terms of Unicode, we cannot cause 06D2 to have dual-joining behaviour: that is a normative property of the character, and if we changed the behaviour in Uniscribe we might in so doing break user documents. Jonathan describes a workaround which he believes is what users, in fact, already do: type 06CC “choti yeh” for initial and medial positions for both /e/ and /i/ vowels, and use 06D2 in final position when the barree yeh form is required.

It would be possible to propose that a new dual-joining barree yeh be added to Unicode. But whether that would be useful or not depends on multiple factors, including how existing data is encoded to deal with this, and how users are likely to enter data. For instance, suppose that new character were encoded at 06NN: if users continue to enter 06CC in initial and medial positions for both /e/ and /i/, so that the only change is to use 06NN in final position rather than 06D2, then nothing has been gained, and there may be some detrimental effects because old data is mismatched against new data and implementations. (E.g. users would enter search strings with 06NN and may not find old data with 06D2.)

And this does make the whole idea of adding a new character problematic.

It is hard to argue with usage, trying to cover what people are doing today and have been doing for years.

In relation to Kashmiri and Urdu (which both have this issue), Kamal and Jonathan were having a conversation themselves:

{Kamal}: According to Daniels & Bright (see Table 62.5, The Kashmiri Alphabet, p. 753), both Letter Barree Yeh with half-ring above & Letter Barree Yeh can occur in initial, medial, and final positions while 06D2 is classified as right-linking (i.e. having only final and separate shapes).

{Jonathan}: I suspect the situation with U+06D2 BARREE YEH in Kashmiri would be the same as in Urdu. This character represents an 'e' vowel, while U+06CC FARSI YEH (known as CHOTI YEH in Urdu) represents an 'i' vowel. However, in initial or medial positions, the same YEH is used for either vowel; in the (rare) case that the writer wishes to make the distinction clear, KASRA is added before YEH.

{Jonathan}: Urdu typists, therefore, instinctively type "choti yeh" (U+06CC) for all initial and medial YEH characters, whether these are functioning as 'i' or 'e' vowel sounds (or the 'y' semivowel), and only expect to type a different character when the special BARREE YEH final form is required. It could be argued that medial 'e' should logically be encoded as a (dual-joining) BARREE YEH, and in fact I implemented such a system (in pre-Unicode days), but in practice typists do not think this way.
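The typing convention Jonathan describes can be sketched as a fold over code points: any BARREE YEH that is followed by another letter becomes CHOTI YEH, and only a word-final one survives. This is a hypothetical illustration, not anything Uniscribe or an actual keyboard layout is known to do; the function name and the rough letter and mark ranges below are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHOTI_YEH  0x06CCu /* ARABIC LETTER FARSI YEH */
#define BARREE_YEH 0x06D2u /* ARABIC LETTER YEH BARREE */

/* Rough test for combining marks that may sit between two letters. */
static int is_arabic_mark(uint32_t cp) {
    return (cp >= 0x064B && cp <= 0x065F) || cp == 0x0670;
}

/* Rough test for an Arabic-script letter; deliberately approximate. */
static int is_arabic_letter(uint32_t cp) {
    return (cp >= 0x0621 && cp <= 0x064A) ||
           (cp >= 0x0671 && cp <= 0x06D3) ||
           (cp >= 0x0750 && cp <= 0x077F);
}

/* Rewrites in place every BARREE YEH that is followed by another letter
   (initial or medial position) as CHOTI YEH.
   Returns the number of replacements made. */
static size_t fold_nonfinal_barree_yeh(uint32_t *text, size_t len) {
    assert(text != NULL || len == 0);
    size_t changed = 0;
    for (size_t i = 0; i < len; i++) {
        if (text[i] != BARREE_YEH) continue;
        size_t j = i + 1;
        while (j < len && is_arabic_mark(text[j])) j++; /* skip marks */
        if (j < len && is_arabic_letter(text[j])) {
            text[i] = CHOTI_YEH;
            changed++;
        }
    }
    return changed;
}
```

The fold is one-way: once a medial 'e' has been stored as U+06CC, nothing in the text says it was ever an 'e'. That is the linguistic information the rest of this thread is arguing about.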

Given that usage, adding a new character at this point adds many strange legacy issues for all existing documents and trying to find information in them, in addition to complicating text input methods and spell checkers and so on.

And this is assuming that everyone moves to the new letter, when we know that not everyone would.

These kinds of "change the way things are done in technology, years after people have been doing it" are pretty much guaranteed to be complicated.

Irfan pointed out how limited the backcompat issue would be, to which Peter responded:

Responding to a couple of your comments:

 

> should be similar to yeh where the same keystroke will bring two different shapes for barree yeh, based on its place in the word

 

But there is also a problem of two vowels with identical shapes in non-final position, but different shapes in final position. Consider the words for ‘cats’ /billeeyan/ and ‘cows’ /gaa’een/. (My knowledge of Arabic script and of Urdu are very limited: my Romanizations may be a bit off; I’ll attempt to enter the words in Arabic script, but I probably won’t get them completely correct. And I’ll try to colour the yeh in red in each case, but Word won’t let me highlight just that letter in some cases.)

 

بِلّیان

گایٔن

 

For these words, will the user type the same yeh, or different yehs? Since there’s no difference in the initial forms or in the medial forms, one might expect users would enter these the same. (Would they really know they should enter these differently? And what would be displayed on the key caps?) From the singular forms of these words, though, it’s evident that the underlying vowels are different:

 

بِلّی

گاۓ

 

So, linguistically, in the plural forms of words, users should enter different yeh characters, but there’s a good chance they wouldn’t, let alone do that consistently.

Now this puts a cat (or a cow!) among the pigeons!

In talking about other differences that would potentially sway people to believing a new character should be added to Unicode, different sorting needs for the two letters had come up previously, but Irfan was even more interested in the example Peter used:

I believe that the examples you provided of the plurals for cat and cow make the case even stronger for having a dual-joining barree yeh. When writing these plurals, in the traditional sense of using pen and paper, the writer simply selects a glyph, without thinking whether it’s /i/ or /e/. However, there are two different categories of users when using a PC to write these plurals. The first group will simply use the glyph, like they do in the traditional method, and the second group will use /i/ or /e/ depending on which vowel the word has. Of course, the accessibility of these glyphs will have some effect on the usage too.

Since I belong to the latter group, and am a bit familiar with the writing habits of that group, I would say that this group, while writing words similar to cats and cows, will type plurals by typing the singular first, just like in English, which will be linguistically correct.

So, there are basically two groups: one enters yeh as a glyph and the other enters the yeh which is linguistically correct. We are already supporting the first group, and now we need to support the second group for linguistic reasons, as well as ensure that we continue to support the first group and don’t break legacy docs.

A bit more back and forth, and Peter found something interesting in newer proposals:

What is interesting for your case, though, is a pair of characters being added to Unicode 5.1:

077A ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE
077B ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT THREE ABOVE

These are used in the Burushaski language, as well as 06D2 and 06CC. In the proposal doc (06149-bashir-prop.pdf), the author displays these characters as dual-joining. Because of that, the first draft of the Unicode character properties files listed them as dual joining. A question was raised as to why they are dual joining when the character for skeletal form they are based on, 06D2, is right joining. This led to a subsequent doc from the proposal author (07264-arabic-shaping.pdf) in which essentially the same argument you are making is presented.

Very interesting!

These two new characters, being added in Unicode 5.1, are interesting since they weren't really added to any of the fonts as far as I could tell.

I stopped to make sure they made it in, and they did, into the Arabic Supplement block.

Yep, there they are. I'll blow them up for your convenience:

Okay, so if one were willing to ignore these little numbers above the letter, one could use these two new characters.

If fonts supported them.

Which they don't seem to be doing yet very well.

The conversation got away from me after that, and I don't think it came up in Unicode again though I'm not sure.

And now we come to the bigger problem. The one I really wanted to cover today.

Just the other day someone was talking about a bug in the Windows Fax component where it was not being mirrored under a Bidi language other than Hebrew or Arabic (a bug no one noticed since even in the three LIPs previously provided for Bidi languages -- Urdu, Pashto, Persian -- this component isn't localized). Sure enough, in true How To [NOT] detect that a locale is bidi form, the code was pretty much doing the following for its check:

// Flawed check: treats Arabic and Hebrew as the only bidi languages.
BOOL IsBidi(LCID lcid) {
    return (PRIMARYLANGID(lcid) == LANG_ARABIC ||
            PRIMARYLANGID(lcid) == LANG_HEBREW);
}

Oops.
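A minimal sketch of a check that asks the locale data instead of hard-coding language IDs. It assumes the documented meaning of bit 123 in the `LOCALE_FONTSIGNATURE` data ("layout progress: horizontal from right to left"); the helper name `UsbIsRightToLeft` is made up for illustration, and the Win32 part only compiles on Windows:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 123 of the 128-bit Unicode subset bitfield is bit 27 of the
   fourth 32-bit word (123 - 96 = 27). */
static int UsbIsRightToLeft(const uint32_t usb[4]) {
    assert(usb != NULL);
    return (usb[3] & 0x08000000u) != 0;
}

#ifdef _WIN32
#include <windows.h>

/* Reads the locale's font signature and tests the layout bit. */
BOOL IsBidi(LCID lcid) {
    LOCALESIGNATURE ls;
    uint32_t usb[4];
    if (!GetLocaleInfoW(lcid, LOCALE_FONTSIGNATURE, (LPWSTR)&ls,
                        (int)(sizeof(ls) / sizeof(WCHAR))))
        return FALSE; /* no data available: assume left-to-right */
    for (int i = 0; i < 4; i++) usb[i] = ls.lsUsb[i];
    return UsbIsRightToLeft(usb) ? TRUE : FALSE;
}
#endif
```

Written this way, a locale for Urdu, Pashto, Persian, or any bidi language added later is covered by its own data, with no list of language IDs to keep up to date.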

You see, the problem is that so much of Microsoft product, and of Unicode encoding, is done with the trailblazers like the Arabic language, without really taking enough time to recognize that of all of the languages that use the Arabic script, the Arabic language has the simplest requirements.

And all of these "can't ever change" properties in Unicode were assigned before the full measure of the various things people were doing with other Arabic script languages was grokked (though much of it was in Daniels and Bright already, it appears not all of it was).

When you imagine the number of cases like the BARREE YEH in Urdu and Kashmiri (which aren't getting their letter) and the number of random ones that are (some of which will get the correct joining behavior even if their original analogues do not), it makes approaching the whole block frustrating for the myriad of people who just want their language to work.

One could argue that some of the fault lies with the other languages added earlier, like Urdu, where the people pushing the inclusion lacked the full understanding of the consequence of these properties, but that is just an unconvincing strawman position.

It may be 100% true, but the languages added later have the same problems, so if those earlier languages had not been added there would be even fewer flexibilities available.

The Microsoft problem I mentioned above, which has similar causes, is more easily fixed since it is just a bug and there is no conformance requirement in keeping bits of code like that dumb. But the standards question is a thornier one, with no good answers for the many extensions to language that are asked of longstanding scripts.

The Arabic case is particularly bad given all of the various behaviors and properties that have to be defined....


Tom Gewecke on 19 Apr 2010 2:35 PM:

On my Mac I can get correct shaping of medial yeh barree when I use a special Urdu font like Nafees Nastaliq.  Does that not work in Windows?

Michael S. Kaplan on 19 Apr 2010 3:02 PM:

I do not have such a font to verify, though I can state that the font does not conform to Unicode here (so it can avoid Unicode's "bug" for this case!).

Tom Gewecke on 19 Apr 2010 4:56 PM:

If you want to give it a try sometime, the font is at

http://www.crulp.org/software/localization/Fonts/nafeesNastaleeq.html

Does the use of OpenType features to do something like this really create a non-conformity with Unicode?  

Michael S. Kaplan on 19 Apr 2010 5:12 PM:

I'm not 100% clear on the rules here, myself. I just know that it is not conformant to a particular bit of the UCD; what a language-specific exception means is something for standards people to explain. :-)

ErikF on 19 Apr 2010 5:25 PM:

Couldn't you use the ZWJ/ZWNJ mechanism to force the correct behaviour?  That way if you need two glyphs to connect they will, and if you need to separate them they won't; the specified linking would simply be the default.

If this makes no sense, that would be because I'm hardly an expert in Unicode and let Windows handle all the confusing work of handling text.  I try to have a basic understanding of this stuff and even then it's a challenge!

Michael S. Kaplan on 19 Apr 2010 7:32 PM:

No it makes sense, and to a point you are right. But the font has to know what to do when this is done and have the glyphs to show, and of course input methods would need to know about the random extra characters to add them....

Michael S. Kaplan on 19 Apr 2010 9:55 PM:

John Hudson attempted to post the following but was having browser troubles posting, so I'll do it here:

Michael, you wrote re. a font such as Nafees Nastaliq that displays yeh barree as dual-joining, "I can  state that the font does not conform to Unicode here."

Well, this is an interesting philosophical issue, because what the font is doing is performing *glyph* substitution, using contextual rules (which might be language system specific), and Unicode is a *character* encoding standard, and the joining properties of Arabic letter characters are ipso facto character properties. The font isn't doing anything to the character string and it isn't interfering with the character processing in the layout engine; rather, it is taking the output of the character processing and standard layout features and performing a glyph level display operation.

My view is that by the time one reaches this point in text layout, Unicode's mandate has ended. But then I'm a font maker and used to resolving display errors due to encoding problems at the glyph level.

Michael S. Kaplan on 19 Apr 2010 10:00 PM:

It is an unwieldy and slippery slope -- by that argument one could legally say a font where "A" looked like "B" and vice versa is conformant due to the work happening in the font, but this reductio ad absurdum is in fact not conformant. So let's just say it is a gray area. :-)

Though one I am in favor of; within the next day or two (or perhaps both!) I'll be doing some "let's not follow Unicode" thought experiments....

Random832 on 20 Apr 2010 4:21 AM:

Why isn't Arabic shaping a font issue?

I mean - I've seen what opentype can do with Latin cursive fonts. Simply selecting between initial/medial/final/isolated forms within an otherwise static font has limitations it does not (e.g. all pairs have to join at the same vertical position within a script; any two letters which can each join to a letter to its left/right have to join to each other). And making it be a font issue would allow the problem to be solved with separate fonts for different languages just like CJK has separate fonts per language. (And it would set a precedent of keeping joining information out of the normative properties, to allow systems with traditional shaping models to use separate engines per language)

And, for that matter - considering that Unicode does not prescribe the exact appearances of glyphs - what exactly is a testable measure of conformance with the supposedly normative "this character shall not join to characters to its left" statement anyway? If when these characters appear next to each other they just happen to be given glyphs which just happen to line up vertically with one another so as to give the appearance of joining (as is sometimes the case in Latin fonts anyway).... And if they happen to be given those same shapes when appearing next to a ZWJ and not when there is a ZWNJ between them...

Who's to say that's "joining"? After all, Latin doesn't have any joining characters at all, and there are plenty of Latin fonts which look like that (modulo perhaps the respect for ZWJ/NJ)

Tom Gewecke on 20 Apr 2010 6:06 AM:

Regarding the use of ZWJ, it seems to me that would be non-conforming with Unicode unless it were documented in the Arabic chapter of the Standard.

Michael S. Kaplan on 20 Apr 2010 7:35 AM:

See today's (4/20) blog, where I point out that sometimes not listening to Unicode for particular issues pays off.

I'm not against that notion of doing the right thing for the language, but I believe it is better to own it, saying "yes I know about those properties but they break Urdu and Kashmiri so I ignore it for those languages, for that character."

That is a firm approach one can be proud of -- and it can eventually drive the standard to describe it since it is de facto what is being done.

John Hudson on 20 Apr 2010 9:37 PM:

"John Cowan attempted to post the following..."

Actually, that was me, John Hudson. I'd hate for John Cowan to get a reputation for the kind of apparently outrageous opinions that I hold.

"...by that argument one could legally say a font where "A" looked like "B" and vice versa is conformant..."

Well, it would indeed be conformant to Unicode character encoding if the shape change took place at the glyph processing level. It wouldn't be conformant to the Latin alphabet, but that's a cultural convention not a technical standard.

There are some clever, playful OpenType fonts that change certain words into other words, e.g. swear words into the names of vegetables. No one expects such fonts to be used for typical text, and they exist mainly as a form of play and to make, perhaps, an important point about the distinction between text encoding and text display.

Michael S. Kaplan on 20 Apr 2010 11:15 PM:

Sorry John -- I fixed that now. :-)

Well, in my opinion there is a better way to approach it, as mentioned in the earlier comments -- just choose to not conform when Unicode can't do something due to property stability issues and admits their hands are tied!

