Every character has a story #18: U+06cc and U+064a (ARABIC LETTER FARSI YEH and ARABIC LETTER YEH)

by Michael S. Kaplan, published on 2006/02/14 05:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/14/531572.aspx

Some people may recall when I talked about how It does not always pay to be compatible. In that post I talked about how in Arabic there are four possible shapes for each character -- isolated, initial, medial, and final.

I'm going to talk about a slightly different aspect today, something seen in Persian (sometimes known as Farsi), a language spoken in Iran and by a huge community outside of Iran.

Now Persian uses the Arabic script in Unicode (and I am not going to discuss the position of some people that it should be disunified from Arabic in this post), with some important differences -- one of which I will talk about now.

It is ی (U+06cc, a.k.a. ARABIC LETTER FARSI YEH).

Now in Arabic there is ي (U+064a, a.k.a. ARABIC LETTER YEH).

You can see how they are different -- namely those two dots on the bottom of ARABIC LETTER YEH. Those two dots are not used in Persian, just as a yeh without the two dots is not a yeh in Arabic.
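If your font makes the two hard to tell apart, the code points and character names settle it. Here is a minimal sketch, assuming Python and its standard `unicodedata` module (the helper name `describe` is just for illustration):

```python
import unicodedata

FARSI_YEH = "\u06cc"   # the dotless one used in Persian
ARABIC_YEH = "\u064a"  # the one with two dots below

def describe(ch: str) -> str:
    """Return the code point and Unicode character name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in (FARSI_YEH, ARABIC_YEH):
    print(describe(ch))

# Two separate code points, so a plain string comparison sees two letters.
print(FARSI_YEH == ARABIC_YEH)
```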

Now if you look at the final form of the yeh, say comparing ئی (U+0626 U+06cc) to ئي (U+0626 U+064a), you will again see the difference of those two dots.

We are not really on much of a roll though, since both the initial and medial forms are identical, e.g. یؤ (U+06cc U+0624) versus يؤ (U+064a U+0624) and ئیؤ (U+0626 U+06cc U+0624) versus ئيؤ (U+0626 U+064a U+0624).
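Both letters also have the full set of four compatibility presentation forms, which you can find by walking the decomposition data. This is only a sketch, assuming Python's `unicodedata` module; the function name `presentation_forms` and the scanned ranges (the two Arabic Presentation Forms blocks) are my own choices. It tells you the forms exist, not what they look like; the dots (or lack of them) are up to the font.

```python
import unicodedata

JOINING_TAGS = {"<isolated>", "<final>", "<initial>", "<medial>"}

def presentation_forms(base: str) -> dict:
    """Map each joining form to the compatibility code point that
    decomposes to the single letter `base`."""
    target = f"{ord(base):04X}"
    forms = {}
    # Arabic Presentation Forms-A (U+FB50..U+FDFF) and -B (U+FE70..U+FEFF)
    for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only "<tag> XXXX" entries that point at exactly this letter.
        if len(parts) == 2 and parts[0] in JOINING_TAGS and parts[1] == target:
            forms[parts[0].strip("<>")] = f"U+{cp:04X}"
    return forms

print(presentation_forms("\u06cc"))  # ARABIC LETTER FARSI YEH
print(presentation_forms("\u064a"))  # ARABIC LETTER YEH
```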

Well, at least this sort of points out that we are talking about the same letter with two different traditions for writing in two different languages....

But as you can imagine there may be no end to the potential confusability between them, given the fact that they look the same in many cases and at some deep historical linguistic level they are the same letter no matter what Unicode does.

(Now some would say that ى (U+0649, ARABIC LETTER ALEF MAKSURA) adds to the confusion a bit and they are probably right, but we'll deal with that another time!)

And there is that issue I mentioned before with the people who would rather see a disunification of Persian and Arabic. The one that I was not going to talk about.

The apparent similarities to cases like Bengali vs. Assamese (discussed in A script, by any other name) or Bosnian vs. Croatian vs. Serbian are indeed valid, for some values of valid. The issue in all of these scenarios is that they would have completely valid concerns, if Unicode encoded languages.

But Unicode encodes scripts. And no one is really trying to claim that there is no common historical root between the languages.

So how do characters like U+06cc get encoded?

Well, usually the argument in favor of them is the need to have both Arabic language and Persian language in plain text, where the distinction needs to be made between them.

(For present purposes we will define "plain text" as any situation where you could not be reasonably expected to handle the situation with different fonts -- an issue that Unicode is especially sensitive to due to the fact that it so often speaks ill of "font hack" type solutions in 8-bit encoding standards.)

Now note that Windows Code Page 1256 does not include both U+064a and U+06cc: it does roundtrip U+064a, and it has U+06cc only as a best fit mapping.

So I guess you could say that Microsoft did unify the two letters, at least for non-Unicode applications.
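You can see the round-trip half of this from Python, whose `cp1256` codec is strict and does no best fitting at all. The sketch below then imitates a best fit step by hand. The `BEST_FIT` table is an assumption made for illustration (that U+06cc lands on the same byte as U+064a), not a copy of the actual Windows table.

```python
ARABIC_YEH = "\u064a"
FARSI_YEH = "\u06cc"

def strict_encodable(ch: str) -> bool:
    """True if the strict cp1256 codec has a slot for this character."""
    try:
        ch.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False

# ASSUMPTION: a hypothetical one-entry best fit table, folding
# FARSI YEH onto YEH before encoding.
BEST_FIT = {ord(FARSI_YEH): ARABIC_YEH}

def encode_with_best_fit(text: str) -> bytes:
    """Apply the hypothetical best fit table, then encode to cp1256."""
    return text.translate(BEST_FIT).encode("cp1256")

print(strict_encodable(ARABIC_YEH))  # has its own slot, so it round-trips
print(strict_encodable(FARSI_YEH))   # no slot of its own

# After the best fit step the distinction is gone: decoding gives back YEH.
print(encode_with_best_fit(FARSI_YEH).decode("cp1256") == ARABIC_YEH)
```

The last line is the "unification" in action: once the text has been through the code page, there is no way to tell which yeh you started with.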

But don't go running to build those non-Unicode Persian applications -- the fact is that there are other characters that are missing, and there is no space left in the code page.

Looking at the Unicode Character Database for a moment, both characters have been in Unicode since at least version 1.1, released in June of 1993.

Looking at the Microsoft collation tables, they are close to each other but not quite equal. That is something that maybe ought to be changed for the sake of people who recognize them as (to borrow Jeremy Piven's phraseology) "brothers from another mother" tongue. Then again, direct comparisons are generally the sort of thing that ought to be done in security situations, which the linguistic collation functions are not really suited for anyway....
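For search-style matching where you do want the two yehs treated alike, one option is to fold one onto the other yourself before comparing. The following is a rough sketch in Python, not what the Microsoft collation tables do; the names `fold_yeh` and `loosely_equal` and the choice to fold toward U+06cc are assumptions, and it deliberately leaves U+0649 alone. Note that Unicode normalization will not do this for you, since neither letter has a decomposition.

```python
import unicodedata

# ASSUMPTION: fold ARABIC LETTER YEH onto ARABIC LETTER FARSI YEH.
# The direction is arbitrary; only consistency matters.
YEH_FOLD = {0x064A: "\u06cc"}

def fold_yeh(text: str) -> str:
    """Normalize, then treat both yehs as one letter (for lookup only)."""
    return unicodedata.normalize("NFKC", text).translate(YEH_FOLD)

def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring the yeh distinction."""
    return fold_yeh(a) == fold_yeh(b)

persian_spelling = "\u0627\u06cc\u0631\u0627\u0646"  # "Iran" typed with U+06CC
arabic_keyboard = "\u0627\u064a\u0631\u0627\u0646"   # same word typed with U+064A

print(persian_spelling == arabic_keyboard)               # exact comparison
print(loosely_equal(persian_spelling, arabic_keyboard))  # folded comparison
```

Keep the exact comparison for anything security related, and use the folded one only where a human would expect the two spellings to match.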


This post brought to you by "ی" and "ي" (U+06cc and U+064a, a.k.a. ARABIC LETTER FARSI YEH and ARABIC LETTER YEH)

# Yousf on 14 Feb 2006 9:47 AM:

Thanks very much.. :)
this blog is very useful..

# Roozbeh Pournader on 15 Feb 2006 6:33 AM:

It may be interesting to know that Pashto actually uses both of the Yehs orthographically. One is used for the [i] sound, the other for an [aj] sound.

It also has three other Yehs, but that's a different story.

# Najibullah Nayel on 8 Nov 2007 2:52 AM:


I visited your site, I don't know, do you have some Pashto fonts, please send me.

# Michael S. Kaplan on 8 Nov 2007 2:57 AM:

I do not have Pashto fonts to send, no. Sorry!

# mohebbi on 20 Aug 2008 8:58 AM:

Very interesting!

It is very nice to me that someone outside Iran knows this difference!

