by Michael S. Kaplan, published on 2007/08/12 13:53 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/12/4352905.aspx
Collation of the Arabic script in Windows has really had its ups and downs over the years.
And customers tend to notice that sort of thing -- there are few things as visible as software screwing up the alphabet.
Some of the results were beyond outrageous -- letters being treated like they were symbols, symbols being treated like they were letters, the wrong flag having the wrong (well, unintuitive) results, and so on.
Of course having people on the team who speak Arabic or Farsi[1] or Urdu natively never made that any easier -- since they would of course regularly point out the shortcomings in this regard, and eagerly suggest that they be fixed lest we forget that things were kind of broken. :-)
In Vista we finally fixed many of these issues.
In a singularly unique meeting, I had all of the native speakers of any Arabic script language I could find, there to help me apply the principles embodied in "How does Microsoft assign new collation weights?" to the Arabic script once and for all:
Sometimes, there is an actual ordering for a specific language we support and it does not conflict with any of the weights that are already there. When that happens, the new characters can simply be inserted, using existing space in the weight table. Other times, there is an actual ordering for a specific language we support and it does conflict with weights that are already there. In those cases, we put it in an exception table. But of course we have to add it somewhere in the default table too, so we end up doing one of a few different things with code points not already there:
- We may add it in the place that one of those default table languages might expect it due to its appearance;
- We may add it in a place consistent with how other characters have been added in (apparently) similar situations;
- We may add it to the end of the list of characters in the script.
Still other times, we may not have a specific language that needs the script but are trying to fill out a subrange of things in Unicode, in which case any of those previous three mechanisms might be used.
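To make the default table / exception table split concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the weights, the table names, and the language name `lang1` are made up, and this is not how the Windows tables are actually stored. It only shows the idea that a language-specific exception overrides the default weight where the two conflict, while every character still has a place in the default table.

```python
# Hypothetical default weight table: every character gets a weight here.
DEFAULT_WEIGHTS = {"a": 10, "b": 20, "c": 30, "x": 40}

# Hypothetical exception tables: per-language overrides, used only where the
# language's expected ordering conflicts with the default table.
EXCEPTIONS = {
    "lang1": {"x": 15},  # this made-up language wants "x" between "a" and "b"
}


def weight(ch, language=None):
    """Return the exception weight for the language if one exists, else the default."""
    overrides = EXCEPTIONS.get(language, {})
    return overrides.get(ch, DEFAULT_WEIGHTS[ch])


def sort_key(text, language=None):
    """Build a simple one-level sort key from per-character weights."""
    return [weight(ch, language) for ch in text]


words = ["b", "x", "a", "c"]
default_order = sorted(words, key=sort_key)
lang1_order = sorted(words, key=lambda w: sort_key(w, "lang1"))
print(default_order)
print(lang1_order)
```

In the default ordering "x" lands at the end of the list, while the hypothetical `lang1` ordering moves it up without touching anyone else's results.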
A recent message in the Microsoft VOLT users community from Vladimir made me feel pretty good and bad at the same time. Inside the message there was the following text:
...now we get much better alphabetical sorting for Persian (but not perfect one: alef + madda should be placed before alef).
This told me two things -- 1) that things really had improved overall (this is a message we have gotten from many other people!), and 2) there was at least one mistake in there in the way
آ (U+0622, a.k.a. ARABIC LETTER ALEF WITH MADDA ABOVE)
is sorted in Persian (and perhaps in other non-Arabic language collations that use the Arabic script, though no feedback has been received on that point).
Basically, it needs to come before
ا (U+0627, a.k.a. ARABIC LETTER ALEF)
with a primary distinction rather than just after it with a secondary one.
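The difference between those two kinds of distinction is easier to see with a toy example. The Python sketch below is only an illustration under stated assumptions: the two-level sort key and all of the weight values are invented, and it does not reproduce the real Windows sort key format. Secondary weights are consulted only when all primary weights tie, so with a secondary-only distinction a word starting with ALEF WITH MADDA ABOVE is interleaved among the ALEF words; with a lower primary weight it sorts ahead of all of them.

```python
ALEF_MADDA = "\u0622"  # ARABIC LETTER ALEF WITH MADDA ABOVE
ALEF = "\u0627"        # ARABIC LETTER ALEF
BEH = "\u0628"         # ARABIC LETTER BEH
TEH = "\u062A"         # ARABIC LETTER TEH

# Hypothetical (primary, secondary) weights per character.
# Model 1: ALEF WITH MADDA shares ALEF's primary weight and differs only
# at the secondary level, so it sorts just after ALEF on a tie.
SECONDARY_ONLY = {ALEF: (1, 0), ALEF_MADDA: (1, 1), BEH: (2, 0), TEH: (3, 0)}

# Model 2: ALEF WITH MADDA gets its own, lower primary weight, so it
# sorts before ALEF regardless of what follows.
PRIMARY = {ALEF_MADDA: (1, 0), ALEF: (2, 0), BEH: (3, 0), TEH: (4, 0)}


def sort_key(word, table):
    """Compare all primary weights first, then all secondary weights."""
    primaries = tuple(table[ch][0] for ch in word)
    secondaries = tuple(table[ch][1] for ch in word)
    return (primaries, secondaries)


words = [ALEF + TEH, ALEF_MADDA + BEH, ALEF + BEH]
secondary_order = sorted(words, key=lambda w: sort_key(w, SECONDARY_ONLY))
primary_order = sorted(words, key=lambda w: sort_key(w, PRIMARY))
```

Under the secondary-only model the ALEF WITH MADDA word ends up between the two ALEF words; under the primary model it comes first, which is the behavior the feedback asks for in Persian.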
It is easy to blame me since it was my checkin (and I do, this is quite unfortunate) or the native Persian speakers for not noticing the difference in time (and I don't, as it is hard to capture all of the differences in a collation such that they can point them out, it really is).
But RCA (root cause analysis) really has been done here in that the exact cause for the regression is known and understood; there is no real need to assign blame (beyond the blame I will assign to myself, as I mentioned!). It will just be a bug to fix for Persian[2] in the future which we traded in the process of fixing approximately nine other bugs.
Hopefully the overall improvement will mitigate this....
1 - Farsi is what everyone called it then and this issue had not yet been raised. I don't know why we even bother to pretend that we have no direct contact with Iran if we do have contact with the expat community and they do have contact with Iran and start nagging us immediately after pronouncements are made!
2 - And possibly also Urdu, Pashto, and other Arabic script languages if they have the same expectation?
This post brought to you by آ (U+0622, a.k.a. ARABIC LETTER ALEF WITH MADDA ABOVE)
# Vladimir Ivanov on 13 Aug 2007 5:01 PM:
Dear Michael,
or Mishka (if you allow me to use such an endearing Russian word). I'm really very happy to hear from you. I haven't heard from you since the discussion on unicode.org disappeared. I have learned much from it and especially from your observations.
Thank you for your comments. It is as useful and interesting as it had been before.
Your Vladimir