Why don't all the half forms sort right?

by Michael S. Kaplan, published on 2006/09/25 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/25/768715.aspx

George asked via the Contacting Me... link:

I tried to use the Unicode method of creating half forms in Devanagari on Windows. It worked, but then once I did it the sorting seemed to not work correctly for the half form. What am I doing wrong?

George, you did nothing wrong; this one is all us.

First, I should explain for everyone else what we are talking about, and what you meant when you mentioned 'the Unicode method of creating half forms in Devanagari'.

It starts with U+200c and U+200d, the ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER that I have discussed previously, and the effect that these characters can have in Indic scripts.

The effect is best described in the Unicode FAQ on Indic Scripts and Languages and its question #17 (I cannot find on Unicode charts the "half forms" of Devanagari letters (or any other Indic script). These characters are needed to form words such as "patni".)

The three forms, which you will be able to see if you have a conformant browser, are:

त्न U+0924 U+094d U+0928 -- Devanagari tna using the tna ligature

त्‍न U+0924 U+094d U+200d U+0928 -- Devanagari tna with a half ta and a full na

त्‌न U+0924 U+094d U+200c U+0928 -- Devanagari tna with a full ta, a visible virama, and a full na

(if you don't see three different forms then you can look at that Unicode FAQ link)
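
If the rendering does not cooperate, the three sequences can also be inspected directly. Here is a minimal Python sketch that builds each string from the code points listed above and prints the Unicode name of every character; the `FORMS` dictionary and the `describe` helper are names made up for this illustration.

```python
import unicodedata

# The three spellings of "tna" listed above, built from their code points.
# FORMS and describe() are illustrative names, not part of any API.
FORMS = {
    "tna ligature": "\u0924\u094d\u0928",
    "half ta + full na (ZWJ)": "\u0924\u094d\u200d\u0928",
    "full ta + visible virama + full na (ZWNJ)": "\u0924\u094d\u200c\u0928",
}

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for label, text in FORMS.items():
    print(label, text)
    for code_point, name in describe(text):
        print("   ", code_point, name)
```

The only difference between the three strings is the optional joiner sitting between the virama and the na.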

So that part is easy enough.

And one part of the collation story on Windows -- the fact that ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER are both characters that intentionally have no weight -- is also there, as one would expect.
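
To make "no weight" concrete, here is a small Python sketch of what it implies for the three tna spellings. It is only an illustration of the expectation, not how Windows collation is implemented: `ZERO_WEIGHT` and `strip_zero_weight` are hypothetical names, and simply deleting the joiners stands in for a collation that assigns them no weight.

```python
# Illustration only: deleting the joiners stands in for "these characters
# have no weight". This is NOT the Windows collation algorithm.
ZERO_WEIGHT = {"\u200c", "\u200d"}  # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER

def strip_zero_weight(text):
    """Drop the zero-weight joiners so only weighted characters remain."""
    return "".join(ch for ch in text if ch not in ZERO_WEIGHT)

ligature = "\u0924\u094d\u0928"
half_form = "\u0924\u094d\u200d\u0928"
explicit_virama = "\u0924\u094d\u200c\u0928"

# A code point comparison sees different strings...
print(ligature == half_form)

# ...but with the joiners ignored, all three spellings reduce to the same thing.
keys = {strip_zero_weight(s) for s in (ligature, half_form, explicit_virama)}
print(len(keys) == 1)
```

In other words, with the joiners carrying no weight, one would expect all three spellings to land in the same place in a sorted list.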

The place where all is not perfectly well is in the compression part -- and there are many defined compressions for languages like Hindi when consonants and independent vowels combine with Candrabindu, Anusvara, Visarga, and Nukta. George must have been trying to get a half form with one of these compression cases like with a nukta, which will work, although it will make it sort in a slightly incorrect way. :-(

As luck would have it, this is not going to be a very common problem since it will sort very close to where it should be and you would only notice the difference if you were doing a test for equality or had so many entries all together (like in a dictionary) where such differences are more easily noticed....

Good catch, George! Of course, since the fix for this would require changing the results of strings that are valid according to the IsNLSDefinedString function, it would mean a major version change (even if only for specific locales like the Indic ones), which is the kind of change that can usually only be done in a major release. Really too late for Vista, but this is definitely something to look at for the next version....

(In my mind this counts as the sort of bug that I am sorry to see us ship, though I understand why it happened; it solidified in my mind the importance of our close engagement with the Unicode Standard!)

This post brought to you by U+200c and U+200d (a.k.a. ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER)


