On Bengali sorting (where an old part of my personal life mirrors an even older bug in Windows)

by Michael S. Kaplan, published on 2010/09/02 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/02/10057075.aspx


Although it doesn't start off very technical, this blog will become technical quickly enough. Standard disclaimers about all of the people I am not speaking for apply, obviously....

Preparatory information for this blog you are reading which may be useful to provide context:

My interest in Indic languages in general and Bengali in particular pre-dates my personally knowing any Bengalis (not entirely true since there are friends of mine like Trina Saha who I have known for many years but I did not know she was Bengali when I first knew her all those years ago, so it may as well be true).

Also, I have a bunch of friends from India and Bangladesh that I have met over the last few years since that time, and several of those friends are on Facebook.

A few years back I dated a "Bengali" woman, and from my limited sample size of one I'll confirm that whatever you have heard about beauty, intelligence, and passion about them is entirely true.

The relationship ended up not working out when all was said and done, and I'm pretty sure that was in part (though not completely) my fault. And I did not fully understand some aspects of the relationship itself, as is often expected of men in such situations, though it was probably entirely understandable from the proper vantage point.

All of this information is something you should keep in mind as it provides both source material for the problem I'll be discussing (the bug, not the relationship!), and the framework for an extended analogy describing a product bug in terms of my personal life. If that sort of thing annoys you, then please feel free to find another MSDN Blog to read; I find it to be relaxing to identify metaphorical patterns such as this.

The blog itself follows. :-)

It all started the other day, on Facebook.

My friend there, Tanbin Islam Siyam, posted the following screenshot and text:



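এক্সেল আর ওয়ার্ড দুটোতেই চেষ্টা করলাম, অজ্ঞাত কারণে উ-কার, ঊ-কার ও ঋ-কার উপরে উঠে যাচ্ছে। কি করে সম্ভব??

[Translation: I tried this in both Excel and Word; for some unknown reason the u-kar, uu-kar, and ri-kar (the vowel signs U, UU, and VOCALIC R) are moving up. How is that possible??]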

 After which, friend Rifat Nabi included the following screenshot and text in the comments:



 I'm using Office 2010. Check this out - [N.B- This one is also wrong :( but different]

As you can probably guess even if you know no Bengali whatsoever, both complaints were about the sort order of the text.

Now the mistakes people often make when they look to see text sorted properly for Indic languages on Windows fall into a few broad categories:

  1. Some people expect the text to be sorted on a version of Windows prior to that language being added;
  2. Other people expect the text to be sorted correctly in a Word table even if they did not mark the text language properly;
  3. Still other people expect the text to be sorted correctly in Excel even if they did not set their default user locale to the correct language;
  4. And still other people expect the text to be sorted correctly in Access or SQL Server even if the collation was not set correctly.

Now with both examples clearly in Excel, I assumed that problem #3 was involved, or maybe problem #1.

But it was actually none of those things.

You see, while it is true that you need to be using the right locale in order to get the full support, the truth is that this is primarily for the sake of proper sorting when certain specific characters are included, in the case of Bengali characters like the following:

The truth is that for everything other than the way these four characters combine with the other letters, the sort is handled just fine by the default table any time the underlying platform supported the sort, and you don't have to set the locale or use the collation to get the right behavior for Bengali.

In this case, the first Excel spreadsheet had the following in it:

ক       U+0995        KA   BENGALI LETTER KA
কা      U+0995 U+09be  KAA  BENGALI LETTER KA + BENGALI VOWEL SIGN AA
কি      U+0995 U+09bf  KI   BENGALI LETTER KA + BENGALI VOWEL SIGN I
কী      U+0995 U+09c0  KII  BENGALI LETTER KA + BENGALI VOWEL SIGN II
কু      U+0995 U+09c1  KU   BENGALI LETTER KA + BENGALI VOWEL SIGN U
কূ      U+0995 U+09c2  KUU   BENGALI LETTER KA + BENGALI VOWEL SIGN UU
কৃ      U+0995 U+09c3  KAR   BENGALI LETTER KA + BENGALI VOWEL SIGN VOCALIC R
কে      U+0995 U+09c7  KE   BENGALI LETTER KA + BENGALI VOWEL SIGN E
কৈ      U+0995 U+09c8  KAI  BENGALI LETTER KA + BENGALI VOWEL SIGN AI
কো      U+0995 U+09cb  KO   BENGALI LETTER KA + BENGALI VOWEL SIGN O
কৌ      U+0995 U+09cc  KAU  BENGALI LETTER KA + BENGALI VOWEL SIGN AU
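A breakdown like the one above can be produced for any string. What follows is just a small sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and has nothing to do with how the original list was built.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

if __name__ == "__main__":
    # KA followed by VOWEL SIGN U, written with escapes to avoid font issues.
    for code_point, name in describe("\u0995\u09c1"):
        print(code_point, name)
```

Listing the strings this way makes it obvious that every entry is the same consonant followed by at most one dependent vowel sign.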

And the problem? That the order was not what they expected.

Kind of like the way that relationship had things not as I expected them. I mean, some of it was, but other parts confused me and stuff just seemed out of sorts at times. I'm sure she felt the same way....

Now in looking into this, I started by grabbing the sort keys of the characters in question:

ক       U+0995        33 20 01 01 01 01 00
কা      U+0995 U+09be 33 20 33 30 01 01 01 01 00
কি      U+0995 U+09bf 33 20 33 31 01 01 01 01 00
কী      U+0995 U+09c0 33 20 33 32 01 01 01 01 00
কু      U+0995 U+09c1 33 20 01 07 01 01 01 00
কূ      U+0995 U+09c2 33 20 01 08 01 01 01 00
কৃ      U+0995 U+09c3 33 20 01 09 01 01 01 00
কে      U+0995 U+09c7 33 20 33 33 01 01 01 01 00
কৈ      U+0995 U+09c8 33 20 33 34 01 01 01 01 00
কো      U+0995 U+09cb 33 20 33 35 01 01 01 01 00
কৌ      U+0995 U+09cc 33 20 33 36 01 01 01 01 00

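And suddenly I had my first clue, with the items marked in red (the U+09c1, U+09c2, and U+09c3 rows, whose keys carry 07, 08, and 09 after the first 01 separator).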

Why did some of these entries have a diacritic weight?
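One way to see what that question is getting at is to treat each sort key as a byte string and split it at the 0x01 separators. The sketch below is only an illustration: the keys are copied from the list above, the helper names are made up, and reading the first field as primary weights and the second field as diacritic weights is an assumption based on the numbers in this post rather than a statement of the full key layout.

```python
# Sort keys copied from the list above, keyed by the vowel sign that
# follows KA (U+0995). "(none)" is the bare letter.
SORT_KEYS = {
    "(none)":    "33 20 01 01 01 01 00",
    "AA":        "33 20 33 30 01 01 01 01 00",
    "I":         "33 20 33 31 01 01 01 01 00",
    "II":        "33 20 33 32 01 01 01 01 00",
    "U":         "33 20 01 07 01 01 01 00",
    "UU":        "33 20 01 08 01 01 01 00",
    "VOCALIC R": "33 20 01 09 01 01 01 00",
    "E":         "33 20 33 33 01 01 01 01 00",
    "AI":        "33 20 33 34 01 01 01 01 00",
    "O":         "33 20 33 35 01 01 01 01 00",
    "AU":        "33 20 33 36 01 01 01 01 00",
}

def split_levels(hex_key):
    """Split a sort key into fields at the 0x01 separators.

    Assumption: the key ends in a 0x00 terminator and no weight byte is
    itself 0x01, which holds for every key listed here.
    """
    raw = bytes.fromhex(hex_key)
    if raw[-1] != 0x00:
        raise ValueError("sort key is not 0x00-terminated")
    return raw[:-1].split(b"\x01")

def has_diacritic_weight(hex_key):
    # Field 0 is assumed to hold primary weights, field 1 diacritic weights.
    return len(split_levels(hex_key)[1]) > 0

def byte_order(keys):
    """Names ordered by plain byte-wise comparison of their sort keys."""
    return sorted(keys, key=lambda name: bytes.fromhex(keys[name]))

if __name__ == "__main__":
    for name in byte_order(SORT_KEYS):
        note = "diacritic weight" if has_diacritic_weight(SORT_KEYS[name]) else ""
        print(f"{name:10} {SORT_KEYS[name]:28} {note}")
```

Compared byte by byte, the three keys that carry a diacritic weight land directly after the bare KA and ahead of AA, which is consistent with the complaint about those vowel signs moving up.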

This sent me to the Windows Protocol docs, in particular 3.1.5.2.3 Accessing the Windows Sorting Weight Table, which got me to 7 Appendix B: Windows Sorting Weight Table, which eventually got me to the 16.5 MB Windows 7 and Windows Server 2008 R2 Sorting Weight Table, from which I extracted the following relevant entries:

0x09c1  1   0   5  0  ;Bengali Vowel Sign U
0x09c2  1   0   6  0  ;Bengali Vowel Sign Uu
0x09c3  1   0   7  0  ;Bengali Vowel Sign Vocalic R
0x09c4  1   0   8  0  ;Bengali Vowel Sign Vocalic Rr
0x09e2  1   0   10 0  ;Bengali Vowel Sign Vocalic L
0x09e3  1   0   11 0  ;Bengali Vowel Sign Vocalic Ll

0x0995  51  32  2  2  ;Bengali Ka

0x09be  51  48  2  2  ;Bengali Vowel Sign Aa
0x09bf  51  49  2  2  ;Bengali Vowel Sign I
0x09c0  51  50  2  2  ;Bengali Vowel Sign Ii
0x09c7  51  51  2  2  ;Bengali Vowel Sign E
0x09c8  51  52  2  2  ;Bengali Vowel Sign Ai
0x09cb  51  53  2  2  ;Bengali Vowel Sign O
0x09cc  51  54  2  2  ;Bengali Vowel Sign Au

Now there is really no good reason for the first six vowels to sort with just a "secondary" distinction while the last seven vowels do not; although there is currently some dispute about how Bengali should in fact be sorted, it is fairly obvious that all of the dependent vowels should be sorted the same way.
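To show how the two sets of numbers in this post line up, here is a simplified model that rebuilds the sort keys listed earlier from the weight table rows. Everything about the model is an assumption inferred from those numbers: the column meanings (script member, primary weight, diacritic weight, case weight), the idea that script member 1 marks a combining mark whose diacritic weight is added to the preceding letter's, and the idea that trailing default weights of 2 are dropped. It is not the actual Windows algorithm, and the names are made up.

```python
# Rows copied from the weight table excerpt above.
# code point -> (script member, primary weight, diacritic weight, case weight)
# The column meanings are assumed, not taken from the table itself.
WEIGHTS = {
    0x0995: (51, 32, 2, 2),  # Bengali Ka
    0x09BE: (51, 48, 2, 2),  # Vowel Sign Aa
    0x09BF: (51, 49, 2, 2),  # Vowel Sign I
    0x09C0: (51, 50, 2, 2),  # Vowel Sign Ii
    0x09C1: (1, 0, 5, 0),    # Vowel Sign U
    0x09C2: (1, 0, 6, 0),    # Vowel Sign Uu
    0x09C3: (1, 0, 7, 0),    # Vowel Sign Vocalic R
    0x09C7: (51, 51, 2, 2),  # Vowel Sign E
    0x09C8: (51, 52, 2, 2),  # Vowel Sign Ai
    0x09CB: (51, 53, 2, 2),  # Vowel Sign O
    0x09CC: (51, 54, 2, 2),  # Vowel Sign Au
}

NONSPACING = 1  # assumed: script member used for marks with no primary weight
DEFAULT = 2     # assumed: default diacritic/case weight, dropped when trailing

def build_sort_key(text):
    """Rebuild a sort key for a base letter plus optional vowel sign.

    Only handles text that starts with a letter carrying a primary weight.
    """
    primary, diacritic, case = [], [], []
    for ch in text:
        script, alpha, dia, cas = WEIGHTS[ord(ch)]
        if script == NONSPACING:
            # No primary weight: fold into the previous letter's diacritic weight.
            diacritic[-1] += dia
        else:
            primary += [script, alpha]
            diacritic.append(dia)
            case.append(cas)
    # Trailing default weights carry no information, so drop them.
    while diacritic and diacritic[-1] == DEFAULT:
        diacritic.pop()
    while case and case[-1] == DEFAULT:
        case.pop()
    key = primary + [1] + diacritic + [1] + case + [1, 1, 0]
    return " ".join(f"{b:02x}" for b in key)

if __name__ == "__main__":
    for sign in (0x09BE, 0x09C1, 0x09C7):
        print(f"U+0995 U+{sign:04x}", build_sort_key("\u0995" + chr(sign)))
```

Under this model the seven vowel signs with script member 51 extend the primary weights, while the ones with script member 1 only bump the diacritic weight of KA, which is the split being described here.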

As luck would have it, in the typical way one might try to sort Bengali one could finesse the weight tables in a way to make either of these options work, but there is no option that will let you work by combining the two and putting half of the vowels in each category. And to be honest it would be easier, more consistent with other Indic languages, and a better experience all the way around (for several other reasons not relevant at the moment) to not use the diacritic weights here.

No other Indic language appears to have this specific problem; it only exists in Bengali.

I am sure my Ex, if I were to ask her, would admit that "Michael, some of the problems in the relationship were because of you" (or "u" to use texting lingo).

Thus it should not surprise me that about 33% of the problem relates to the way BENGALI VOWEL SIGN U and BENGALI VOWEL SIGN UU are sorting. Pardon the pun.

My Ex would probably not be surprised if I said that I didn't fully understand what happened and what didn't work out.

This may well correspond to the fact that the other 67% of the main problem relates to the VOWEL SIGN VOCALIC R, RR, L and LL sort, since even now that I am able to pronounce them, I still don't fully understand how the vocalic vowels work from a linguistic standpoint.

I am reasonably certain my Ex would admit that I was right all along when I initially pointed out that she and I were really not compatible, with all of the differences in cultural background and age and general cynicism about life. And that despite the intellect and the interests we did connect on, it never really was going to be the right thing for either of us.

One interesting facet of the bug is that if you go back to 7 Appendix B: Windows Sorting Weight Table and the 3.3 MB Windows NT 4.0 through Windows Server 2003 Sorting Weight Table, you will see that this dependent vowel category split has existed since these code points were first added to the Windows default weight table during the Windows 2000 beta, which was several versions before true linguistic support, or fonts, was even claimed to be added to Windows.

So given the historical weights that have been wrong all this time, it is fair to say that Bengali collation has not only always been broken on Windows, but it has also been broken since before Bengali collation support was ever claimed to be present. The right thing was never really happening.

She (the Ex) and I did get along and had some great moments, and it was only when you looked really closely that you saw the problems. If we were less self-aware we might have lasted much longer.

On Windows the Bengali results are never truly awful except when you really look closely at the details and compare the way one set of vowels sort with consonants to the way the other vowels sort with the same consonants.

And at first both she and I discounted the issues in the relationship since it was new and interesting and who overthinks a relationship before it exists?

Now technically the collation weights have been wrong in this way since Windows 2000 Beta 1, but since no one claimed the support was right until Vista we would have discounted such reports by saying the collation was not expected to be accurate yet. Who would have taken the early bug report seriously?

Unless you looked up close at the aforementioned details, she (the Ex) and I were just fine together.

And since the primary weights of all the consonants and half the vowels were correct, you have to craft specific examples to see the problem in the sort -- otherwise everything seemed just fine. And no one complained for years, which helps prove my point.

And now we finally get to where my analogy breaks down.

Because although it is unlikely in the extreme that she (the Ex) and I would try a romantic relationship again (it is the Tamil -- and Kerala -- girls who have been catching my eye more recently anyway!), there is a much more reasonable chance that some future version of Windows could support a Bengali collation that did not have the bug with these six letters (and with a few other, less systemic minor problems I noticed cleared up, too).

Oh, and by the way? Assamese is broken, too. On the same letters.

If you are a tester from the team formerly known as NLS, you should feel free to enter a bug (just one single bug) on this issue. I suppose you can assign it to me for the time being.... :-)


Michael S. Kaplan on 2 Sep 2010 6:48 PM:

I must admit that the number and varied nature of responses I have gotten has been interesting -- of the responses, about 30% found the back story to be distracting (1 of them classified the distraction as "disappointing"), about 40% liked that I wrote about something that non-technical people could enjoy as well, and the remaining ones thanked me for the way the bulk of the actual bug description was kept separate from the back story....

Alex Cohn on 5 Sep 2010 9:16 AM:

Thanks, it's probably the best post in this blog. IMVHO, the balance of personal and technical is what makes blogs different from "MSDN newsletter", etc.

