Is it punctuation, symbol, or diacritic?

by Michael S. Kaplan, published on 2006/05/24 13:23 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/24/605603.aspx


Developer Benjamin Westbrook asked me the following question yesterday (project name obscured for no reason in particular!):

Michael, I’m debugging the Find code for 𐿿𐿿𐿿𐿿𐿿𐿿𐿿𐿿𐿿 and I wonder if you can comment on some behavior:

CultureInfo cultureInfo = new CultureInfo("ar");
int index = cultureInfo.CompareInfo.IndexOf(‎"الْكِتَاب", ‎"الكتاب", CompareOptions.IgnoreNonSpace);

This code ends with index == -1.

I have some second-hand information that in the past you told another 𐿿𐿿𐿿𐿿 dev there were limitations with Arabic and the IgnoreNonSpace flag. Is this expected behavior? Is it possible to match the two strings with existing APIs?

This is an excellent question, one that is deeply embedded in some of the designs of the default collation table, in both Vista and in prior versions of Windows (the latter matches the .NET Framework)....

Before we are done we will talk about FindNLSString, about SORT_STRINGSORT, about NORM_IGNORESYMBOLS, and about NORM_IGNORENONSPACE, and more.

It is actually a problem that former GIFT tester Ihab Abdelhalim had pointed out many times as the non-intuitive design for collation in Arabic, which has finally been fixed in Vista (now that Ihab has left the group), though the fix is not yet a part of the tables that his new group (SQL Server/WinFS) has snapshotted. Eventually, I am sure he will have a chance to appreciate the fix. :-)

First we will look at those two strings, with "identical" code points lined up to highlight the similarities:

U+0627 U+0644        U+0643        U+062a        U+0627 U+0628

U+0627 U+0644 U+0652 U+0643 U+0650 U+062a U+064e U+0627 U+0628

Now obviously there was a good faith basis for believing that the second string could be found within the first -- that these three additional characters could be in some way ignored. If not, the question would not have been asked....
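To make the difference between the two strings concrete, here is a minimal sketch in Python that lists the characters present in only one of them. It uses the Unicode general category from the standard `unicodedata` module as a stand-in; that is an assumption made for illustration, and it is not the same thing as the Windows collation tables discussed in this post. The names `plain`, `marked`, and `describe` are invented for the example.

```python
import unicodedata

# The two strings from the question, spelled out as code points.
plain = "\u0627\u0644\u0643\u062a\u0627\u0628"
marked = "\u0627\u0644\u0652\u0643\u0650\u062a\u064e\u0627\u0628"

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

# Keep only the characters Unicode classifies as nonspacing marks (Mn).
extras = [row for row in describe(marked) if row[2] == "Mn"]
for row in extras:
    print(*row)

# Dropping those marks from the longer string leaves the shorter one.
stripped = "".join(ch for ch in marked if unicodedata.category(ch) != "Mn")
print(stripped == plain)
```

In Unicode's own classification all three extra characters are nonspacing marks, which is in line with the expectation that some "ignore nonspacing" option would skip them.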

We will start by getting some sort keys for these strings in order to see how they are being used, both in Server 2003 and in Vista. I'll try to line up the sort keys for easier comparison.

The two strings with no special flags, Server 2003:

13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01 08 08    08    08    08 08 01 01 01                                     00
13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01 08 08    08    08    08 08 01 01 01 80 0f 06 a6 80 13 06 a5 80 17 06 a3 00

The two strings with NORM_IGNORENONSPACE, Server 2003:

13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01                            01 01 01                                     00
13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01                            01 01 01 80 0f 06 a6 80 13 06 a5 80 17 06 a3 00

The two strings with NORM_IGNORESYMBOLS, Server 2003:

13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01 08 08    08    08    08 08 01 01 01                                     00
13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01 08 08    08    08    08 08 01 01 01                                     00

The two strings with SORT_STRINGSORT, Server 2003:

13 0b 13 63 13 5f 13 13 13 0b 13 0f                   01 08 08    08    08    08 08 01 01 01                                     00
13 0b 13 63 06 a6 13 5f 06 a5 13 13 06 a3 13 0b 13 0f 01 08 08 02 08 02 08 02 08 08 01 01 01                                     00

Analysis: the three additional characters that are in the middle of the one string:

U+0652 (ARABIC SUKUN)
U+0650 (ARABIC KASRA)
U+064e (ARABIC FATHA)

are being treated as that special kind of symbol known as punctuation -- thus when you pass the NORM_IGNORESYMBOLS flag they are ignored entirely, and when you pass the SORT_STRINGSORT flag, they are treated as regular symbols.

The problem is that while it may be interesting or convenient to think of them as a type of punctuation, they are not like that unique kind of punctuation (hyphens and such) that have special "word sort" behavior, and from the feedback we have gotten it is clear that most customers generally expect that passing NORM_IGNORENONSPACE would be used to ignore them, rather than NORM_IGNORESYMBOLS.
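The Server 2003 dumps above are easier to compare once they are cut into pieces. The following Python sketch splits a key on the 0x01 bytes and drops the trailing 0x00. Treating 0x01 as a separator and 0x00 as a terminator is read off the dumps in this post, and reading the last segment as four-byte groups whose final two bytes are a weight is my own interpretation, not something stated here. The function and variable names are invented for the example.

```python
def split_sort_key(hex_dump):
    """Split a hex-dumped sort key into its 0x01-separated segments."""
    data = bytes.fromhex(hex_dump)  # spaces in the dump are skipped
    if not data.endswith(b"\x00"):
        raise ValueError("expected a 0x00 terminator")
    return data[:-1].split(b"\x01")

# Server 2003 keys copied from the dumps above (alignment padding removed).
PLAIN_2003 = "13 0b 13 63 13 5f 13 13 13 0b 13 0f 01 08 08 08 08 08 08 01 01 01 00"
MARKED_2003 = (
    "13 0b 13 63 13 5f 13 13 13 0b 13 0f 01 08 08 08 08 08 08 01 01 01 "
    "80 0f 06 a6 80 13 06 a5 80 17 06 a3 00"
)
MARKED_2003_STRINGSORT = (
    "13 0b 13 63 06 a6 13 5f 06 a5 13 13 06 a3 13 0b 13 0f "
    "01 08 08 02 08 02 08 02 08 08 01 01 01 00"
)

plain_segments = split_sort_key(PLAIN_2003)
marked_segments = split_sort_key(MARKED_2003)

# Which segments differ between the two strings when no flags are passed?
differing = [
    i for i, (a, b) in enumerate(zip(plain_segments, marked_segments)) if a != b
]
print("segments that differ:", differing)

# Assumption: the last segment is made of 4-byte groups, weight in bytes 3-4.
trailing = marked_segments[-1]
tail_weights = [trailing[i + 2:i + 4] for i in range(0, len(trailing), 4)]

# The same two-byte weights show up inline in the SORT_STRINGSORT key.
stringsort_first = split_sort_key(MARKED_2003_STRINGSORT)[0]
for weight in tail_weights:
    print(weight.hex(), "inline under SORT_STRINGSORT:", weight in stringsort_first)
```

Seen this way, without flags the two keys differ only in the final segment, and SORT_STRINGSORT moves those same weights into the first segment, which fits the word-sort punctuation treatment described above.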

So, let's look at the Vista weights:

The two strings with no special flags, Vista:

29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01             01 01 01 00
29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01 02 e4 e3 e1 01 01 01 00

The two strings with NORM_IGNORENONSPACE, Vista:

29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01             01 01 01 00
29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01             01 01 01 00

The two strings with NORM_IGNORESYMBOLS, Vista:

29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01             01 01 01 00
29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01 02 e4 e3 e1 01 01 01 00

The two strings with SORT_STRINGSORT, Vista:

29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01             01 01 01 00
29 0b 29 a7 29 8f 29 20 29 0b 29 0e 01 02 e4 e3 e1 01 01 01 00

Summary -- these three different characters are now treated as diacritics or non-spacing marks, so that NORM_IGNORENONSPACE will give the appropriate behavior to ignore them, and both NORM_IGNORESYMBOLS and SORT_STRINGSORT will have no effect.

Now of course you can use the new FindNLSString to get the right answer on Benjamin's original question, with the special bonus being that you can have the length of the string that was found (which will be a few characters different than the string that was being looked for!).
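The point about the found length is easy to miss, so here is a minimal Python sketch of a search that skips nonspacing marks and reports both where the match starts and how many characters of the source it covers. This is not FindNLSString and does not use the Windows collation tables; it swaps in the Unicode general category `Mn` from `unicodedata` as an approximation, and `find_ignoring_marks` is a hypothetical helper written for this example.

```python
import unicodedata

def is_mark(ch):
    """True for characters Unicode classifies as nonspacing marks."""
    return unicodedata.category(ch) == "Mn"

def find_ignoring_marks(source, value):
    """Return (start, length) of the first match in source, or (-1, 0)."""
    # Keep only base characters, remembering where each one sat in source.
    kept = [(i, ch) for i, ch in enumerate(source) if not is_mark(ch)]
    stripped_source = "".join(ch for _, ch in kept)
    stripped_value = "".join(ch for ch in value if not is_mark(ch))
    if not stripped_value:
        return (0, 0)
    pos = stripped_source.find(stripped_value)
    if pos < 0:
        return (-1, 0)
    start = kept[pos][0]
    # The match runs up to the next unmatched base character, so marks that
    # follow the last matched base character are counted as part of it.
    end_base = pos + len(stripped_value)
    end = kept[end_base][0] if end_base < len(kept) else len(source)
    return (start, end - start)

source = "\u0627\u0644\u0652\u0643\u0650\u062a\u064e\u0627\u0628"  # with marks
value = "\u0627\u0644\u0643\u062a\u0627\u0628"                     # without

print(source.find(value))                  # plain code point search
print(find_ignoring_marks(source, value))  # mark-insensitive search
```

For the strings in the question, the search value has six characters while the matched span in the source has nine, which is the "few characters different" that the returned length lets a caller account for when highlighting or replacing the match.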

If you need the answer in managed code, there are two possible answers:

But this was indeed an excellent question, and the kind of behavior that is good to dig into and explain from time to time, as the effort to improve linguistically appropriate results continues....

 

This post brought to you by " ِ " (U+0650, a.k.a. ARABIC KASRA)


Erika Qualls on 4 May 2011 8:02 AM:

What do it really mean


