You can't ignore diacritics when a language does not give them diacritic weight

by Michael S. Kaplan, published on 2005/02/05 05:55 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/05/367666.aspx

This post is about a not entirely intuitive fact about the implementation of collation in Microsoft products. It affects the results of both CompareString and LCMapString in Windows, the results of using the CompareInfo and SortKey classes in the .NET Framework, and the results in products like Jet and SQL Server.

To help show what is happening under the covers, I will use the sort keys.

We'll use the letters A (U+0041, LATIN CAPITAL LETTER A) and Ą (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK), as well as their lowercase counterparts.

When getting sort keys using the default table (LOCALE_INVARIANT), the weights look like the following:

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 12 01 01 00
ą    U+0105    0E 02 01 1B 01 02 01 01 00
Ą    U+0104    0E 02 01 1B 01 12 01 01 00
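
As a rough aid to reading these keys, here is a minimal Python sketch that splits each one into fields at the 01 separator bytes. The helper name split_sort_key is hypothetical, and the sketch assumes that 01 only ever appears as a separator and that 00 terminates the key, which holds for every key shown in this post. It only parses the hex strings from the table above and does not call any Windows API.

```python
# Sort keys from the default table, copied from the rows above.
SORT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "A": "0E 02 01 01 12 01 01 00",
    "ą": "0E 02 01 1B 01 02 01 01 00",
    "Ą": "0E 02 01 1B 01 12 01 01 00",
}


def to_hex(raw):
    """Format bytes as space-separated uppercase hex, e.g. b'\\x0e\\x02' -> '0E 02'."""
    return " ".join(f"{byte:02X}" for byte in raw)


def split_sort_key(hex_key):
    """Return (unicode, diacritic, case) weights of a sort key as hex strings.

    Hypothetical helper: assumes 0x01 is only a field separator and the
    key ends with a 0x00 terminator. Fields after the case weights are
    ignored here because the post does not discuss them.
    """
    data = bytes.fromhex(hex_key)
    if not data.endswith(b"\x00"):
        raise ValueError("expected a 00 terminator")
    fields = data[:-1].split(b"\x01")
    unicode_w, diacritic_w, case_w = fields[0], fields[1], fields[2]
    return to_hex(unicode_w), to_hex(diacritic_w), to_hex(case_w)


if __name__ == "__main__":
    for char, key in SORT_KEYS.items():
        print(char, split_sort_key(key))
```

An empty string in the second position means the character carries no diacritic weight in that key.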

Note the Unicode weights (the leading 0E 02), the diacritic weights (the 1B that appears only for ą and Ą) and the case weights (the 02 or 12 that follows the second 01 separator). Now when we ignore case:

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 02 01 01 00
ą    U+0105    0E 02 01 1B 01 02 01 01 00
Ą    U+0104    0E 02 01 1B 01 02 01 01 00

And when we ignore diacritics:

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 12 01 01 00
ą    U+0105    0E 02 01 01 02 01 01 00
Ą    U+0104    0E 02 01 01 12 01 01 00

And then we ignore both:

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 02 01 01 00
ą    U+0105    0E 02 01 01 02 01 01 00
Ą    U+0104    0E 02 01 01 02 01 01 00

Clearly, in the default table LATIN CAPITAL LETTER A WITH OGONEK is little more than a LATIN CAPITAL LETTER A with a hook in its foot. A small diacritic weight is added to show that it is still primarily a LATIN CAPITAL LETTER A. And the act of ignoring the diacritic gives identical results to when the diacritic was never there in the first place -- you can see it right in the weights.
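
The four default-table listings above can be reproduced with a small model, sketched here in Python. This is not how Windows implements the flags; it is a simplification that assumes ignoring diacritics empties the diacritic field and ignoring case flattens the case weight to 02, which is all that the tables in this post show. The function name apply_flags and its parameters are hypothetical.

```python
# Default-table sort keys with no flags, copied from the first listing.
DEFAULT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "A": "0E 02 01 01 12 01 01 00",
    "ą": "0E 02 01 1B 01 02 01 01 00",
    "Ą": "0E 02 01 1B 01 12 01 01 00",
}


def apply_flags(hex_key, ignore_case=False, ignore_nonspace=False):
    """Model the effect of the ignore flags on one sort key (hypothetical).

    Assumes fields are separated by 0x01 and the key ends with 0x00.
    """
    data = bytes.fromhex(hex_key)
    fields = data[:-1].split(b"\x01")
    if ignore_nonspace:
        fields[1] = b""  # drop any diacritic weights
    if ignore_case and fields[2]:
        fields[2] = b"\x02"  # flatten to the case weight seen in the tables
    rebuilt = b"\x01".join(fields) + b"\x00"
    return " ".join(f"{byte:02X}" for byte in rebuilt)


if __name__ == "__main__":
    for char, key in DEFAULT_KEYS.items():
        print(char, apply_flags(key, ignore_case=True, ignore_nonspace=True))
```

With both flags set, all four characters collapse to the same key, which is why they compare as equal in the default table.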

Now, how about when we move to Polish, LCID 0x00000415? In Polish, LATIN CAPITAL LETTER A WITH OGONEK is a letter with a unique Unicode weight, and this causes a difference in the results:

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 12 01 01 00
ą    U+0105    0E 04 01 01 02 01 01 00
Ą    U+0104    0E 04 01 01 12 01 01 00

Do you see what happened here? Since in Polish LATIN CAPITAL LETTER A WITH OGONEK has a unique Unicode weight, ignoring the case weight has a predictable effect:

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 02 01 01 00
ą    U+0105    0E 04 01 01 02 01 01 00
Ą    U+0104    0E 04 01 01 02 01 01 00

And ignoring the diacritic weight will have no effect whatsoever (since there is no diacritic weight to ignore):

a    U+0061    0E 02 01 01 02 01 01 00
A    U+0041    0E 02 01 01 12 01 01 00
ą    U+0105    0E 04 01 01 02 01 01 00
Ą    U+0104    0E 04 01 01 12 01 01 00

So the net effect is that for Polish, passing a NORM_IGNORENONSPACE flag in Windows, a CompareOptions.IgnoreNonSpace in the .NET Framework, or a collation in SQL Server such as Polish_CI_AI (Polish, case insensitive, accent insensitive) will never see LATIN CAPITAL LETTER A WITH OGONEK as a LATIN CAPITAL LETTER A, because Polish does not give the letter diacritic weight.
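
To make the contrast concrete, the following Python sketch compares a and ą after removing the diacritic field from the keys listed in this post, once with the default-table keys and once with the Polish ones. The helper names are hypothetical and the code only manipulates the hex strings shown above (assuming 01 separators and a 00 terminator); it is a model of the outcome, not of the actual CompareString logic.

```python
# Keys for lowercase a and ą, copied from the listings in this post.
DEFAULT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "ą": "0E 02 01 1B 01 02 01 01 00",
}
POLISH_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "ą": "0E 04 01 01 02 01 01 00",
}


def strip_diacritic_weights(hex_key):
    """Return the key bytes with the diacritic field emptied (hypothetical)."""
    data = bytes.fromhex(hex_key)
    fields = data[:-1].split(b"\x01")
    fields[1] = b""  # the second field holds the diacritic weights
    return b"\x01".join(fields) + b"\x00"


def equal_ignoring_diacritics(table, first, second):
    """Compare two characters by their keys with diacritic weights removed."""
    return strip_diacritic_weights(table[first]) == strip_diacritic_weights(table[second])


if __name__ == "__main__":
    print("default:", equal_ignoring_diacritics(DEFAULT_KEYS, "a", "ą"))
    print("Polish: ", equal_ignoring_diacritics(POLISH_KEYS, "a", "ą"))
```

In the Polish keys the difference sits in the Unicode weight (0E 04 versus 0E 02), so emptying the diacritic field changes nothing and the two letters stay distinct.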

This is a common issue, whether you look at å (U+00E5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE) in Swedish, Č (U+010C, a.k.a. LATIN CAPITAL LETTER C WITH CARON) in Slovenian, or any of the other hundreds of examples that exist in supported collations. The key is that in each case you must consider not only whether the character appears to have a diacritic on it but also how the language is looking at the string...

This post brought to you by "Ą" (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK)

# CN on 5 Feb 2005 7:40 AM:

Are there any "diacritic" characters that are treated like they were separate letters in the default table?

# RolfBjarne on 5 Feb 2005 8:10 AM:

Even though this might seem a little bit strange, it is indeed correct, for even though the letters "a" and "å" look very similar, in Norwegian (and Swedish) they are two different letters, just like "a" and "b" in English.

# Michael Kaplan on 5 Feb 2005 10:43 AM:

CN -- no, the default table stays pretty clean of that sort of thing. It is something for specific languages with different conventions.

RolfBjarne -- exactly. But it is amazing how often it is considered a bug when a developer runs across the issue, especially in other products like SQL Server or Outlook.

# CN on 5 Feb 2005 1:04 PM:

Do they ever encounter that "Vindovs" is equal to "Windows" while using the same flag? To get back to the original question, that is one case where the default table will consider characters as different, while a few locales think that they are identical/variations of each other, and within the set of ordinary latin letters.

BTW, I know this is not the place to mention it -- but the signature listed for CompareString in MSDN Library is wrong! When I first looked it up I started wondering if it only took atom strings or something else that could be mapped to a DWORD easier than a pointer. Thankfully (in this case), the parameter names were Hungarian...

# Michael Kaplan on 5 Feb 2005 1:09 PM:

Well, not true ever in the default table, but in some languages the V and the W are to be treated the same. See Finnish and Swedish, for example: http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C3.asp

The CompareString doc bug was reported by someone here, before. See http://blogs.msdn.com/michkap/archive/2005/02/02/365251.aspx#366109 . :-)

# CN on 5 Feb 2005 2:37 PM:

I know it isn't true in the default table, but I meant that there is a case where the default table makes a distinction, while the locale tables do not. Your examples with the accented characters were the other way around, and I found it interesting that they are in both ways.

Oooops, and so recent... I guess I read that posting while it was fresh and then never got back to it. Hmhm...

# Michael Kaplan on 5 Feb 2005 4:41 PM:

Ah, I see what you are saying -- yes, you are right. Things really do go in both directions.

# Hitesh Patel on 23 Nov 2012 4:02 AM:

Hello Sir,

I have a problem: when I fetch the data in my class and set it in a text view in Android, the characters are not displayed properly. I asked the question on Stack Overflow, but nobody has suggested a proper answer. Please help me, sir. My question link is below:

stackoverflow.com/.../how-can-i-display-latin-words-in-android

Advance Thanks!!
