by Michael S. Kaplan, published on 2005/02/05 05:55 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/05/367666.aspx
This post is about a not entirely intuitive aspect of how collation is implemented in Microsoft products. It affects the results of both CompareString and LCMapString in Windows, the results of the CompareInfo and SortKey classes in the .NET Framework, and the results in products like Jet and SQL Server.
To help show what is happening under the covers, I will use the sort keys.
We'll use the letters A (U+0041, LATIN CAPITAL LETTER A) and Ą (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK), as well as their lowercase counterparts.
When getting sort keys using the default table (LOCALE_INVARIANT), the weights look like the following:
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 12 01 01 00
ą U+0105 0E 02 01 1B 01 02 01 01 00
Ą U+0104 0E 02 01 1B 01 12 01 01 00
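To make the keys above easier to read, here is a minimal Python sketch that splits one of them into its fields. It assumes what the listed bytes suggest: fields are separated by 01 bytes and the key ends with 00. The helper name `split_sort_key` is hypothetical. So are the labels `field4` and `field5`, which stand for the last two fields that this post does not discuss. The sketch does not call any Windows API; it only works on the hex strings printed above.

```python
def split_sort_key(hex_key):
    """Split a sort key (space-separated hex bytes) into its weight fields.

    Assumption: fields are separated by 0x01 and the key ends with 0x00,
    as in the keys listed in this post.
    """
    data = bytes.fromhex(hex_key)
    if not data.endswith(b"\x00"):
        raise ValueError("sort key must end with the 00 terminator")
    fields = data[:-1].split(b"\x01")
    # Hypothetical labels; only the first three are discussed in the post.
    names = ["unicode", "diacritic", "case", "field4", "field5"]
    if len(fields) != len(names):
        raise ValueError("expected %d fields, got %d" % (len(names), len(fields)))
    return {
        name: " ".join("%02X" % b for b in field)
        for name, field in zip(names, fields)
    }


if __name__ == "__main__":
    # U+0104 from the default table above.
    print(split_sort_key("0E 02 01 1B 01 12 01 01 00"))
```

For U+0104 this yields a Unicode weight of 0E 02, a diacritic weight of 1B, and a case weight of 12; for plain a the diacritic field is empty.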
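Note the Unicode weights (0E 02, the bytes before the first 01 separator), the diacritic weights (1B, between the first and second 01 when present) and the case weights (02 or 12, between the second and third 01). Now when we ignore case: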
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 02 01 01 00
ą U+0105 0E 02 01 1B 01 02 01 01 00
Ą U+0104 0E 02 01 1B 01 02 01 01 00
And when we ignore diacritics:
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 12 01 01 00
ą U+0105 0E 02 01 01 02 01 01 00
Ą U+0104 0E 02 01 01 12 01 01 00
And then we ignore both:
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 02 01 01 00
ą U+0105 0E 02 01 01 02 01 01 00
Ą U+0104 0E 02 01 01 02 01 01 00
Clearly, in the default table LATIN CAPITAL LETTER A WITH OGONEK is little more than a LATIN CAPITAL LETTER A with a hook in its foot. A small diacritic weight is added to show that it is still primarily a LATIN CAPITAL LETTER A. And the act of ignoring the diacritic gives identical results to when the diacritic was never there in the first place -- you can see it right in the weights.
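The effect of the ignore flags on the default-table keys can be imitated with a short Python sketch. This is not how Windows computes sort keys. It only reproduces the tables above by editing the bytes, on the assumption that "ignore diacritics" empties the diacritic field and "ignore case" sets the case field to 02, the value the tables show. The names `mask_key` and `DEFAULT_KEYS` are hypothetical.

```python
SEP, END = b"\x01", b"\x00"
# Assumption: the case weight the tables in this post show once case is ignored.
IGNORED_CASE_WEIGHT = b"\x02"

# Keys copied from the default (LOCALE_INVARIANT) table in this post.
DEFAULT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "A": "0E 02 01 01 12 01 01 00",
    "\u0105": "0E 02 01 1B 01 02 01 01 00",  # a with ogonek
    "\u0104": "0E 02 01 1B 01 12 01 01 00",  # A with ogonek
}


def mask_key(hex_key, ignore_case=False, ignore_diacritics=False):
    """Imitate the ignore flags by rewriting fields of an existing key."""
    fields = bytes.fromhex(hex_key)[:-1].split(SEP)
    if ignore_diacritics:
        fields[1] = b""  # drop the diacritic weight
    if ignore_case:
        fields[2] = IGNORED_CASE_WEIGHT  # flatten the case weight
    return " ".join("%02X" % b for b in SEP.join(fields) + END)


if __name__ == "__main__":
    for ch, key in DEFAULT_KEYS.items():
        print(ch, mask_key(key, ignore_case=True, ignore_diacritics=True))
```

With both flags set, all four characters end up with the same key, which matches the last default table.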
Now, how about when we move to Polish, LCID 0x00000415? In Polish, LATIN CAPITAL LETTER A WITH OGONEK is a letter with a unique Unicode weight, and this causes a difference in the results:
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 12 01 01 00
ą U+0105 0E 04 01 01 02 01 01 00
Ą U+0104 0E 04 01 01 12 01 01 00
Do you see what happened here? Since in Polish LATIN CAPITAL LETTER A WITH OGONEK has a unique Unicode weight, ignoring the case weight has a predictable effect:
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 02 01 01 00
ą U+0105 0E 04 01 01 02 01 01 00
Ą U+0104 0E 04 01 01 02 01 01 00
And ignoring the diacritic weight will have no effect whatsoever (since there is no diacritic weight to ignore):
a U+0061 0E 02 01 01 02 01 01 00
A U+0041 0E 02 01 01 12 01 01 00
ą U+0105 0E 04 01 01 02 01 01 00
Ą U+0104 0E 04 01 01 12 01 01 00
So the net effect is that for Polish, passing the NORM_IGNORENONSPACE flag in Windows, passing CompareOptions.IgnoreNonSpace in the .NET Framework, or using a collation in SQL Server such as Polish_CI_AI (Polish, case insensitive, accent insensitive) will never treat LATIN CAPITAL LETTER A WITH OGONEK as a LATIN CAPITAL LETTER A, because Polish gives the letter its own Unicode weight rather than a diacritic weight.
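The difference between the two tables can be checked with a small Python sketch that compares keys after imitating the ignore flags. As before, this only manipulates the bytes listed in this post; it does not call CompareString or any real collation API. The helper names (`mask`, `equal_under`) and the masking rule are assumptions for illustration: the diacritic field is emptied and the case field is set to 02.

```python
# Keys copied from the tables in this post (no flags applied).
DEFAULT_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "\u0105": "0E 02 01 1B 01 02 01 01 00",  # a with ogonek, default table
    "\u0104": "0E 02 01 1B 01 12 01 01 00",  # A with ogonek, default table
}
POLISH_KEYS = {
    "a": "0E 02 01 01 02 01 01 00",
    "\u0105": "0E 04 01 01 02 01 01 00",  # a with ogonek, Polish
    "\u0104": "0E 04 01 01 12 01 01 00",  # A with ogonek, Polish
}


def mask(hex_key, ignore_case, ignore_diacritics):
    """Hypothetical imitation of the ignore flags on an existing key."""
    fields = bytes.fromhex(hex_key)[:-1].split(b"\x01")
    if ignore_diacritics:
        fields[1] = b""
    if ignore_case:
        fields[2] = b"\x02"
    return b"\x01".join(fields) + b"\x00"


def equal_under(keys, x, y, ignore_case=False, ignore_diacritics=False):
    """True if x and y have identical keys once the flags are imitated."""
    return (mask(keys[x], ignore_case, ignore_diacritics)
            == mask(keys[y], ignore_case, ignore_diacritics))


if __name__ == "__main__":
    for name, keys in (("default", DEFAULT_KEYS), ("Polish", POLISH_KEYS)):
        same = equal_under(keys, "a", "\u0105",
                           ignore_case=True, ignore_diacritics=True)
        print(name, "a == a-with-ogonek when ignoring both:", same)
```

In the default table the two letters compare equal once diacritics are ignored; in the Polish table they never do, because the difference sits in the Unicode weight (0E 02 versus 0E 04), which neither flag touches.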
This is a common issue, whether you look at å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE) in Swedish, Č (U+010c, a.k.a. LATIN CAPITAL LETTER C WITH CARON) in Slovenian, or any of the other hundreds of examples that exist in supported collations. The key is that in each case you must consider not only whether the character appears to have a diacritic on it but also how the language treats that character....
This post brought to you by "Ą" (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK)
Hitesh Patel on 23 Nov 2012 4:02 AM:
Hello Sir,
I have a problem: when I fetch the data in my class and set it in a TextView in Android, the characters do not display properly. I asked the question on Stack Overflow, but no one has suggested a proper answer. Please help me, sir. My question is linked below:
stackoverflow.com/.../how-can-i-display-latin-words-in-android
Advance Thanks!!