by Michael S. Kaplan, published on 2004/11/30 03:57 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2004/11/30/272207.aspx
I get this question on a regular basis -- people wonder if I know that Korean shows up in a random order. People expect this to be the case when the Korean LCID (0x0412, or 1042 in decimal) is not passed, but when it is they expect things to be in some kind of correct order, and as far as they can tell, they are not.
My first big question (to help set expectations properly) is to ask what order they expect to be used. They almost always have trouble explaining what they thought would happen. But usually with a little help they make it through this part. The order is based on the most common Hangul pronunciation of the character, whether it is Hangul or Hanja (the Korean name for Han ideographs).
At this point they start to see some pattern to the results but it still seems random within a particular pronunciation.
My second big question is to ask how they are getting an order. This varies -- if they are a developer then they are calling LCMapString or CompareString (or some API that calls CompareString), otherwise if its in an application (Access or Excel or whatever) then they do not know what API is being called (but usually I know what the application is doing). Or they are using managed code and the CompareInfo or SortKey classes are being used. The problem is usually that they are passing the NORM_IGNORENONSPACE flag (or the CompareOptions.IgnoreNonSpace flag in managed code). And herein is where the problem lies....
You see, modern Korean Hangul and Hanja do not have the notion of non-spacing characters (like ˆˇˉ˘˙˚˛˜˝ diacritics seen in Latin), so that part of the "collation weight space" is used for Hanja that most commonly have the same Hangul pronunciation. Telling the API or method to ignore this weight is basically asking it to treat (for example) such ideographs as 渴, 噶, and 鶡 as randomly sorting together in a non-deterministic way, since the API will report that the two strings are equal. The same thing happens to letters with diacritics in Latin -- passing the flag will cause (for example) A, Ā, Ă, Ą, À, Á, Â, Ã, Ä, and Å to sort together randomly in the same sort of non-deterministic way.
Ah, the light begins to dawn -- the order was not actually so random as they thought. They should not be passing this flag!
The next question they ask is "what is so special about Korean that it is the only language where this is done?" The answer to this is simple: its not.
The names for the flags NORM_IGNORENONSPACE/NORM_IGNORECASE (or IgnoreNonSpace/IgnoreCase) are really misleading, since it really meant to refer to Latin script diacritics and case, which need to weigh between 'A' and 'Ă' as having a secondary difference and between 'A' and 'a' as having a tertiary difference. There are a myriad of languages that have the same linguistic need to express secondary and tertiary differences, so they need to use those types of weights. Passing the flags can harm the accuracy/specificity of Thai, Hindi, Tamil, Telugu, Arabic, Hebrew, and a whole lot more.
The final question that they ask is "But I am doing searches and I need to pass those flags for the Latin/Cyrillic/Greek text. How can I have the results work properly in those searches without finding the wrong characters for these other languages?" The answer here is to use the flags to do the search operation but then order them without using the flags. If you do this then you will get the longer list of candidates, but just as in typical web searches the closest matches will appear higher on the list!
# Scott Hanselman on Tuesday, November 30, 2004 11:56 AM:
# michkap on Tuesday, November 30, 2004 2:13 PM:
# Norman Diamond on Tuesday, December 21, 2004 6:41 PM:
# Michael Kaplan on Tuesday, December 21, 2004 10:12 PM:
# Norman Diamond on Tuesday, December 21, 2004 11:16 PM:
# Michael Kaplan on Tuesday, December 21, 2004 11:36 PM:
# Norman Diamond on Wednesday, December 22, 2004 1:54 PM:
# Michael Kaplan on Wednesday, December 22, 2004 2:05 PM:
# Norman Diamond on Thursday, December 23, 2004 5:37 PM:
2006/11/17 Korean + French == ????
2006/11/01 If you add enough characters to a sort, intuitive distinction can suffer
2005/12/29 What's a secondary distinction?
2005/08/02 New in Vista Beta 1: more use of the word 'linguistic'
2005/06/30 Query collation source data?
2005/05/11 Case/kana/accent/width sensitive SQL Server, for testing
go to newer or older post, or back to index or month or day