They ask me "why is my Korean text in random order?"

by Michael S. Kaplan, published on 2004/11/30 03:57 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2004/11/30/272207.aspx


I get this question on a regular basis -- people wonder if I know that Korean shows up in a random order. People expect this to be the case when the Korean LCID (0x0412, or 1042 in decimal) is not passed, but when it is passed they expect things to come back in some kind of correct order, and as far as they can tell they do not.

My first big question (to help set expectations properly) is to ask what order they expect to be used. They almost always have trouble explaining what they thought would happen. But usually with a little help they make it through this part. The order is based on the most common Hangul pronunciation of the character, whether it is Hangul or Hanja (the Korean name for Han ideographs).

At this point they start to see some pattern to the results but it still seems random within a particular pronunciation.

My second big question is to ask how they are getting an order. This varies -- if they are a developer then they are calling LCMapString or CompareString (or some API that calls CompareString); otherwise, if it's in an application (Access or Excel or whatever) then they do not know what API is being called (but usually I know what the application is doing). Or they are using managed code and the CompareInfo or SortKey classes are being used. The problem is usually that they are passing the NORM_IGNORENONSPACE flag (or the CompareOptions.IgnoreNonSpace flag in managed code). And herein lies the problem....
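To make that concrete, here is a minimal sketch of the native pattern in question -- building binary sort keys with LCMapString -- with the problematic flag passed along. The helper name GetSortKeyForKorean is mine, for illustration only.

```cpp
#include <windows.h>
#include <vector>

// Build a binary sort key for a Korean-locale comparison.
// Passing NORM_IGNORENONSPACE in flags is the step discussed below.
std::vector<BYTE> GetSortKeyForKorean(LPCWSTR text, DWORD flags)
{
    // LCID 0x0412 (1042 decimal) -- Korean, default sort
    LCID lcidKorean = MAKELCID(MAKELANGID(LANG_KOREAN, SUBLANG_KOREAN),
                               SORT_DEFAULT);

    // First call asks for the required sort key size, in bytes.
    int cb = LCMapStringW(lcidKorean, LCMAP_SORTKEY | flags, text, -1, NULL, 0);
    std::vector<BYTE> key(cb);
    if (cb == 0)
        return key;  // call failed; a real caller would check GetLastError()

    LCMapStringW(lcidKorean, LCMAP_SORTKEY | flags, text, -1,
                 reinterpret_cast<LPWSTR>(&key[0]), cb);
    return key;  // the ordering is a plain memcmp() of these keys
}
```

Sorting strings by a memcmp of keys built with flags == NORM_IGNORENONSPACE gives the "random" order that prompts the question; keys built with flags == 0 do not.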

You see, modern Korean Hangul and Hanja do not have the notion of non-spacing characters (like the ˆˇˉ˘˙˚˛˜˝ diacritics seen in Latin script), so that part of the "collation weight space" is used instead to distinguish Hanja that most commonly share the same Hangul pronunciation. Telling the API or method to ignore this weight is basically asking it to treat ideographs such as 渴, 噶, and 鶡 as sorting together randomly in a non-deterministic way, since the API will report that the strings are equal. The same thing happens to letters with diacritics in Latin script -- passing the flag will cause (for example) A, Ā, Ă, Ą, À, Á, Â, Ã, Ä, and Å to sort together randomly in the same sort of non-deterministic way.
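A tiny test program (just a sketch; the two Hanja from above are written as escapes) shows the collapse directly:

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    LCID lcidKorean = MAKELCID(MAKELANGID(LANG_KOREAN, SUBLANG_KOREAN),
                               SORT_DEFAULT);

    // U+6E34 (渴) and U+5676 (噶): different Hanja that most commonly
    // share the same Hangul pronunciation.
    int withFlag    = CompareStringW(lcidKorean, NORM_IGNORENONSPACE,
                                     L"\u6E34", -1, L"\u5676", -1);
    int withoutFlag = CompareStringW(lcidKorean, 0,
                                     L"\u6E34", -1, L"\u5676", -1);

    // With the flag the call reports CSTR_EQUAL (2), so a sort has no
    // basis for putting the two in any particular order; without the
    // flag a real difference is reported.
    printf("with flag: %d, without flag: %d\n", withFlag, withoutFlag);
    return 0;
}
```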

Ah, the light begins to dawn -- the order was not actually so random as they thought. They should not be passing this flag!

The next question they ask is "what is so special about Korean that it is the only language where this is done?" The answer to this is simple: it's not.

The names of the flags NORM_IGNORENONSPACE/NORM_IGNORECASE (or IgnoreNonSpace/IgnoreCase) are really misleading, since they were really meant to refer to Latin-script diacritics and case, which need to be weighted so that 'A' and 'Ă' have a secondary difference and 'A' and 'a' have a tertiary difference. There are a myriad of languages that have the same linguistic need to express secondary and tertiary differences, so they need to use those types of weights. Passing the flags can harm the accuracy/specificity of Thai, Hindi, Tamil, Telugu, Arabic, Hebrew, and a whole lot more.
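The Latin-script analogue, again only as a sketch (I picked the US English locale purely for illustration), shows the two weight levels those flag names were actually aimed at:

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    LCID lcidEnUS = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US),
                             SORT_DEFAULT);

    // 'A' vs. U+0102 'Ă': a secondary (diacritic) difference,
    // thrown away by NORM_IGNORENONSPACE.
    int secondary = CompareStringW(lcidEnUS, NORM_IGNORENONSPACE,
                                   L"A", -1, L"\u0102", -1);

    // 'A' vs. 'a': a tertiary (case) difference,
    // thrown away by NORM_IGNORECASE.
    int tertiary = CompareStringW(lcidEnUS, NORM_IGNORECASE,
                                  L"A", -1, L"a", -1);

    // Both report CSTR_EQUAL (2): the flags make the pairs indistinguishable.
    printf("secondary: %d, tertiary: %d\n", secondary, tertiary);
    return 0;
}
```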

The final question that they ask is "But I am doing searches and I need to pass those flags for the Latin/Cyrillic/Greek text. How can I have the results work properly in those searches without finding the wrong characters for these other languages?" The answer here is to use the flags to do the search operation, but then order the results without using the flags. If you do this then you will get the longer list of candidates, but just as in typical web searches the closest matches will appear higher on the list!
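As a rough sketch of what that can look like over an in-memory list (whole-string matches only, to keep it short; the Search function and its names are mine, for illustration):

```cpp
#include <windows.h>
#include <algorithm>
#include <string>
#include <vector>

// Find the items that match the query when diacritic and case weights
// are ignored, then order the hits with a full (flag-free) comparison
// so that Korean -- and everything else -- comes back deterministically.
std::vector<std::wstring> Search(const std::vector<std::wstring>& items,
                                 const std::wstring& query, LCID lcid)
{
    std::vector<std::wstring> hits;

    // 1. Collect candidates using the forgiving comparison.
    for (const std::wstring& item : items)
    {
        if (CompareStringW(lcid, NORM_IGNORENONSPACE | NORM_IGNORECASE,
                           item.c_str(), -1, query.c_str(), -1) == CSTR_EQUAL)
        {
            hits.push_back(item);
        }
    }

    // 2. Order the candidates *without* the flags so that no weights
    //    are discarded when producing the final list.
    std::sort(hits.begin(), hits.end(),
              [lcid](const std::wstring& a, const std::wstring& b)
              {
                  return CompareStringW(lcid, 0, a.c_str(), -1,
                                        b.c_str(), -1) == CSTR_LESS_THAN;
              });
    return hits;
}
```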


# Scott Hanselman on Tuesday, November 30, 2004 11:56 AM:

I just wanted to let you know that I'm totally digging your posts. Keep it up.

http://www.hanselman.com/blog/SearchView.aspx?q=internationalization

# michkap on Tuesday, November 30, 2004 2:13 PM:

Always good to know people are having fun... after eight years at Microsoft (six as a vendor, two as a full-time employee) I could probably have a post a day for some time. :-)

# Norman Diamond on Tuesday, December 21, 2004 6:41 PM:

You need to use separate flags for those cases. Not every combination of Korean and French text is a Korean-Japanese dictionary, but even if it were, you should not overload one flag in that manner.

A decision on whether or not to ignore nonspace is separate from a decision on whether or not to ignore spaces (or what C calls whitespace) is separate from a decision on whether or not to ignore casing (upper/lower/title) is separate from a decision on whether or not to ignore width (at least in Japanese character sets) is separate from a decision on whether or not to ignore accents is separate from a decision on whether or not to ignore bug reports.

# Michael Kaplan on Tuesday, December 21, 2004 10:12 PM:

I think you missed my point here. If you pass these flags for Korean text then you will destroy meaningful comparisons.

So if you care about Korean (and many other languages') text then these flags may not be in your best interest....

# Norman Diamond on Tuesday, December 21, 2004 11:16 PM:

12/21/2004 10:12 PM Michael Kaplan

> I think you missed my point here.

I think not.

> If you pass these flags for Korean text then
> you will destroy meaningful comparisons.

Yes, because your company abused those flags by squashing multiple meanings into a single flag when separate flags are really necessary. Yes users suffer when they do not know how to contend with that design, and I think I got your point. You know how to improve this part of the design (plus I reminded you of how) and I think you missed my point.

# Michael Kaplan on Tuesday, December 21, 2004 11:36 PM:

So let's stop picking on the person trying to document the issues, shall we? :-)

Seriously, the purpose of this blog is not to discuss the evils of Microsoft. I would seriously recommend that you find a better forum to advance that agenda than the comment section of my blog.

# Norman Diamond on Wednesday, December 22, 2004 1:54 PM:

> So lets stop picking on the person trying to
> document the issues

If you don't have the power to fix them, then I understand. Sorry if I misunderstood.

But please let me add a bit to the explanation, because this was really bugging me all night; it is not really a trivial issue. This will be a highly imperfect analogy, but I hope it will still point out why the issue is not trivial.

Suppose function ToUpper converted "Jan" to "JANUARY". The person who designed ToUpper that way could warn people about it, and you could warn people about it, but people who don't know about the warning will be surprised. Next suppose function ToUpper converts "1" to "UNO". People who really only want an uppercasing function will have to roll their own in order to get it.

Yes the warning from the documenter is valuable. But please also teach developers not to repeat this kind of mistake. Separate meanings need separate flags or need separate functions.

# Michael Kaplan on Wednesday, December 22, 2004 2:05 PM:

Norman, I have the power to fix many things.

I do not, however, have the authority to break applications that exist, nor do I have the desire to. So I would not do that.

Now, I do recognize the problem here. I have identified it, put the identification in a public place, and suggested both the solution to the problem and a workaround for the situations where the solution would break other behavior.

I am also looking at other possible options for the future.

In my opinion, there is absolutely nothing more that even someone more important than I *could* do, or should do.

# Norman Diamond on Thursday, December 23, 2004 5:37 PM:

OK, thank you.


referenced by

2006/11/17 Korean + French == ????

2006/11/01 If you add enough characters to a sort, intuitive distinction can suffer

2005/12/29 What's a secondary distinction?

2005/08/02 New in Vista Beta 1: more use of the word 'linguistic'

2005/06/30 Query collation source data?

2005/05/11 Case/kana/accent/width sensitive SQL Server, for testing
