What's up with the Korean (Unicode) sort?

by Michael S. Kaplan, published on 2004/12/14 02:44 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/14/284838.aspx

I had this conversation a little over two years ago in the Netherlands at the end of the last day of a conference. It may not be word for word, though I actually think it comes pretty close (it's not like I had a tape recorder). The cookies were Pepperidge Farm Mint Milanos, but I do not like mint (I love the non-mint varieties, I am not sure how I ended up with the ones I did - it might have been a mistake to mention I did not like them).

Oh, also the name of the woman I talked to is not really Andrea; I just like the name and do not mind the nod to Jubal Harshaw....

Me: I'd actually rather give you one of these cookies. They are really good. Plus it's less embarrassing than the answer to your question.

Andrea: I know you hate mint, you said so yesterday at the luncheon. C'mon Michael!

Andrea: Ok, no Russian bears. So tell me, why is the Korean Unicode sort embarrassing? I could not find it defined anywhere, except maybe I found a vague hint to the 'Unicode collation' setting that was used in SQL Server 7.0, which could be Korean. Is that it?

Me: No, that's not what it is. Though SQL Server does have a "Korean Unicode collation" of its own that matches the one that used to be on Windows.

Andrea: Grrr. You are infuriating, Michael. What is the Korean Unicode sort? The one that is in SQL Server, the one that used to be in Windows, the one that is still in the header files. What is it?

Andrea: Almost? How close is almost? Sounds like almost hitting a home run, but what kind? Was it an almost home run that was a strike out, or an almost home run that was a triple?

Me: Ouch! Well, if you put it that way, I guess you could say it's a strike out.

Andrea: What character is it? Something insulting to a government? Did Microsoft upset the Korean premier or something?

Me: No, nothing like that. It's U+005c, the "REVERSE SOLIDUS". Also known as the backslash. Not insulting at all.

Andrea: One of us has to be missing something, Michael. Maybe you had better give me a cookie.

Andrea: You said in your talk today that there is room for over a million characters in Unicode. There is no room for a dedicated Won?

Me: Oh, there is a dedicated Won Sign at U+20a9. It's just that in most Korean fonts a character that looks like a Won is put in the slot for U+005c, and since the characters look the same we try to make sure that they are treated as if they were the same.

Andrea: Ok, I see that. But why is it called the Korean Unicode sort? If it's legacy then that would make it the Korean ANSI sort, right?

Andrea: You know what I mean, Michael. Are you this exasperating when you talk with your girlfriend?

Andrea: Just kidding. But I was up too late last night and you already gave me the cookies. So I have no real need to flirt when I am teasing at this point.

Me: Hmmmm, no one ever used to have a need. Anyway, I know what you mean. It probably would have made more sense to tie it to the Korean standard, except that's encoding and not sorting. And they basically do put the won at 0x5c in their encoding standard, so MS is just trying to be consistent. It would have been really weird trying to tie to KSC-5601.

Andrea: I can definitely see that. So, what about the rest of the Hangul and Hanja and Jamo and whatnot that are used by Koreans?

Me: Well, now you understand why it was probably removed from Windows -- because it does not really do much for Korean.

Me: I know you think that I am a bigwig at Microsoft, but I'm not. I was offered a job there but I haven't even started yet. And I am definitely not "in the know" about what they do in SQL Server.

Andrea: No need to be shirty, dear. I understand. I apologize for thinking you were important.

Andrea: Ok, and I apologize for teasing you now. But back to the Korean thing.... do you have a guess?

Me: My guess is that there is a serious worry about backward compatibility and sort orders in SQL Server, so they can't really get rid of something as easily, even if it is useless. I guess they could have hacked it since it's only different by one character, but they are a team that is astoundingly against hacks. That's something I can respect.

Me: Maybe. If PSS gets customers wondering where good old 0x00010412 went, I'll suggest it.

Me: No worries, the group is gone, the conference is mostly over. Hell, I'd probably be flying out tonight if there were a flight. You can come out with us tonight if you want. Well, that is if we are going anywhere.

Andrea: Actually, you can come out with us. My friends are more socially adept than yours.

Needless to say, the conversation devolved at that point. But Andrea did finish the cookies. I did go out with four of Andrea's friends that night and drank more than I should have. The flight home was harder with a hangover, and to be perfectly honest it was not until I sat down to try and remember the whole conversation earlier tonight that I remembered I was supposed to follow up with PSS.

Good day. When I have time I'll try to read more of your blog, but after being pointed this way from Raymond Chen's blog, here's a few comments.

> Me: Well, because for Korean, it is also the
> Won sign (₩).

1. I doubt that. If it's anything like Japanese and the yen sign, then it isn't "also" the won sign, it _is_ the won sign. If it's anything like Japanese, there is no single-byte backslash, there might be a double-byte wide backslash but that's a different character and different codepoint, and of course there's at least two backslashes in Unicode but one of them has no counterpart in the ANSI code page.

2. When I used the mouse to copy and paste from your posting, my submission here has a won sign in it. I wonder how that came about. I can't input it. My keyboard has a yen sign. (Actually my keyboard has two of them, it has a yen sign that looks like a yen sign and generates a yen sign, and it has a yen sign that looks like a backslash and generates a yen sign. Different graphics for historical reasons, different scan codes, but the same identical codepoint and character, a yen sign.)

> Me: Well, ANSI does not have Korean in it,
> and there is no Won.

If it's anything like Japanese, that's completely wrong. The ANSI code page for Japanese is 932 and it has a single-byte yen sign, codepoint 0x5C. If I recall correctly the ANSI code page for Korean is 936 and it has a single-byte won sign, codepoint 0x5C.

ASCII doesn't have Korean or a won sign or a yen sign. It also doesn't have any codepoint larger than 127.

ANSI code pages for small character sets based originally on Italian alphabetic characters also don't have a won sign or a yen sign but do have codepoints going up to 255. I'd guess you meant to talk about these ANSI code pages, but these are kind of irrelevant in a conversation about ANSI code page 936, or a conversation about trying to create some form of compatibility between Unicode and ANSI code page 936.

I don't know if the number of meaningful sort orderings for Korean is larger or smaller than for Japanese. In Japanese I can't imagine labelling just one of them as "the Japanese (Unicode) sort". I can guarantee that Windows doesn't have a sort ordering that would match my local phone book. Doesn't matter if it's Unicode or not. If you find what I did, you could create a sort ordering for it, but you don't have it now.

Hello Norman,

#1 -- Actually, collation on Windows always uses Unicode, and every Korean sort that has ever existed on Windows has put that backslash character U+005c as looking like the Won.

#2 -- It *is* the Won, I made it the real Won so that you did not have to have a Korean system locale to read the article and see it.

I meant the Microsoft meaning of ANSI, where there is no dedicated Won character other than the thing that is the backslash in paths.

What you see is exactly what I was describing, except you see it for the Yen. Except since Andrea had neither Japanese nor Korean settings, I was explaining it to someone who sees a backslash.

None of the so called "Unicode" sorts for Japanese or for Korean are meaningful -- that was my point. They are not only meaningless, they are also useless....

12/19/2004 10:24 PM Michael Kaplan

> Actually, collation on Windows always uses
> Unicode

I'm still not quite sure of the relevance. A sort ordering based on codepoint comparisons could be useful for the same purposes as the invariant ordering, i.e. not for any purpose usable in communicating information to humans. All other sort orderings must be based on some characteristic other than the codepoint values, in which case it doesn't matter what encoding scheme you use, you must get the things sorted by the chosen characteristic.

> and every Korean sort that has ever existed
> on Windows has put that backslash character
> U+005c as looking like the Won.

Aha, that is useless. The character U+005c is a backslash, it isn't a single-byte character, and if Korean character sets are anything like Japanese then it doesn't even exist in Korean character sets (in other words it doesn't exist in Japanese character sets). It doesn't look like a won sign and it doesn't look like a yen sign, it just isn't displayable unless you change fonts. And it certainly isn't the single-byte character with codepoint 0x5c, because in ANSI code page 936 codepoint 0x5c is a won sign and in ANSI code page 932 codepoint 0x5c is a yen sign. Japanese character sets do include a wide character, double-byte backslash, which can be used in displaying a backslash if you don't mind the fact that it appears wider than the one you wanted.

(By the way the ISO and ANSI committees on the C and C++ languages screwed up with it too.)

> I meant the Microsoft meaning of ANSI

Huh? I've read a few dozen MSDN pages which seem to be aware of the fact that there are a ton of ANSI code pages. One of those code pages includes a single-byte won character, one of those code pages includes a single-byte yen character, and others don't.

I guess what I am saying is that for Japanese and Korean, we MUST sort U+005c in the same way that we sort those other characters since they look the same. This happens even in the valid code pages.

In addition, somebody thought it would be a good idea to add some sorts that *only* do this and nothing else. The point of my posts here is that this was eventually recognized as a bad idea and removed.

Thus these articles that describe two awful sorts that we are happy to be rid of. :-)

12/20/2004 4:54 PM Michael Kaplan

> I guess what I am saying is that for
> Japanese and Korean, we MUST sort U+005c
> in the same way that we sort those other
> characters since they look the same.

But they don't look the same. U+005c is a backslash. The single-byte Korean codepoint 0x5c is a won sign and does not look like a backslash. The single-byte Japanese codepoint 0x5c is a yen sign and does not look like a backslash.

Regarding looks, there is no way to display a U+005c without switching fonts. (Though at least in Japanese it's possible to display a wide character that looks very close to it because it's also a backslash.)

Regarding sorting and other internal operations, U+005c exists even though it can't be displayed, and it should be sorted as the backslash that it is.

> They look identical on a Japanese or a
> Korean setting.

U+005c cannot be displayed in a Japanese setting. The character that looks closest, well sorry I don't want to take time to look it up now, but you know what a full-width double-byte backslash looks like. It does not look like a yen sign.

In a NON-UNICODE sort, in a sort based on Japanese encoding, of course the codepoint's value 0x5c should sort as 0x5c. There it is not a backslash, it is a yen sign or it is a won sign or whatever. (And also of course this is still a sort for some purpose other than human interaction, since human-oriented sorts such as phone books still have the same issues that they have regardless of which binary encoding system is used for them.)

> you are in the minority in both Korea and
> Japan....

Indeed I think so, and here's the reason: up to this point I was trying to give serious treatment to Unicode sorts as you were trying, instead of ANSI codepage sorts.

Actually, no. Even the Unicode sort does this, for valid reasons in the marketplace. Beyond that, note that there is no separate sort for non-Unicode. The "A" APIs convert and call the "W" APIs. There is only one set of tables for collation.
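The "A calls W" layering can be sketched in a few lines of Python. This is a toy model, not the Windows implementation: `compare_w` and `compare_a` are hypothetical names, and the stand-in Unicode comparison here is plain code point order rather than a real collation table.

```python
def compare_w(left: str, right: str) -> int:
    """Stand-in for the Unicode comparison: plain code point order."""
    return (left > right) - (left < right)


def compare_a(left: bytes, right: bytes, code_page: str) -> int:
    """Decode with the given code page, then defer to the Unicode comparison."""
    return compare_w(left.decode(code_page), right.decode(code_page))


if __name__ == "__main__":
    # The byte strings never get their own comparison logic.
    print(compare_a(b"abc", b"abd", "cp932"))
```

The point of the shape is that there is only one place where ordering is decided, so any quirk of the Unicode tables shows up for code page callers as well.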

Note that this happens on *all* Japanese and Korean sorts.

THIS posting was a recapturing of an explanation for a sort that does this and nothing else. Since you think it's a bad idea anyway, perhaps we can just agree that a sort that does only this was a bad idea and then walk away....

> Actually, no. Even the Unicode sort does
> this, for valid reasons in the marketplace.

When you say "does this", I guess "this" means sort U+005c the same way as a code page's 0x5c was sorted? Which marketplace wants that?

For a Unicode sort other than the invariant one, I have the impression that there was some effort to make the sort somewhat compatible with a non-Unicode sort, which would not have been a Microsoft "A" API calling a Microsoft "W" API, but would have been a computer-centric sort ordering based on code points in a national or linguistic code page. In Japanese, the yen sign comes between the left bracket and the right bracket, so you would want U+whatever the code point is for yen sign to come between U+005b and U+005d. In Korean you would want U+whatever the code point is for won sign to come between U+005b and U+005d. Then, even though old databases don't get their contents transcoded into Unicode, new databases that use Unicode could get sorted the same way as the old databases got sorted.

Since a single-byte backslash didn't exist in the old code page, U+005c could be added to the new Unicode whichever-variant sort ordering, in places where other characters get added.

Japanese government databases cannot store my wife's name. I recommended to them to misspell my wife's name to approximate the pronunciation rather than approximating the appearance, because other Japanese adaptations of foreign words usually try to approximate pronunciations. If a future government database uses Unicode, if it will become possible to store my wife's name, it doesn't necessarily mean that the new character should get squashed into the same sorting position that Latin-1 put it in.

Of course all of the above are not phone book sorts, they are just ways to match the existing national-but-not-phonebook sorts.

12/22/2004 1:58 PM Michael Kaplan

> Both characters (the Yen and the Won) sort in
> the same place as the backslash on Microsoft
> platforms when you specify you want Japanese
> or Korean as a default user locale.

I'm afraid I don't understand this. When Japanese or Korean is the default user locale, there is no single-byte backslash, so how can anything else sort in the same place as that? If you mean that the character whose codepoint is 0x5c sorts in between characters whose codepoints are 0x5b and 0x5d, then I'd say it looks pretty reasonable.

> I assume you run with a Japanese user locale.

No kidding. If most computers sold in your country are sold with your country's locale set as the default, and most of the things that you use them for at both work and home work at least as well under that locale as under alternatives, then wouldn't you usually refrain from switching?

> If you have never noticed a problem before
> then you likely do not object to the
> behavior.

That much is true, for at least three reasons.

1. I haven't usually needed to do that kind of sorting. When I need to remove duplicates from a list, it is convenient to sort and then weed out adjacent lines that are duplicates, but it doesn't really matter what order they're in.

2. When Outlook Express doesn't even sort things in the same order as Outlook Express, it's good for laughs, but it isn't a problem (at least for me). I don't know which sort rules it's using and don't care. Again no objection, just a smirk.

3. When Windows Explorer sometimes sorts things differently than the way previous versions of Windows Explorer used to sort them, sometimes it becomes a nuisance. I don't want to take the time to write details right at the moment. But this does not involve symbols, so again it is not an objection to the item that you're mentioning.

Try it this way, maybe it will help: :-)

Pretend there is nothing but Unicode, since from my point of view, there isn't. The "A" version of the function just converts to Unicode anyway....

If you are not using Unicode then the post is not relevant to you, but if you are not using Unicode then almost half of the characters that the government put into JIS X 0213 are unavailable to you, so I would recommend upgrading to Unicode at some point. :-)

When you select the Korean LCID (0x0412), U+005c will sort equivalently to U+20a9 (WON SIGN). When you select the Japanese LCID (0x0411), U+005c will sort equivalently to U+00a5 (YEN SIGN).
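A minimal Python sketch of that equivalence, for readers who want something to poke at. It is a toy model under stated assumptions: Windows uses real collation tables rather than a dictionary, and `EQUIVALENTS`, `fold_for_sort` and `sorts_equal` are hypothetical names invented here. WON SIGN is written with its charted code point, U+20A9.

```python
# Toy model only: Windows uses real collation tables, not a dict like this.
LCID_JAPANESE = 0x0411
LCID_KOREAN = 0x0412

# Hypothetical table: the currency sign that U+005C is treated as, per LCID.
EQUIVALENTS = {
    LCID_JAPANESE: "\u00a5",  # YEN SIGN
    LCID_KOREAN: "\u20a9",    # WON SIGN
}


def fold_for_sort(text: str, lcid: int) -> str:
    """Replace U+005C with the locale's currency sign before comparing."""
    currency = EQUIVALENTS.get(lcid)
    if currency is None:
        return text  # other locales: the backslash stays a backslash
    return text.replace("\u005c", currency)


def sorts_equal(a: str, b: str, lcid: int) -> bool:
    """True when the two strings compare as equal under the toy folding."""
    return fold_for_sort(a, lcid) == fold_for_sort(b, lcid)


if __name__ == "__main__":
    print(sorts_equal("\\100", "\u20a9100", LCID_KOREAN))
    print(sorts_equal("\\100", "\u20a9100", LCID_JAPANESE))
```

Under this model the backslash only matches the Won for the Korean LCID and only matches the Yen for the Japanese one; any other LCID leaves it alone.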

For the numbered sections (I will recommend you number from now on so the references are more obvious <grin>):

#1 -- makes sense

#2 -- no idea what you mean here, but I'd rather not go there, it stinks of an "ill wind" direction for conversation....

#3 -- they use almost the same API except the Shell does the "sort ASCII digits as numbers" thing (cf: StrCmpLogicalW). I will be talking about that some other day, don't worry....
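The "sort ASCII digits as numbers" idea can be sketched as a sort key in Python. This is not StrCmpLogicalW itself, only an approximation of the behavior described above: `logical_key` is a hypothetical helper, and the case folding via `lower()` is an assumption of this sketch.

```python
import re

# Only ASCII digits count as numbers; full-width digits stay plain text.
_ASCII_DIGITS = re.compile(r"([0-9]+)")


def logical_key(name: str) -> list:
    """Split into text runs and ASCII digit runs; digit runs become integers."""
    parts = _ASCII_DIGITS.split(name)
    # With one capture group, odd indexes of the split result are digit runs,
    # so text always lines up with text and numbers with numbers.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


if __name__ == "__main__":
    names = ["file10.txt", "file2.txt", "file1.txt"]
    print(sorted(names))                   # plain string order
    print(sorted(names, key=logical_key))  # digit runs compared as numbers
```

A plain sort puts `file10.txt` before `file2.txt`; with the key, the numeric runs are compared as 1, 2, 10.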

12/23/2004 7:36 PM Michael Kaplan

> if you are not using Unicode then almost
> half of the characters that the government
> put into JIS X 0213 are unavailable to you

I'm not sure how many of the government's own machines have fonts capable of displaying those, and/or allow (politically allow) their usage. In business documents I've neither seen nor used them. In experimentation around 10 years ago, I saw them in EUC, didn't see them in Shift-JIS, and didn't see Unicode in use yet. Outside of experimentation I didn't see them used even in EUC.

> When you select the Korean LCID (0x0412),
> U+005c will sort equivalently to U+20a9
> (WON SIGN). When you select the Japanese
> LCID (0x0411), U+005c will sort equivalently
> to U+00a5 (YEN SIGN).

"Equivalently" sounds fine to me. Now, in each case does that location fall in between U+005b and U+005d? If yes, then it's compatible with sorting in each code page. If no, then I think it's pretty obvious why no one wanted to use it.

> #2 -- no idea what you mean here

It becomes visible depending on which newsgroups you subscribe to and if you watch carefully when it's downloading. (This doesn't mean you have to hunt it down, unless you wish. As mentioned I don't object but only smirk, and I don't think it's a problem.)

> #3 [...] "sort ASCII digits as numbers"

OK, I haven't read it enough. I thought I had vaguely read a summary that it sorted numerals as numbers and that it had been internationalized. If it's only intended to work with ASCII then it's meeting its intent, but the result is less consistent than the old style.

If you are not using the characters then no worries. I'd still recommend moving to Unicode as you will otherwise be more and more likely each year to start running into problems....

It is not between U+005b and U+005d. The sort you refer to is not compatible with what Windows has been doing since NT 3.51 and Windows 95 JPN. No one has complained yet, though....

For the "digits as numbers" stuff, it only does ASCII digits. Like I said, I'll talk about it more another day. :-)

I use the characters, I just didn't need to worry about what the sort ordering was. As already mentioned, most of the times I've needed to do sorting, it was just to put lines with equal values next to each other so I could weed out duplicates, I didn't need to worry about the ordering.

As for those who do need to worry about the ordering, and who need it to be the same order as in other systems whose character sets include JIS-Romaji (i.e. not IBM mainframes), I did an experiment. I opened a cmd.exe window, typed the sort command a few times, and typed some input each time.

> It is not between U+005b and U+005d.

Right, I discovered that. So I see why no one likes it.

> The sort you refer to is not compatible with
> what Windows has been doing since NT 3.51
> and Windows 95 JPN.

For some reason I never noticed if Windows NT4, 95, and 98 had a sort command in their cmd.exe or command.com windows. I'll have to experiment with them some time.

> No one has complained yet, though....

Now I'm confused again. You started this and a related note with comments on the unpopularity of the move of the (in each case one) national character with code point 0x5c, to a position where it would not fall between 0x5b and 0x5d. Surely that is because there were complaints?

But as you say, there weren't complaints about it in Windows 95. If there were complaints about it in Windows 2000 but not in Windows 95 and 98, surely it's because corporations didn't put their databases on Windows 95 systems but they do put databases on Windows 2000 and 2003 systems. (Here I mean their maybe large databases used for corporate operations, not Access databases.) Whether there would be other needs for things to sort the same way as the JIS-Romaji standard says they will, I can't think of any offhand, and we'd need advice from people who do. From the base note, it sounds like you had some.

> The sort you refer to is not compatible with
> what Windows has been doing since NT 3.51
> and Windows 95 JPN.

Over the weekend I experimented with Windows 95 Japanese and Windows 98 Japanese. Their sort commands put the yen sign in between the left bracket and right bracket, exactly where it belongs in this kind of simple sort. Sorry I didn't experiment with Windows NT4 yet, I plan to put one in a Virtual PC when I have time.

Anyway, the JIS-Romaji standard defines code points and a simple sort by codepoints using grade 2 arithmetic[*] says that the character whose codepoint is 0x5c comes between the character whose codepoint is 0x5b and the character whose codepoint is 0x5d. Windows 95 and 98 did it, Windows XP doesn't do it, I plan to test Windows NT4, and I forgot about Windows 2000.

So when your company moved one codepoint in defining a Korean (Unicode) sort and Japanese (Unicode) sort, your company moved the wrong one. There's no point telling Japanese and Korean users that now they can suddenly start sorting their own symbols the way foreigners have been sorting their symbols. You need to tell them that if they find the means to switch to Unicode then at least they will still be able to sort their own symbols the way they always have been. I think there would be a difference in the degree of popularity.

> and remember, I am talking about U+005c

But that's not the character that you moved. That is a narrow-width backslash, no one in Japan (and I'd guess no one in Korea) ever sorted one in national code pages, and in Japanese and Korean sort orderings that character does not belong between U+005b and U+005d.

[* Previously I said "sort the same way as the JIS-Romaji standard says they will" which is not really accurate, JIS Romaji defines the codepoints and grade 2 arithmetic had something about meanings of < and > on numbers.]

I cannot speak about Windows 95 -- not only is it no longer a supported operating system, but I am not running it anywhere and it has literally been years since I have seen its source code.

On NT-based systems, the code point is sorted where the respective currency sign is located, for both 0x0411 and 0x0412 (as it was for 0x10411 and 0x10412 when they existed).

I also do not really use non-Unicode stuff, but I can see that 0x5c maps to U+005c on both codepages 932 and 949.
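That mapping is easy to check outside Windows as well. The sketch below uses Python's bundled `cp932` and `cp949` codecs, which are maintained separately from the Windows tables, so treat it as corroboration rather than as the Windows behavior itself; `decode_0x5c` is a hypothetical helper name.

```python
def decode_0x5c(code_page: str) -> str:
    """Decode the single byte 0x5C and report the resulting code point."""
    ch = bytes([0x5C]).decode(code_page)
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    for code_page in ("cp932", "cp949"):
        print(code_page, decode_0x5c(code_page))
```

Both code pages report U+005C for the byte, which matches what the conversion tables show.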

I understand what you believe to be correct here, and we may have to simply agree to disagree at this point. We are simply going around and around without injecting any new information....

> I cannot speak about Windows 95

Well you did, and I quoted and responded.

> no longer a supported operating system

Now biting my tongue off.

> I also do not really use non-Unicode stuff,
> but I can see that 0x5c maps to U+005c on
> both codepages 932 and 949.

I don't understand. 0x5c is 0x5c, it is a single-byte character in a code page, but it is not U+005c.

If converting from code page ASCII to Unicode then 0x5c converts to U+005c. If converting from code page 932 to Unicode, then 0x5c should convert to U+00a5. If converting from code page 936 to Unicode, then 0x5c should convert to U+20a9. Converting Kanji from code page 932 to Unicode does work, and I don't recall trying it with a yen sign but if it doesn't work then the converting routine is broken.

The reverse is a bit longer to describe.

If converting from Unicode to code page some-8bit-extension-of-ASCII then U+005c converts to 0x5c, U+00a5 converts to 0xa5, and U+20a9 converts to the substitution value for unconvertible characters. If converting to strict ASCII then U+00a5 also converts to the substitution value.

If converting from Unicode to code page 932 then U+005c converts to the substitution value, U+00a5 converts to 0x5c, and U+20a9 converts to ... if Shift-JIS has a won sign then that's it, otherwise the substitution value. Converting Kanji from Unicode to code page 932 does work, and I don't recall trying it with a yen sign but if it doesn't work then the converting routine is broken.

If converting from Unicode to code page 936 then U+20a9 converts to 0x5c, and I don't know if either of the others are convertible or not.

I wrote broken conversion routines in the past (not for Unicode) and when the brokenness was observed I fixed them immediately.

> we may have to simply agree to disagree at
> this point.

You told readers that you got complaints, and I'm trying to show you why.

Last post, dude....

On a Japanese system, the Japanese system font will cause U+005c to also look like a yen. Everywhere. Including file paths, it's impossible to miss -- yens everywhere.

So the comparison function will make U+005c look equivalent to U+00a5. Since they look the same, they will compare the same. This IS what most people expect.

And if you convert U+005c to code page 932, guess what it converts to? Yep -- 0x5c. When you convert back to Unicode, you get U+005c.

Now through the magic of "best fit" mappings (which I will cover another day), U+00a5 also maps to 0x5c on cp 932. It's a one-way mapping, obviously. But it's there. I am staring at the tables now.

If it were not, then it would convert to the default character. Which is definitely not an improvement.

I understand you do not like it, I understand you think it's wrong. I do not even totally disagree with you.

As I said, this is the last post. You can have the last word in your blog. :-)

Ok, one more adjunct.

Look at http://www.microsoft.com/globaldev/reference/dbcs/932.htm which has the official mapping for cp932. It only has the 0x5c to U+005c mapping, as the other (best fit) mapping is not documented; none of the best fit mappings are.

If you were right Norman, then paths would not work on systems with a Japanese default system locale -- because the path separator would be lost any time you converted to Unicode and back.

That would be very bad -- systems would not even boot!

Your adjunct is convincing, if you permit your adjunct:

> If you were right Norman, then paths would
> not work on systems with a Japanese default
> system locale -- because the path separator
> would be lost any time you converted to
> Unicode and back.

Windows paths work with a Japanese default system locale because 0x5c is the yen sign and the yen sign is the path separator. I think you are saying that Windows paths would be broken when converting to Unicode, because U+00a5 isn't the path separator in Unicode, U+005c is. You are right, it is not possible to convert both text and pathnames correctly in any API call (unless another flag is added to say which it is), and pathnames are more important (without them you don't even get to open the text).

Nonetheless...

> On a Japanese system, the Japanese system
> font will cause U+005c to also look like a
> yen.

That's on a Japanese WINDOWS system, and not everywhere, though Microsoft has been gradually extending it closer and closer to everywhere.

Dude, NT based systems are UNICODE internally. Code page 932 is a convenience for apps that are not yet Unicode, but the OS always converts.

I can get the same visual behavior by changing my default system locale to Japanese. All conversions of 0x5c go to U+005c. And the converse is also true.

U+00a5 is the odd one out, since roundtrip conversions will do the following:

U+00a5 --> 0x5c --> U+005c --> 0x5c --> U+005c

See how quickly the Yen disappears?

Thank goodness they are considered equal by the comparison functions!
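The round trip above can be replayed with a small Python model. The two tables are toy stand-ins that contain only the mappings described in this thread (the documented 0x5c mapping plus the one-way best-fit mapping for the Yen); they are not the real cp932 tables, and `DECODE`, `ENCODE` and `round_trip` are hypothetical names.

```python
# Toy tables for byte 0x5C on code page 932, limited to what the thread describes.
DECODE = {0x5C: "\u005c"}  # documented mapping: 0x5c -> U+005c
ENCODE = {
    "\u005c": 0x5C,  # documented mapping: U+005c -> 0x5c
    "\u00a5": 0x5C,  # best-fit mapping, one way only (assumed from the thread)
}


def round_trip(ch: str, times: int = 2) -> list:
    """Encode and decode repeatedly, recording every intermediate value."""
    trail = [f"U+{ord(ch):04X}"]
    for _ in range(times):
        byte = ENCODE[ch]
        trail.append(f"0x{byte:02X}")
        ch = DECODE[byte]
        trail.append(f"U+{ord(ch):04X}")
    return trail


if __name__ == "__main__":
    print(" --> ".join(round_trip("\u00a5")))
```

Starting from U+00A5 the trail is U+00A5, 0x5C, U+005C, 0x5C, U+005C: the Yen is gone after the first conversion, while starting from U+005C the value never changes.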