by Michael S. Kaplan, published on 2006/11/17 20:24 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/17/1097532.aspx
I was having a conversation with someone at the 30th Internationalization and Unicode Conference earlier today, and while thinking through the design consequences beyond his questions, I realized I had found a design flaw.
It was one of those cool issues that I did not even have to try out to know it exists. :-)
Try to compare the following two strings: "가㉮" vs "㉮가".
Basically try the following three calls:
WCHAR wz1[] = L"\uAC00\u326E";
WCHAR wz2[] = L"\u326E\uAC00";
/* CompareStringW returns CSTR_LESS_THAN (1), CSTR_EQUAL (2), or CSTR_GREATER_THAN (3); 0 means failure. */
printf("%d\n", CompareStringW(MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT), 0, wz1, -1, wz2, -1));
printf("%d\n", CompareStringW(MAKELANGID(LANG_KOREAN, SUBLANG_DEFAULT), 0, wz1, -1, wz2, -1));
printf("%d\n", CompareStringW(MAKELANGID(LANG_FRENCH, SUBLANG_DEFAULT), 0, wz1, -1, wz2, -1));
The results will be:
1
1
3
In other words, using French turned a CSTR_LESS_THAN into a CSTR_GREATER_THAN.
Can you guess why?
Hint #1: They ask me "why is my Korean text in random order?"
Hint #2: French collation: When diacritical becomes diabolical
Yes, that is right: in all locales, the circled version of a Hangul syllable is considered to have a secondary difference when compared to the non-circled one, and in French the diacritic (secondary) weights are compared in reverse order, starting from the end of the string.
So when you use a French sort to look at Hangul, the order will end up being reversed!
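To make the mechanism concrete, here is a minimal sketch in portable C of a two-level comparison. It is not how CompareStringW is implemented. The ToyElement type, the toy_compare function, and the weight values are all hypothetical, made up for illustration. The sketch assumes only what is described above: that 가 and ㉮ share a primary weight and differ at the secondary level, and that the French-style comparison walks the secondary weights from the end of the string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical collation element: the weights are invented for this sketch. */
typedef struct {
    int primary;    /* base letter / syllable */
    int secondary;  /* "diacritic" level: 0 = plain, 1 = circled */
} ToyElement;

/* Same numeric values as the CSTR_* results of CompareStringW. */
enum { TOY_LESS_THAN = 1, TOY_EQUAL = 2, TOY_GREATER_THAN = 3 };

/* "가㉮" and "㉮가": same primary weights, secondary weights in swapped positions. */
const ToyElement toy_ga_then_circled[] = { {100, 0}, {100, 1} };
const ToyElement toy_circled_then_ga[] = { {100, 1}, {100, 0} };

/* Compare two equal-length sequences of n elements.
   reverse_secondary != 0 scans the secondary level from the end (French style). */
int toy_compare(const ToyElement *a, const ToyElement *b, size_t n,
                int reverse_secondary)
{
    size_t i;
    assert(a != NULL && b != NULL);

    /* Level 1: primary weights, always left to right. */
    for (i = 0; i < n; i++) {
        if (a[i].primary != b[i].primary)
            return a[i].primary < b[i].primary ? TOY_LESS_THAN : TOY_GREATER_THAN;
    }

    /* Level 2: secondary weights, left to right or right to left. */
    for (i = 0; i < n; i++) {
        size_t k = reverse_secondary ? n - 1 - i : i;
        if (a[k].secondary != b[k].secondary)
            return a[k].secondary < b[k].secondary ? TOY_LESS_THAN : TOY_GREATER_THAN;
    }
    return TOY_EQUAL;
}
```

Under these assumptions, toy_compare(toy_ga_then_circled, toy_circled_then_ga, 2, 0) returns 1, and the same call with reverse_secondary set to 1 returns 3. The only thing that changed is the direction of the secondary scan, which is why the flip cannot be limited to the characters that actually carry French accents.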
Clearly the reverse diacritics sort would not be expected to apply to Hangul, but just as clearly reversing the diacritic weights has to affect the whole string, and there is no good way to separate the two (other than to not handle those Korean differences using secondary weights, even though they have a secondary difference).
Anyway, it seemed like an interesting issue to me, so I thought I'd share it!
This post brought to you by ㉮ (U+326E, a.k.a. CIRCLED HANGUL KIYEOK A)