LCMapString's *other* job

by Michael S. Kaplan, published on 2005/06/24 02:41 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/24/432121.aspx

To me, the NLS API function LCMapString has a full-time job, one that is crucial to the fundamental fabric of Windows -- sort key generation. I know it is crucial because if I accidentally mess up the tables then there are components that are unable to even let Windows finish booting up due to fear of corruption of information!

Clearly the order is important, and the ability to create indexes that use this order is therefore equally important. It is no accident that the name of this blog is Sorting It All Out, you know. Keeping it all in some order so that no matter how complex it is, you can eventually work your way through it is (to me) a hideously important operation that is crucial to all of our work. And not just our international work -- if we mess up the simple ABCs then how can we ever hope to handle really complex operations?

However, just as an artist often has to wait tables or bag groceries in order to pay her rent, LCMapString has been forced to take on some side work. :-)

(Now I know that is a completely revisionist way of looking at things and I know its not really how things are. But it's a more convenient way of looking at things for me, so like the Bohr model for the atom I am going to let the "not entirely accurate" model stand, since it a useful way of looking at things!)

Anyway, I thought I'd take a look at some of the other work this conversion function does....

I hint at uses of a few of these conversions in my post A few of the gotchas of CompareString but now I am going to try to lay it all out, once and for all.

As a side note, Julie Bennett once told me that she thought this function was kind of a hack, not because it wasn't useful but because it did too much, all in one place. It really wanted to be separate functions. Which is I guess a hazard of taking on too many part time job -- no one knows what your actual occupation is!

LCMAP_BYTEREV -- a very useful little conversion, whether used by itself or with any of the other results -- it will reverse the bytes in each word of a string. As the Platform SDk topic indicates, dor example, if you pass in 0x3450 0x4822 the result is 0x5034 0x2248. The conversion is the equivalent of lpDestStr[ich] = MAKEWORD( HIBYTE(lpSrcStr[ich], LOBYTE(lpSrcStr[ich]) ) across the whole string. Good honest side work for our girl!

LCMAP_FULLWIDTH -- Converts each character to a full width one every time it encounters a half width one (all other characters pass through unchanged). Thus ﾎ (U+ff8e a.k.a. HALFWIDTH KATAKANA LETTER HO) becomes ホ (U+30db a.k.a. KATAKANA LETTER HO). In the legacy Japanese code page, full width characters took up twice as much space (the half width characters were in the 'high ansi' range of the code page, greater than 0x7f but less than 0xff, where the full width ones were double-byte characters). As a convention the full width ones are twice as wide -- the same width as the ideographs -- and according to some people are considered to be less aesthetically pleasing. In Unicode obviously both sets take up two bytes, but the typographic tradition continues to this day, and therefore this is no longer really just a 'legacy code page' issue -- it is a real typographic difference.

LCMAP_HALFWIDTH -- Converts each character to a half width one every time it encounters a full width one (all other characters pass through unchanged). Thus ワ (U+30ef a.k.a. KATAKANA LETTER WA) becomes ﾜ (U+ff9c a.k.a. HALFWIDTH KATAKANA LETTER WA). See the previous LCMAP_FULLWIDTH text for more information on the difference between them.

LCMAP_HIRAGANA -- Converts each Katakana character to the equivalent Hiragana one (all other characters pass through unchanged). Thus ヅ (U+30c5 a.k.a. KATAKANA LETTER DU) becomes づ (U+3065 a.k.a. HIRAGANA LETTER DU). The more literal meaning of Hiragana is "smooth kana." The differences between Hiragana and Katakana are a bit beyond the scope of this blog post, but there is a fascinating Wikipedia article on it that covers the topic, and includes the poem Iroha-uta ("Song of colours"). This poem comes from the 10th century, and in a very cool way uses every hiragana once (and proves to me that Hiragana is more lyrically suited than English with its 'the quick brown fox jumpes over the lazy dog' nonsense!):

LCMAP_KATAKANA -- Converts each Hiragana character to the equivalent Katakana one (all other characters pass through unchanged). Thus ま (U+307e a.k.a. HIRAGANA LETTER MA) becomes マ (U+30de a.k.a. KATAKANA LETTER MA). The more literal meaning of Katakana is "partial kana." Again, the differences between Hiragana and Katakana are a bit beyond the scope of this blog post, but there is a fascinating Wikipedia article on this script too that covers the topic. The article does mention one important difference in usage between the two scripts:

Neither the Hiragana nor the Katakana conversions in Windows extend to cover this particular convention, though it is fascinating to contemplate doing so some day, in some kind of extension to the "linguistic casing" notion I'll talk about in a bit. Interesting feature idea, if it truly is the convention. :-)

LCMAP_UPPERCASE -- Maps lowercase characters to uppercase characters passing through other characters unchanged. Thus ç (U+00e7, a.k.a. LATIN SMALL LETTER C WITH CEDILLA) becomes Ç (U+00c7 a.k.a. LATIN CAPITAL LETTER C WITH CEDILLA). Can be modifed with the LCMAP_LINGUISTIC_CASING flag which enables a whole bunch of new scenarios, discussed when I asked (then answered) the question What does "linguistic casing" mean and plays a fundamental role in the life of all but the "C" locale CRT casing operations and functions like CharUpper, such that although technically I do not own those functions, I basically own those functions (isn't emphasis a wonderful thing? <grin>). . Note that none of these wrappers uses the LCMAP_LINGUISTIC_CASING flag, which means that unless they are calling LCMapStringA there is absolutely no effect whatsoever based on the locale, and all claims to the contrary in both PSDK and CRT documentation are in the long, slow process of being fixed. The last word that I have to say about uppercasing is Georgian.

LCMAP_SIMPLIFIED_CHINESE -- Maps traditional Chinese characters to simplified Chinese, passing through other characters unchanged. Thus 樂 (U+6a02) becomes 乐 (U+4e50). The dictionary used for this mapping is small (only 2,620 ideographs) and has not been updated since the feature was added in NT 4.0 (it was originally added at the request of people in Office, who actually ended up going with their own more sophisticaated dictionary solution in Word that does a better job with the sometimes complicated mapping. Now although casing, width, and Kana mappings can all be done in place, this is not allowed for traditional->simplified Chinese mappings, even though the same restrictions (always the same lnegth, etc.) apply here -- if any NLS testers who are reading this want to put in a bug, I'll see what I can do about fixing that!

LCMAP_TRADITIONAL_CHINESE -- Maps simplified Chinese characters to traditional Chinese, passing through other characters unchanged. Thus 儈 (U+5108) becomes 侩 (U+4fa9). The dictionary used for this mapping is even smaller (only 2,191 ideographs) since there are many times that several traditional Chinese ideographs will map to one simplified ideograph (thus these two flags are not 100% reversible versions of each other). The table has not been updated since the LCMAP_SIMPLIFIED_CHINESE one was. Same problems with in-place update apply here -- if any NLS testers who are reading this want to put in a bug, I'll resolve it as a duplicate of the other bug I was suggesting, above!

Ok, that is probably enough for today. Tip your server (she may have subroutines to support). Enjoy the veal!

This post brought you by "ﾎホワﾜヅづまマçÇĦħ乐樂儈侩" (U+ff8e U+30db U+30ef U+30c5 U+3065 U+307e U+30de U+00e7 U+00c7 U+0126 U+0127 U+4e50 U+6a02 U+4fa9 U+5108 a.k.a. HALFWIDTH KATAKANA LETTER HO, KATAKANA LETTER HO, KATAKANA LETTER WA, HALFWIDTH KATAKANA LETTER WA, KATAKANA LETTER DU, HIRAGANA LETTER DU, HIRAGANA LETTER MA, KATAKANA LETTER MA, LATIN SMALL LETTER C WITH CEDILLA, LATIN CAPITAL LETTER C WITH CEDILLA, LATIN CAPITAL LETTER H WITH STROKE, LATIN SMALL LETTER H WITH STROKE, HAPPY, HAPPY, BROKER, BROKER)
(A group of characters that were happy to be asked to help showcase the technology behind LCMapString!)

It's so cosmetic but the last "ne" is extraneous in Iroha Uta (ne is already on the fourth line ;-). I fixed referenced wikipedia article.

As for Simplified/Traditional Chinese and Japanese, I think some of Parenthesized CJK ideographs (U+3220 - U+3243) and Circled ones (U+3280-U+32B0) are sorted incorrectly. For example, U+32A2 is sorted next to U+5BEB (traditional Chinese) instead of corresponding U+5199 (Japanese).

Oops, I'm sorry. Apparently I needed more explaination.

What I had in mind is InvariantCulture.

In InvariantCulture U+32A2 has "ac 46 01 01 01 01" as its sortkey value, which is next to U+5BEB ("ac 45 01 01 01 01 00"). But U+32A2 is actually NFKD compatible with U+5199. For most of Circled CJK ideographs, each of them have such a sortkey that is next to its equivalent CJK character. But some characters (like U+32A2) are mapped incorrectly in that sense.

The same discussion applies to ja(-JP) too. I know (I should mention) CJK sort order depends the cultures (namely zh-CHS, zh-CHT, ko and possibly more). Though in ja-JP CJK sortkey values are shifted (from AC 45 ... to 8E 64 ...), this discussion is still true to them.

In zh-CHS it is not mapped to be equivalent anymore (and thus it keeps InvariantCulture mapping). I think it makes sense because as long as I have heard Chinese people don't use that mark (actually that is one of the reason I think it is "incorrect" mapping).

If my argument still makes sense, I can put some other "incorrect" mapping examples (in my speak).

Actually, we are not a 100% normalization shop in our collation (a fact about which I have posted before). ZBut even if we were, we would only deal with NFC and NFD -- NFKC and NFKD are both destructive operations in that they remove distinctions that are often important when they are in data.

いろはにほへと	Iro ha nihohe to	Even if colours have sweet perfume
ちにぬるを	chirinuru wo	eventually they fade away
わかよたれそ	wakayo tare so	What in this world
つねならむ	tsune naramu	is eternal ?
うゐのおくやま	uwi no okuyama	The deep montains of vanity
けふこえて	kefu koete	I cross them today
あさきゆめみし	asaki yume mishi	renouncing the superficial dreeams
ゑひもせすね	wehi mo sesu ne	not giving in to their madness any more