LCMapString's *other* job

by Michael S. Kaplan, published on 2005/06/24 02:41 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/24/432121.aspx


To me, the NLS API function LCMapString has a full-time job, one that is crucial to the fundamental fabric of Windows -- sort key generation. I know it is crucial because if I accidentally mess up the tables, there are components that will not even let Windows finish booting up, for fear of corrupting information!

Clearly the order is important, and the ability to create indexes that use this order is therefore equally important. It is no accident that the name of this blog is Sorting It All Out, you know. Keeping it all in some order so that no matter how complex it is, you can eventually work your way through it is (to me) a hideously important operation that is crucial to all of our work. And not just our international work -- if we mess up the simple ABCs then how can we ever hope to handle really complex operations?
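Just to give a feel for the day job before we get to the side work, here is a minimal sketch (not production code -- error handling is left out, the strings and buffer sizes are picked arbitrarily, and LOCALE_USER_DEFAULT is just a convenient stand-in) of generating sort keys with the LCMAP_SORTKEY flag and comparing them the way an index might:

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const WCHAR a[] = L"co-op";
        const WCHAR b[] = L"coop";
        BYTE keyA[256], keyB[256];

        // With LCMAP_SORTKEY the destination is treated as a byte buffer and
        // cchDest is a count of bytes, which is why the casts are there.
        int cbA = LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_SORTKEY,
                               a, -1, (LPWSTR)keyA, sizeof(keyA));
        int cbB = LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_SORTKEY,
                               b, -1, (LPWSTR)keyB, sizeof(keyB));

        // Sort keys are opaque byte strings; a plain binary comparison of the
        // keys gives the same ordering CompareString would give for the strings.
        int cmp = memcmp(keyA, keyB, (size_t)(cbA < cbB ? cbA : cbB));
        printf("\"co-op\" %s \"coop\"\n",
               cmp < 0 ? "sorts before" : (cmp > 0 ? "sorts after" : "sorts the same as"));
        return 0;
    }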

However, just as an artist often has to wait tables or bag groceries in order to pay her rent, LCMapString has been forced to take on some side work. :-)

(Now I know that is a completely revisionist way of looking at things and I know it's not really how things are. But it's a more convenient way of looking at things for me, so like the Bohr model of the atom I am going to let the "not entirely accurate" model stand, since it is a useful way of looking at things!)

Anyway, I thought I'd take a look at some of the other work this conversion function does....

I hint at uses of a few of these conversions in my post A few of the gotchas of CompareString but now I am going to try to lay it all out, once and for all.

As a side note, Julie Bennett once told me that she thought this function was kind of a hack, not because it wasn't useful but because it did too much, all in one place. It really wanted to be several separate functions. Which is, I guess, a hazard of taking on too many part-time jobs -- no one knows what your actual occupation is!

Here are the additional flags and what they do:

LCMAP_BYTEREV -- a very useful little conversion, whether used by itself or with any of the other conversions -- it will reverse the bytes in each word of a string. As the Platform SDK topic indicates, for example, if you pass in 0x3450 0x4822 the result is 0x5034 0x2248. The conversion is the equivalent of lpDestStr[ich] = MAKEWORD(HIBYTE(lpSrcStr[ich]), LOBYTE(lpSrcStr[ich])) across the whole string. Good honest side work for our girl!
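A quick sketch of that one in action (nothing about byte reversal is locale-specific as far as I know, so LOCALE_USER_DEFAULT is just a placeholder):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        // The two code units from the Platform SDK example above.
        WCHAR src[] = { 0x3450, 0x4822 };
        WCHAR dest[2] = { 0 };

        // LCMAP_BYTEREV swaps the two bytes of every WCHAR in the string.
        if (LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_BYTEREV, src, 2, dest, 2) > 0)
        {
            // Expect 0x5034 0x2248.
            printf("0x%04X 0x%04X\n", dest[0], dest[1]);
        }
        return 0;
    }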

LCMAP_FULLWIDTH -- Converts each half width character it encounters to the equivalent full width one (all other characters pass through unchanged). Thus ﾎ (U+ff8e, a.k.a. HALFWIDTH KATAKANA LETTER HO) becomes ホ (U+30db, a.k.a. KATAKANA LETTER HO). In the legacy Japanese code page, full width characters took up twice as much space (the half width characters were in the 'high ansi' range of the code page, greater than 0x7f but less than 0xff, whereas the full width ones were double-byte characters). As a convention the full width ones are twice as wide -- the same width as the ideographs -- and according to some people are considered to be less aesthetically pleasing. In Unicode obviously both sets take up two bytes, but the typographic tradition continues to this day, and therefore this is no longer really just a 'legacy code page' issue -- it is a real typographic difference.

LCMAP_HALFWIDTH -- Converts each full width character it encounters to the equivalent half width one (all other characters pass through unchanged). Thus ワ (U+30ef, a.k.a. KATAKANA LETTER WA) becomes ﾜ (U+ff9c, a.k.a. HALFWIDTH KATAKANA LETTER WA). See the previous LCMAP_FULLWIDTH text for more information on the difference between them.
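A small sketch of the two width flags round-tripping the HO from above (again, I do not believe the choice of locale matters here, so LOCALE_USER_DEFAULT stands in):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const WCHAR halfHo[] = { 0xFF8E, 0 };  // HALFWIDTH KATAKANA LETTER HO
        WCHAR full[4] = { 0 };
        WCHAR half[4] = { 0 };

        // Half width -> full width: expect U+30DB (KATAKANA LETTER HO).
        LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_FULLWIDTH, halfHo, -1, full, 4);

        // Full width -> half width: expect U+FF8E again.
        LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_HALFWIDTH, full, -1, half, 4);

        printf("U+%04X -> U+%04X -> U+%04X\n", halfHo[0], full[0], half[0]);
        return 0;
    }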

LCMAP_HIRAGANA -- Converts each Katakana character to the equivalent Hiragana one (all other characters pass through unchanged). Thus ヅ (U+30c5, a.k.a. KATAKANA LETTER DU) becomes づ (U+3065, a.k.a. HIRAGANA LETTER DU). The more literal meaning of Hiragana is "smooth kana." The differences between Hiragana and Katakana are a bit beyond the scope of this blog post, but there is a fascinating Wikipedia article on it that covers the topic, and includes the poem Iroha-uta ("Song of colours"). This poem comes from the 10th century, and in a very cool way uses every hiragana once (and proves to me that Hiragana is more lyrically suited than English with its 'the quick brown fox jumps over the lazy dog' nonsense!):

いろはにほへと      Iro ha nihohe to      Even if colours have sweet perfume
ちりぬるを          chirinuru wo          eventually they fade away
わかよたれそ        wakayo tare so        What in this world
つねならむ          tsune naramu          is eternal?
うゐのおくやま      uwi no okuyama        The deep mountains of vanity
けふこえて          kefu koete            I cross them today
あさきゆめみし      asaki yume mishi      renouncing the superficial dreams
ゑひもせすね        wehi mo sesu ne       not giving in to their madness any more

LCMAP_KATAKANA -- Converts each Hiragana character to the equivalent Katakana one (all other characters pass through unchanged). Thus ま (U+307e, a.k.a. HIRAGANA LETTER MA) becomes マ (U+30de, a.k.a. KATAKANA LETTER MA). The more literal meaning of Katakana is "partial kana." Again, the differences between Hiragana and Katakana are a bit beyond the scope of this blog post, but there is a fascinating Wikipedia article on this script too that covers the topic. The article does mention one important difference in usage between the two scripts:

Katakana spelling differs slightly from hiragana. While hiragana spells long vowels with the addition of a second vowel kana, katakana uses a vowel extender mark. This mark is a short line following the direction of the text (horizontal in horizontal text, vertical in columns).

Neither the Hiragana nor the Katakana conversions in Windows extend to cover this particular convention, though it is fascinating to contemplate doing so some day, in some kind of extension to the "linguistic casing" notion I'll talk about in a bit. Interesting feature idea, if it truly is the convention. :-)
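In the meantime, here is a minimal sketch of the two kana flags as they exist today, round-tripping the MA example above (no long-vowel cleverness, just the straight character-by-character mapping; the locale is again a placeholder):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const WCHAR kataMa[] = { 0x30DE, 0 };  // KATAKANA LETTER MA
        WCHAR hira[4] = { 0 };
        WCHAR kata[4] = { 0 };

        // Katakana -> Hiragana: expect U+307E (HIRAGANA LETTER MA).
        LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_HIRAGANA, kataMa, -1, hira, 4);

        // Hiragana -> Katakana: expect U+30DE again.
        LCMapStringW(LOCALE_USER_DEFAULT, LCMAP_KATAKANA, hira, -1, kata, 4);

        printf("U+%04X -> U+%04X -> U+%04X\n", kataMa[0], hira[0], kata[0]);
        return 0;
    }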

LCMAP_UPPERCASE -- Maps lowercase characters to uppercase characters, passing through other characters unchanged. Thus ç (U+00e7, a.k.a. LATIN SMALL LETTER C WITH CEDILLA) becomes Ç (U+00c7, a.k.a. LATIN CAPITAL LETTER C WITH CEDILLA). It can be modified with the LCMAP_LINGUISTIC_CASING flag, which enables a whole bunch of new scenarios, discussed when I asked (then answered) the question What does "linguistic casing" mean. It also plays a fundamental role in the life of all but the "C" locale CRT casing operations and of functions like CharUpper, such that although technically I do not own those functions, I basically own those functions (isn't emphasis a wonderful thing? <grin>). Note that none of these wrappers uses the LCMAP_LINGUISTIC_CASING flag, which means that unless they are calling LCMapStringA there is absolutely no effect whatsoever based on the locale, and all claims to the contrary in both PSDK and CRT documentation are in the long, slow process of being fixed. The last word that I have to say about uppercasing is Georgian.

LCMAP_LOWERCASE -- Maps uppercase characters to lowercase characters, passing through other characters unchanged. Thus Ħ (U+0126, a.k.a. LATIN CAPITAL LETTER H WITH STROKE) becomes ħ (U+0127, a.k.a. LATIN SMALL LETTER H WITH STROKE). It can be modified with the LCMAP_LINGUISTIC_CASING flag, which enables a whole bunch of new scenarios, discussed when I asked (then answered) the question What does "linguistic casing" mean. It also plays a fundamental role in the life of all but the "C" locale CRT casing operations and of functions like CharLower, such that although technically I do not own those functions, I basically own those functions (isn't emphasis a wonderful thing? <grin>). Note that none of these wrappers uses the LCMAP_LINGUISTIC_CASING flag, which means that unless they are calling LCMapStringA there is absolutely no effect whatsoever based on the locale, and all claims to the contrary in both PSDK and CRT documentation are in the long, slow process of being fixed. The last word I have to say about lowercasing is Sigma.
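To make the LCMAP_LINGUISTIC_CASING point concrete, here is a sketch using the classic Turkish dotless-i example (the exact results depend on the OS version and its casing tables, so treat the expected values as the usual textbook outcome rather than a guarantee):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const WCHAR src[] = L"i";
        WCHAR plain[4] = { 0 };
        WCHAR linguistic[4] = { 0 };
        LCID turkish = MAKELCID(MAKELANGID(LANG_TURKISH, SUBLANG_DEFAULT), SORT_DEFAULT);

        // Without LCMAP_LINGUISTIC_CASING the locale is ignored:
        // 'i' simply becomes 'I' (U+0049).
        LCMapStringW(turkish, LCMAP_UPPERCASE, src, -1, plain, 4);

        // With LCMAP_LINGUISTIC_CASING the Turkish rule applies:
        // 'i' becomes U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
        LCMapStringW(turkish, LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING, src, -1, linguistic, 4);

        printf("U+%04X vs. U+%04X\n", plain[0], linguistic[0]);
        return 0;
    }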

LCMAP_SIMPLIFIED_CHINESE -- Maps traditional Chinese characters to simplified Chinese, passing through other characters unchanged. Thus 樂 (U+6a02) becomes 乐 (U+4e50). The dictionary used for this mapping is small (only 2,620 ideographs) and has not been updated since the feature was added in NT 4.0 (it was originally added at the request of people in Office, who actually ended up going with their own more sophisticated dictionary solution in Word that does a better job with the sometimes complicated mapping). Now although casing, width, and Kana mappings can all be done in place, this is not allowed for traditional->simplified Chinese mappings, even though the same restrictions (always the same length, etc.) apply here -- if any NLS testers who are reading this want to put in a bug, I'll see what I can do about fixing that!

LCMAP_TRADITIONAL_CHINESE -- Maps simplified Chinese characters to traditional Chinese, passing through other characters unchanged. Thus 侩 (U+4fa9) becomes 儈 (U+5108). The dictionary used for this mapping is even smaller (only 2,191 ideographs), since there are many times that several traditional Chinese ideographs will map to one simplified ideograph (thus these two flags are not 100% reversible versions of each other). The table has not been updated since the LCMAP_SIMPLIFIED_CHINESE one was. The same problems with in-place mapping apply here -- if any NLS testers who are reading this want to put in a bug, I'll resolve it as a duplicate of the other bug I was suggesting, above!
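And one last sketch for the two Chinese flags, using the 'happy' ideographs from above (I am using a Simplified Chinese LCID here out of caution, though as far as I know the mapping table itself is not locale-specific):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const WCHAR traditional[] = { 0x6A02, 0 };  // U+6A02, the traditional form
        WCHAR simplified[4] = { 0 };
        WCHAR roundTrip[4] = { 0 };
        LCID prc = MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_DEFAULT);

        // Traditional -> simplified: expect U+4E50. Note the separate destination
        // buffer; as mentioned above, this mapping cannot be done in place.
        LCMapStringW(prc, LCMAP_SIMPLIFIED_CHINESE, traditional, -1, simplified, 4);

        // Simplified -> traditional: expect U+6A02 back for this pair, though the
        // two flags are not 100% reversible in general.
        LCMapStringW(prc, LCMAP_TRADITIONAL_CHINESE, simplified, -1, roundTrip, 4);

        printf("U+%04X -> U+%04X -> U+%04X\n", traditional[0], simplified[0], roundTrip[0]);
        return 0;
    }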

Ok, that is probably enough for today. Tip your server (she may have subroutines to support). Enjoy the veal!

 

This post brought to you by "ﾎホワﾜヅづまマçÇĦħ乐樂侩儈" (U+ff8e U+30db U+30ef U+ff9c U+30c5 U+3065 U+307e U+30de U+00e7 U+00c7 U+0126 U+0127 U+4e50 U+6a02 U+4fa9 U+5108, a.k.a. HALFWIDTH KATAKANA LETTER HO, KATAKANA LETTER HO, KATAKANA LETTER WA, HALFWIDTH KATAKANA LETTER WA, KATAKANA LETTER DU, HIRAGANA LETTER DU, HIRAGANA LETTER MA, KATAKANA LETTER MA, LATIN SMALL LETTER C WITH CEDILLA, LATIN CAPITAL LETTER C WITH CEDILLA, LATIN CAPITAL LETTER H WITH STROKE, LATIN SMALL LETTER H WITH STROKE, HAPPY, HAPPY, BROKER, BROKER)
(A group of characters that were happy to be asked to help showcase the technology behind LCMapString!)


# Duncan on 24 Jun 2005 10:28 AM:

On top of 乐 and 樂, which are the two Chinese ways (simplified and traditional) to write the character that means 'happiness', the Japanese way is: 楽.

乐 樂 楽

# Michael S. Kaplan on 24 Jun 2005 11:27 AM:

There are often multiple ways to do it, even when you are just moving in the circle of Chinese....

# Atsushi Enomoto on 25 Jun 2005 12:35 AM:

It's only cosmetic, but the last "ne" is extraneous in Iroha Uta (ne is already in the fourth line ;-) ). I fixed the referenced Wikipedia article.

As for Simplified/Traditional Chinese and Japanese, I think some of the Parenthesized CJK ideographs (U+3220-U+3243) and the Circled ones (U+3280-U+32B0) are sorted incorrectly. For example, U+32A2 is sorted next to U+5BEB (traditional Chinese) instead of the corresponding U+5199 (Japanese).

# Michael S. Kaplan on 25 Jun 2005 3:37 AM:

Sorted incorrectly in what locale? :-)

# Atsushi Enomoto on 25 Jun 2005 10:59 AM:

Oops, I'm sorry. Apparently I needed more explanation.

What I had in mind is InvariantCulture.

In InvariantCulture U+32A2 has "ac 46 01 01 01 01" as its sortkey value, which is next to U+5BEB ("ac 45 01 01 01 01 00"). But U+32A2 is actually NFKD-compatible with U+5199. For most of the Circled CJK ideographs, each of them has a sortkey that is next to its equivalent CJK character. But some characters (like U+32A2) are mapped incorrectly in that sense.

The same discussion applies to ja(-JP) too. I know (I should mention) that CJK sort order depends on the culture (namely zh-CHS, zh-CHT, ko and possibly more). Though in ja-JP the CJK sortkey values are shifted (from AC 45 ... to 8E 64 ...), this discussion still holds true for them.

In zh-CHS it is not mapped to be equivalent anymore (and thus it keeps the InvariantCulture mapping). I think that makes sense because, as far as I have heard, Chinese people don't use that mark (actually that is one of the reasons I think it is an "incorrect" mapping).

If my argument still makes sense, I can put up some other "incorrect" mapping examples (in my view).

# Michael S. Kaplan on 25 Jun 2005 10:15 PM:

Actually, we are not a 100% normalization shop in our collation (a fact about which I have posted before). But even if we were, we would only deal with NFC and NFD -- NFKC and NFKD are both destructive operations in that they remove distinctions that are often important when they are in data.


