Complex string mapping

by Michael S. Kaplan, published on 2006/10/20 10:03 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/20/847933.aspx


So, the note I got the other day was:

Hi Michael,

This is Chris, we met at lunch at the Unicode conference in San Francisco some months back.

...

I have a new question and judging from your blog, you might have the answer to this one as well. I'd like to set up our system to normalize text that enters our system to full-width kana and half-width latin.

I noticed that LCMapString provides the option to convert to full-width or half-width. However, it appears that it will convert everything to one or the other. Is there a way to get what I'm looking for without first analyzing the text to determine if it contains latin and/or kana before calling the function? or am I better off just implementing the conversion on my own?

Hope all is well,

Chris
 

This is the sort of thing that could make a nice interview question if the candidate either had some familiarity with Unicode or you took a few minutes to explain the available, relevant NLS API functions. For today I'll just go ahead and answer it; if you want to treat it like an interview question then don't look at the code below. :-)

Well, there is no way to do everything automatically in a single function call -- the mapping flags that make up LCMapString's other job apply to the whole string passed to the function, mapping all of it as requested.

However, you can make a single GetStringTypeW call with the CT_CTYPE3 flag, then walk through the WORD array it fills in and use that data to figure out what to convert and how. Something like the following hastily put together console app, which will run just fine with a Japanese system locale (with any other system locale you may see the odd question mark in the output, and would have to run under the debugger to prove to yourself that everything works):

#define _UNICODE
#define UNICODE
#include <stdio.h>
#include <windows.h>

void wmain(int argc, wchar_t *argv[ ]) {
    if(argc != 2) {
        // Some kind of usage message might be nice here
    } else {
        int cch = lstrlenW(argv[1]);
        WORD * lpCharType = (WORD *)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, cch * sizeof(WORD));
        if(NULL != lpCharType) {
            if(GetStringTypeW(CT_CTYPE3, argv[1], cch, lpCharType)) {
                wchar_t * wzResults = (wchar_t *)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (cch + 1) * sizeof(WCHAR));
                if(NULL != wzResults) {
                    int ich;

                    for(ich = 0; ich < cch; ich++) {
                        if((lpCharType[ich] & (C3_KATAKANA | C3_HALFWIDTH)) == (C3_KATAKANA | C3_HALFWIDTH)) {
                            // Half width katakana; since NLS identified it, assume we can convert it
                            LCMapStringW(LOCALE_INVARIANT, LCMAP_FULLWIDTH, &(argv[1][ich]), 1, &wzResults[ich], 1);
                        } else if((lpCharType[ich] & (C3_ALPHA | C3_FULLWIDTH | C3_HIRAGANA | C3_KATAKANA)) == (C3_ALPHA | C3_FULLWIDTH)) {
                            // Full width Alpha that is not Hiragana or Katakana; since NLS identified it, assume we can convert it
                            LCMapStringW(LOCALE_INVARIANT, LCMAP_HALFWIDTH, &(argv[1][ich]), 1, &wzResults[ich], 1);
                        } else {
                            // Just copy over everything else, as is
                            wzResults[ich] = argv[1][ich];
                        }
                    }
                    wprintf(L"Resulting string of size %d is: %ls\n", cch, wzResults);
                    HeapFree(GetProcessHeap(), 0, (LPVOID)wzResults);
                }
            }
            HeapFree(GetProcessHeap(), 0, (LPVOID)lpCharType);
        }
    }
}

You can pick an interesting test string like QケｹQけ, which is U+ff31 U+30b1 U+ff79 U+0051 U+3051 or:

FULLWIDTH LATIN CAPITAL LETTER Q
KATAKANA LETTER KE
HALFWIDTH KATAKANA LETTER KE
LATIN CAPITAL LETTER Q
HIRAGANA LETTER KE

Now the small dance with C3_ALPHA, C3_HIRAGANA, and C3_KATAKANA is necessary for the reasons I mentioned in "Is Kana 'alphabetic'? Depends on who you ask....."

And of course you may have to decide what you want to do with numbers and/or punctuation (this code as written will just copy them as they are).
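If you did decide to narrow the fullwidth digits and punctuation too, one possible approach is plain code point arithmetic rather than another NLS call. The following is only a minimal sketch under that assumption: the fullwidth forms of printable ASCII sit at U+FF01 through U+FF5E, exactly 0xFEE0 above their ASCII counterparts. The helper names are hypothetical and not part of any Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not an NLS function: maps the fullwidth forms of
   printable ASCII (U+FF01..U+FF5E) to U+0021..U+007E by subtracting the
   fixed offset 0xFEE0. Every other character is returned unchanged. */
static wchar_t narrow_fullwidth_ascii(wchar_t ch) {
    if (ch >= 0xFF01 && ch <= 0xFF5E) {
        return (wchar_t)(ch - 0xFEE0);
    }
    return ch;
}

/* Applies the mapping in place to a buffer of cch characters. */
static void narrow_fullwidth_ascii_in_place(wchar_t *text, size_t cch) {
    size_t i;
    assert(text != NULL || cch == 0);
    for (i = 0; i < cch; i++) {
        text[i] = narrow_fullwidth_ascii(text[i]);
    }
}
```

Note that this range also covers the fullwidth Latin letters, so it overlaps with the second branch of the loop above. It deliberately leaves alone anything outside U+FF01..U+FF5E, such as the ideographic space (U+3000) and the fullwidth yen sign (U+FFE5), which need a separate decision.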

But you get the idea.

In theory, if you have multiple characters of the same type in a row you could try to call LCMapString with a larger string, but in practice the time to do all that checking may not be worth the effort. You can play with it and see what you think (though if you want to provide a sample that proves such an optimization, be sure to include the profile numbers that prove it's faster!).
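For anyone who wants to play with it, the "same type in a row" check could be sketched roughly as below. This is not a tested optimization, just one way to find the runs in the WORD array that GetStringTypeW fills in. The X3_* constants are local stand-ins whose values are meant to mirror the C3_* bits in winnls.h, and action_t, classify, and run_length are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the CT_CTYPE3 bits tested in the sample. The values are
   meant to mirror winnls.h; on Windows, include <windows.h> and use the real
   C3_* constants instead. */
enum {
    X3_KATAKANA  = 0x0010,
    X3_HIRAGANA  = 0x0020,
    X3_HALFWIDTH = 0x0040,
    X3_FULLWIDTH = 0x0080,
    X3_ALPHA     = 0x8000
};

/* Hypothetical action codes: what to do with a character. */
typedef enum { ACT_COPY, ACT_WIDEN, ACT_NARROW } action_t;

/* Same two tests as the per-character loop in the sample. */
static action_t classify(unsigned short ctype3) {
    if ((ctype3 & (X3_KATAKANA | X3_HALFWIDTH)) == (X3_KATAKANA | X3_HALFWIDTH)) {
        return ACT_WIDEN;   /* halfwidth katakana */
    }
    if ((ctype3 & (X3_ALPHA | X3_FULLWIDTH | X3_HIRAGANA | X3_KATAKANA))
            == (X3_ALPHA | X3_FULLWIDTH)) {
        return ACT_NARROW;  /* fullwidth alpha that is not kana */
    }
    return ACT_COPY;
}

/* Returns the number of consecutive characters, starting at index start, that
   share one action, and stores that action in *action. Returns 0 (and leaves
   *action untouched) when start is at or past the end of the array. */
static size_t run_length(const unsigned short *types, size_t cch,
                         size_t start, action_t *action) {
    size_t end;
    assert(types != NULL && action != NULL);
    if (start >= cch) {
        return 0;
    }
    *action = classify(types[start]);
    end = start + 1;
    while (end < cch && classify(types[end]) == *action) {
        end++;
    }
    return end - start;
}
```

Each ACT_WIDEN or ACT_NARROW run could then go to LCMapStringW in one call, with the run length as cchSrc. The mapped run is not guaranteed to be the same length as the source, so a version built on this would probably want to size the destination first (calling with cchDest set to 0 returns the required size) rather than writing into a buffer indexed by source position.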

 

This post brought to you by "ｹ" (U+ff79, a.k.a. HALFWIDTH KATAKANA LETTER KE)


# Haali on 20 Oct 2006 1:12 PM:

A minor problem: PSDK documentation doesn't list HEAP_ZERO_MEMORY as a valid flag for HeapFree().

# Michael S. Kaplan on 20 Oct 2006 4:31 PM:

I think I am just always in the habit of passing the same flags to both functions! :-)

But you are right, it is not very meaningful there, unless maybe it zeroes out the memory as a part of the free as a security thing?

In any case, fixed now....

# Mihai on 20 Oct 2006 10:27 PM:

Interview question: what is wrong with the code above?

Hint: try ﾍﾟﾍﾞ <U+FF8D U+FF9F U+FF8D U+FF9E>

# Michael S. Kaplan on 20 Oct 2006 10:40 PM:

Hmmm...works just fine for me;

  HALFWIDTH KATAKANA LETTER HE
  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
  HALFWIDTH KATAKANA LETTER HE
  HALFWIDTH KATAKANA VOICED SOUND MARK

becomes

  KATAKANA LETTER HE
  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
  KATAKANA LETTER HE
  KATAKANA-HIRAGANA VOICED SOUND MARK

just as I would expect?

# Michael S. Kaplan on 20 Oct 2006 10:43 PM:

Or

ﾍﾟﾍﾞ  ----->   ヘ゜ヘ゛

if you want characters, and

U+ff8d U+ff9f U+ff8d U+ff9e  -->  U+30d8 U+309c U+30d8 U+309b

if you want code points....

# Mihai on 21 Oct 2006 8:38 PM:

You should get ペベ <U+30DA U+30D9>

This is what you get if you call LCMapString on the full buffer, instead of doing it one code point at a time.

One of the basic i18n things: process full strings, not one character/code point at a time :-)

Looks like you are having a bad day :-)

# Michael S. Kaplan on 21 Oct 2006 8:43 PM:

Ah, but that doesn't meet the initial requirements of the halfwidth Latin mixed in, unless you add the logic to separate the string into runs, handling each "run" separately....

Did you have an algorithm you wanted to share for that part? :-)

# Mihai on 21 Oct 2006 10:46 PM:

<<unless you add the logic to separate the string into runs, handling each "run" separately....>>

Now, that's an interview answer :-)

I might have something, but not ready to publish. Because I have to find it first :-)

It is something I wrote many, many years ago, to convert software glossaries from narrow to wide. It was when the Windows UI moved from narrow (Win 3.x) to wide (Win 95).

Only 10 years :-)

In fact, it is probably better if I just rewrite it. This way I can also be sure it is legal :-)

# Michael S. Kaplan on 22 Oct 2006 1:53 AM:

One could also go the other direction and just convert the whole string to full width and then convert the alpha that is not kana to half width, though that would mean two passes across the string by NLS and one for the user....

# Mihai on 22 Oct 2006 10:55 PM:

I have also thought about it.

But it should be a bit more than "the alpha that is not kana," because wide $, wide # (and all the other wide Latin stuff that is in the FF01-FF5E range and is not alpha) should also be converted to narrow.

Then we have another problem at FFE0, with the question "what about the wide Yen?" which is a problem in general :-)

# Michael S. Kaplan on 23 Oct 2006 1:02 AM:

Or perhaps it needs to be context sensitive -- wide when next to the Kana and narrow when not?

# Mihai on 23 Oct 2006 12:03 PM:

<<wide when next to the Kana and narrow when not>>

And now we have to see how to define "next to the Kana" :-)

What if they are in between Kana and Latin? :-)

This shows how an apparently simple problem turns out to be quite complex. Good interview question, no doubt.

