The default character is not always the question mark

by Michael S. Kaplan, published on 2005/12/03 21:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/03/499661.aspx


It seems like common sense, but every once in a while it is important to remember: Unicode is big.

Really big.

And not just in a simple Ed Sullivan really big standard sense.

Or even in a Morrissey some girls encodings are bigger than other girls encodings sense.

 

I mean in the literal sense that the only code pages that are as big as Unicode are the ones that are actually tied to Unicode like GB-18030.

Because of that size, it makes sense to plan for the fact that any time you convert from Unicode to some other code page via WideCharToMultiByte, you might find that one or more of the characters are not there. And that plan is easy: every code page has a default character which is inserted where there was no other character to insert.

Plus you also need a character to insert when converting in the other direction (via MultiByteToWideChar) for times when the data makes no sense (like a byte that was not on that source code page!).

For almost all code pages, the default character is a question mark (?): good old 0x3f on all of the Windows 'ANSI' code pages and U+003f in Unicode. In most cases that is a great way to communicate, in a single character, the message of the Microsoft Windows Jeff Spicoli code page surfer who is trying to say "Huh? Dude, I have no idea what you're talking about."
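
To see a default character in action without calling WideCharToMultiByte, here is a minimal sketch in Python. It relies on Python's errors="replace" mode, which substitutes a question mark for anything the target code page lacks. The helper name to_code_page is made up for illustration, and Python's codec tables are only a stand-in for the Windows ones (Windows may also apply best-fit mappings, which this sketch does not attempt).

```python
def to_code_page(text: str, code_page: str) -> bytes:
    """Encode text, substituting '?' for characters the code page lacks."""
    # errors="replace" is Python's analogue of a default character on the
    # way out of Unicode; it is not the Windows API itself.
    return text.encode(code_page, errors="replace")


if __name__ == "__main__":
    # Greek capital omega is not on Windows code page 1252.
    print(to_code_page("\u03a9mega", "cp1252"))
    # Neither is any katakana.
    print(to_code_page("\u30ad\u30f3\u30b0", "cp1252"))
```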

Perhaps for Greek the semicolon would have been a better choice since that is the preferred marker used at the end of a sentence that is a question, but Microsoft did not go in that direction for Windows code page 1253. I asked Cathy about this once out of curiosity, and her response was a very serious and unflinching "we're not changing that, Michael", which doesn't say whether it might have been a good idea ten years ago or not -- probably not!

I tend to forget about the exception to the rule that the question mark is the default character, but I remembered it the other day when I got the following e-mail from someone going by the handle of misterhektik:

I was hoping you could shed some light on a problem I'm having.  I play an online international game, and am making a program in C# to parse the chatlogs from this game.  The chatlogs are encoded in shift-jis.  There is alot of hex garbage produced by the game that is dumped into the chatlogs.  My program aims to remove all the garbage, and succeeds....almost.  For some reason when using C#/.Net the hex combination %EF%27 and %EF%28 are both converted into %30FB.  I don't understand why.  I've written this same program as a perl script and it works without a problem. 

My question is, does it have something to do with the encoding?  That is the only logical thing I can think of.  If so, am I using the wrong encoding to read the files?

Thanks.

On Windows code page 932 (Shift-JIS), the default character out of Unicode is still 0x3f, but on the way into Unicode it is (wait for it) U+30fb, also known as KATAKANA MIDDLE DOT.

Now working backwards at 0xef27 and 0xef28, it turns out that the lead byte 0xef on code page 932 is one of those reserved lead bytes I talked about in 100% roundtrip ASCII? 100% roundtrip ANSI? last month. Which means there is no trail byte that combines with it to make a valid character.

And that fits in well with the notion of the 'hex garbage' that misterhektik was referring to.
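
The same check can be sketched outside of Windows. Assuming Python's cp932 codec agrees with Windows that 0xef cannot start a valid character (the two tables are maintained separately, so treat this as an assumption), a strict decode rejects both sequences from the e-mail while accepting the real KATAKANA MIDDLE DOT at 0x8145. The function name is_valid_cp932 is hypothetical.

```python
def is_valid_cp932(data: bytes) -> bool:
    """Return True if data decodes cleanly with Python's cp932 codec."""
    try:
        data.decode("cp932")  # strict mode: raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for sample in (b"\xef\x27", b"\xef\x28", b"\x81\x45"):
        print(sample.hex(), is_valid_cp932(sample))
```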

In the words of the late George Peppard, I love it when a plan comes together.

Now while it is true that WideCharToMultiByte gives you a way to override the default character, it is an unfortunate truth that MultiByteToWideChar does not. So there is no way in Windows to change what that character will be when it is needed.
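
For comparison, here is roughly what an override on the way out of Unicode looks like, sketched in Python rather than through the lpDefaultChar parameter of WideCharToMultiByte. The handler name "underscore" and the choice of "_" as the substitute are made up for illustration.

```python
import codecs


def underscore_handler(exc):
    """Encoder error handler: substitute '_' for each unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # One underscore per character in the failing run, then resume after it.
    return "_" * (exc.end - exc.start), exc.end


# "underscore" is a hypothetical handler name registered for this sketch.
codecs.register_error("underscore", underscore_handler)

encoded = "\u03a9mega".encode("cp1252", errors="underscore")
print(encoded)
```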

I honestly do not know the frequency of usage of the KATAKANA MIDDLE DOT in Japanese text, but if it is anything like the question mark in English then that means it is used just often enough to make replacing it after the conversion annoying....

But since misterhektik mentioned he was using C#, he may have an option if he is using Whidbey (the .NET Framework 2.0), since there you can override the fallback mechanism and customize the handling of this situation (if that is required).
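
The idea behind a custom fallback can be sketched in Python rather than C#. This is an analogy to the .NET mechanism, not its API: a custom decoder error handler turns each undecodable byte into a visible marker, so the garbage can be told apart from a genuine KATAKANA MIDDLE DOT in the text. The handler name "mark_invalid" and the <XX> marker format are assumptions chosen for illustration, and the sketch again assumes Python's cp932 codec rejects the 0xef lead byte.

```python
import codecs


def mark_invalid(exc):
    """Decoder error handler: show undecodable bytes as <XX> hex markers."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Hypothetical marker format; any string that cannot occur in the
    # real data would do.
    marker = "".join(f"<{b:02X}>" for b in bad)
    return marker, exc.end


codecs.register_error("mark_invalid", mark_invalid)

# A real middle dot inside a name, followed by one of the bad sequences.
raw = "\u30c9\u30f3\u30fb\u30de\u30a4\u30af".encode("cp932") + b"\xef\x27"
text = raw.decode("cp932", errors="mark_invalid")
print(text)
```

With the markers in place, stripping the garbage becomes a search for the marker pattern, and the middle dots that were really in the chat log are left alone.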

I will try to remember in the future that the question mark is not always the default character; it only usually is. :-)

 

This post brought to you by "・" (U+30fb, a.k.a. KATAKANA MIDDLE DOT)


# Michael Dunn_ on 3 Dec 2005 10:05 PM:

I mean, you may think it's a long way down the street to the chemist, but that's just _peanuts_ to unicode. ;)

The middle dot is a word separator in katakana, although it's not used all the time. If there are two common katakana words together, and the word boundary is obvious, then the dot isn't used. You'll see it used often in foreign names, you could write mine as: ドン・マイク (don maiku)

# Rosyna on 4 Dec 2005 1:34 AM:

What he said. It's not used very often at all. The only times I am seeing it here (in Japan) is for names in katakana like キング・コング (gah, which I see EVERYWHERE). http://www.kk-movie.jp/top.html

In fact, it is probably used far less than the question mark in english (but maybe more than the question mark in japanese).

# CornedBee on 4 Dec 2005 4:22 AM:

Not a Unicode conversion, but there's one place where unrecognized bytes are almost invariably displayed as a dot: hex editors.

# Michael S. Kaplan on 4 Dec 2005 10:06 AM:

I think they use U+00b7 there in hex editors, rather than KATAKANA MIDDLE DOT. :-)

referenced by

2005/12/09 More on the C4819 error
