If it does [best] fit, it may be off a bit! (aka Parlez-ゔ japonais?)

by Michael S. Kaplan, published on 2007/11/22 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/11/22/6463359.aspx


So just yesterday, Kelvin Houghton had an excellent question:

Hello All.

I have a strange issue I would like your help on please.  In a C# app if I have the line

    Console.WriteLine("\u3094");

I would expect to see the output of character ゔ. But it instead outputs u30F4, which would look like this character: ヴ

My question is why does u30F4 get displayed when I told it to display u3094?

Thanks
Kelvin

Like I said, an excellent question!

Globalization ace Garrett McGowan had the answer pretty quickly:

This is because there’s no equivalent character in the Japanese Windows code page (cp932). It’s returning the closest equivalent, the katakana form of the symbol.

A fact that you can confirm by looking at the "best fit" tables from Microsoft that are publicly available and hosted on the Unicode site here, with an excerpt below:

0x3091  0x82ef    ;ゑ Hiragana We
0x3092  0x82f0    ;を Hiragana Wo
0x3093  0x82f1    ;ん Hiragana N
0x3094  0x8394    ;ヴ Hiragana Vu (add best-fit 2/1/96)
0x309b  0x814a    ;゛ Katakana-Hiragana Voiced Sound Mark
0x309c  0x814b    ;゜ Katakana-Hiragana Semi-Voiced Sound Mark
0x309d  0x8154    ;ゝ Hiragana Iteration Mark
0x309e  0x8155    ;ゞ Hiragana Voiced Iteration Mark
0x30a1  0x8340    ;ァ Katakana Small A
0x30a2  0x8341    ;ア Katakana A

As a bonus, everyone can reflect on the power of comments that probably no one realized were there. :-)
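To make the lookup concrete, here is a minimal sketch in Python (chosen only as a neutral scripting language; it is not what .NET does internally). It parses a few rows copied from the excerpt above and uses them as a fallback when a strict cp932 encode fails. The names `parse_best_fit` and `cp932_bytes` are hypothetical helpers invented for this illustration, and the sketch assumes Python's `cp932` codec applies no best-fit mappings of its own.

```python
# Rows copied from the best-fit excerpt quoted above (Unicode -> cp932 value).
BEST_FIT_EXCERPT = """\
0x3093  0x82f1    ;ん Hiragana N
0x3094  0x8394    ;ヴ Hiragana Vu (add best-fit 2/1/96)
0x309b  0x814a    ;゛ Katakana-Hiragana Voiced Sound Mark
"""


def parse_best_fit(text):
    """Return {unicode_code_point: (cp932_value, comment)} for each table row."""
    table = {}
    for line in text.splitlines():
        mapping, _, comment = line.partition(";")
        fields = mapping.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        table[int(fields[0], 16)] = (int(fields[1], 16), comment.strip())
    return table


def cp932_bytes(ch, table):
    """Encode one character to cp932, falling back to the best-fit table."""
    try:
        return ch.encode("cp932")  # strict encode: no substitution happens here
    except UnicodeEncodeError:
        value, _ = table[ord(ch)]  # KeyError if the excerpt has no entry
        return value.to_bytes(2, "big")


table = parse_best_fit(BEST_FIT_EXCERPT)
encoded = cp932_bytes("\u3094", table)
# Show the bytes chosen for U+3094 and the character they decode back to.
print(encoded.hex(), encoded.decode("cp932"))
```

The table row for 0x3094 points at 0x8394, which is the cp932 encoding of the katakana ヴ, so the hiragana form is gone once the text has passed through the code page.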

IPE expert Paul Chavez made an interesting point about the missing-ness of this character in shift-JIS:

Makes sense, as ヴ is only used for purely phonetic writing of “foreign sounds”. Although Hiragana is phonetic also, it is not used when writing “foreign sounds”.

I know as my family name is spelled チャヴェズ.

And as a closing bit, there is a nice little note in the hiragana entry on everything2.com:

It should be mentioned that there are several obsolete kana that are rarely used today, in both hiragana and katakana. In hiragana:

ゑ - the "we" hiragana
ゐ - the "wi" hiragana
ゔ - the "vu" hiragana

All of these kana are considered obsolete, and exist only for use in transcribing older documents. In cases where the "vu" hiragana is used, the still in use katakana "vu" is placed instead, and when formed into another syllable, a smaller kana vowel is paired with it.

When all of this info had been passed about, Kelvin did have one more question to ask:

Just another question this raises.  If you use charmap.exe or wordpad.exe it does display the characters correctly – how are they able to do that?  Trying to fully understand as we have ######1 who is trying to localize a file name that uses u3094 and that is what displays incorrectly.

Now that does boil down to the basic Unicode vs. not issue -- and in particular the Console.WriteLine behavior is explained in "Sometimes, the shortcuts give better AND faster results". It is a .NET limitation that can be worked around (when necessary) with WriteConsoleW, and an ISV limitation that can be worked around (when necessary) by converting to Unicode!
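The "Unicode vs. not" difference can be sketched as two round trips, again in Python as a neutral illustration rather than a model of any particular Windows API. One path keeps the text in UTF-16, the form that Unicode APIs such as WriteConsoleW accept; the other pushes it through cp932 after applying the single best-fit entry from the table excerpt. The file name and both helper functions are hypothetical.

```python
# One best-fit entry taken from the table excerpt: hiragana vu -> katakana vu.
BEST_FIT = {"\u3094": "\u30f4"}


def via_code_page(text, codec="cp932"):
    """Round-trip text through a legacy code page, applying best fit first."""
    fitted = "".join(BEST_FIT.get(ch, ch) for ch in text)
    return fitted.encode(codec).decode(codec)


def via_utf16(text):
    """Round-trip text through UTF-16, which can represent every character."""
    return text.encode("utf-16-le").decode("utf-16-le")


name = "\u3094.txt"  # hypothetical file name containing U+3094

print(via_utf16(name) == name)      # the Unicode path preserves the name
print(via_code_page(name) == name)  # the code page path changes it
```

A tool that stays in Unicode end to end has nothing to substitute, which fits with charmap.exe and wordpad.exe showing the character correctly.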

 

1 - Third party independent software vendor name removed just for the hell of it, since this is mostly a .NET issue anyway, not an ISV bug.

 

This post brought to you by ゔ (U+3094, a.k.a. HIRAGANA LETTER VU)


khiara on 16 Feb 2009 7:04 PM:

allright it doesnt give u much information!!!


referenced by

2012/02/20 Where short file names can fail

2008/05/08 In hindsight, they may have BEST FIT these files where the sun never shines
