Why WC2MB needs a CP, chaver sheli!

by Michael S. Kaplan, published on 2008/04/30 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/04/30/8440675.aspx


Dmitry asked via the Contact link:

Hello Michael!

As far as I understand, some wide char can be in only one of Unicode subranges listed e.g. at
http://msdn2.microsoft.com/en-us/library/ms776439(VS.85).aspx, so its value determines its code page. But then why does WideCharToMultiByte need the codepage parameter?!

The question comes from practice: if I get some wide char string from somewhere with unknown codepage, how do I get an MBCS analog for it? Even more, the string may contain chars from different codepages.

It makes me think that Unicode (or its implementation in WinAPI) is not so universal if Unicode chars are not self-descriptive and need additional information about them.

Thanks.
Dmitry.

Sorry Dmitry, you are confusing apples and earmuffs on this one -- the Unicode Subset Bitfields are not connected to the Code Pages supported by Windows, at all really.

They really serve two different purposes -- the Unicode Subset Bitfields give typography a way to group Unicode characters together based on Unicode subranges. Each character in Unicode is in one and only one of those subranges, and if you look at the definitions of the various Unicode Subset Bitfields, some actually cover more than one of the Unicode subranges. This is very good, since the OS/2 table of the font that contains those same Unicode range bits is running a bit short on space and all. :-)

The Code Pages supported by Windows, on the other hand, are various subsets -- small groupings of characters that each target a particular market or markets. They each define their own mappings to Unicode, and a Unicode character can appear in more than one code page (e.g. all of the characters in ASCII are supported on most of them, and many of the letters in Greek are supported on code pages 1253, 737, 28597, and 932 under the "83" lead byte, among others). So there is no way to know from the character alone what code page to use for the mapping!
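To make the "more than one code page" point concrete, here is a small sketch in Python rather than Win32 C, so it can run anywhere. It assumes that Python's `cp1253`, `cp737`, `iso8859_7` and `cp932` codecs are close enough stand-ins for Windows code pages 1253, 737, 28597 and 932; the helper name `encode_everywhere` is made up for illustration and is not part of any API.

```python
# U+0391 GREEK CAPITAL LETTER ALPHA, the example character from the post.
ALPHA = "\u0391"

# Assumption: these Python codec names approximate the Windows code pages
# with the same numbers (28597 is ISO 8859-7).
CODECS = {1253: "cp1253", 737: "cp737", 28597: "iso8859_7", 932: "cp932"}


def encode_everywhere(ch):
    """Return {code page number: bytes} for each listed codec that can encode ch."""
    result = {}
    for cp, codec in CODECS.items():
        try:
            result[cp] = ch.encode(codec)
        except UnicodeEncodeError:
            # This code page has no mapping for the character; skip it.
            pass
    return result


if __name__ == "__main__":
    for cp, raw in encode_everywhere(ALPHA).items():
        print(cp, raw.hex())
```

The same character comes back as different bytes depending on the code page (and as two bytes, with lead byte 83, on 932), so the value of the character alone cannot pick the target encoding for you.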

Now if one were looking to find out the script(s) of various strings, I have mentioned the GetStringScripts function (Vista and later) and the DownlevelGetStringScripts function (before Vista) in the past in relation to the Mitigation tools for IDN security problems. But that isn't about the Code Pages supported by Windows, or strictly speaking the Unicode Subset Bitfields, either.

Thus the apples vs. earmuffs contrast. :-)

In the middle of Dmitry's note, one line in particular caught my eye:

Even more, the string may contain chars from different codepages.

Too true, that! But that is an argument in favor of the fact that code pages are really not enough -- Unicode is actually doing just fine here!
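A rough illustration of that last point, again as a Python sketch rather than actual WideCharToMultiByte calls: a string mixing Greek, Cyrillic and Hebrew letters does not survive a trip through any single one of the legacy code pages tried here, while a Unicode encoding such as UTF-8 carries all of it. The `errors="replace"` behavior is used only as an analogue of the default-character substitution that a code page conversion performs; the function names are hypothetical.

```python
# Greek capital alpha, Cyrillic small be, Hebrew het: three scripts that
# no single code page among the ones below covers together.
MIXED = "\u0391\u0431\u05d7"


def lossy(text, codec):
    """Encode text, substituting '?' for anything the codec cannot represent."""
    return text.encode(codec, errors="replace")


def round_trips(text, codec):
    """True if text can be encoded with codec and decoded back unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for codec in ("cp1253", "cp1251", "cp1255", "utf-8"):
        print(codec, round_trips(MIXED, codec), lossy(MIXED, codec))
```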

 

This blog brought to you by Α (U+0391, aka GREEK CAPITAL LETTER ALPHA, a letter of several code pages)


Michael S. Kaplan on 30 Apr 2008 4:42 PM:

For the curious, the end of the title is Hebrew for "my friend" and is pronounced chah-vear shell-ee (חבר שלי). So I have that same sound throughout the title with the items ending with that "ee" sound. :-)

