Round trip calls do not always go both ways

by Michael S. Kaplan, published on 2005/10/17 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/17/481627.aspx


A few days ago, someone with the handle of MorningSunShine asked in the newsgroups:

I used WideCharToMultiByte() a lot to convert unicode characters to multibyte chars back and forth. However, when the unicode is between 0x3400 and 0x4dff, it failed to do its job. Why?
-- 
Wind, Forest, Fire, Mountain.

The range in question is a superset of the currently encoded characters in CJK Unified Ideographs Extension A (I say currently encoded because as of Unicode 4.1 only the code points between 0x3400 and 0x4dbf have assigned ideographs).

So the answer to MorningSunShine's post is in two parts:

1) The code points between U+4dc0 and U+4dff contain the Yijing Hexagram Symbols, which are not on any legacy code page, so of course attempting to roundtrip through successive WideCharToMultiByte and MultiByteToWideChar calls will not work.

2) The code points between U+3400 and U+4dbf are encoded as CJK Extension A, which was added to Unicode after all of the various legacy code pages were defined. As I pointed out in this post, we do not change code pages any more; we have learned our lesson on that one.

So unless you are trying to roundtrip through UTF-8 or GB-18030 (or UTF-32/UTF-16BE on the .NET Framework!), there is no way you could ever really hope to have the characters survive an attempt to roundtrip through one of the other code pages....
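To make that concrete, here is a rough sketch in Python rather than Win32. It assumes that Python's built-in codecs (cp932, cp936, cp949, cp950, cp1252) are close enough stand-ins for the Windows code pages of the same numbers, which is not exactly true (Windows has its own best-fit behavior), and the helper name survives_round_trip is made up for illustration. It encodes one Extension A ideograph and one Yijing hexagram with "?" substitution, decodes the bytes again, and checks whether the original character came back.

```python
# Hypothetical helper: Python codec names stand in for Windows code pages.
# This approximates WideCharToMultiByte followed by MultiByteToWideChar;
# it is not the Win32 behavior itself.
def survives_round_trip(text: str, codec: str) -> bool:
    """Encode with '?' substitution, decode again, compare with the input."""
    encoded = text.encode(codec, errors="replace")
    return encoded.decode(codec, errors="replace") == text


SAMPLES = {
    "U+3400 (CJK Extension A)": "\u3400",
    "U+4DC0 (Yijing hexagram)": "\u4dc0",
}

# Legacy code pages first, then the encodings that cover all of Unicode.
CODECS = [
    "cp932", "cp936", "cp949", "cp950", "cp1252",
    "utf-8", "gb18030", "utf-16-be", "utf-32",
]


def main() -> None:
    for label, ch in SAMPLES.items():
        for codec in CODECS:
            verdict = "ok" if survives_round_trip(ch, codec) else "lost"
            print(f"{label:26} {codec:10} {verdict}")


if __name__ == "__main__":
    main()
```

Under those assumptions, the legacy code pages should all report "lost" for both characters, while UTF-8, GB18030, UTF-16BE and UTF-32 should report "ok".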

Special thanks to Andrew West for reminding me about the Yijing Hexagram Symbols, and inspiring the correction above in point #1!

 

This post sponsored by "?" (U+003f, a.k.a. QUESTION MARK)
The character that appears for almost all code pages when you try to convert from Unicode into them and the character does not exist....
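One consequence of that "?" is that you cannot tell from the bytes alone whether a question mark was in the original text or was substituted. WideCharToMultiByte reports this through its lpUsedDefaultChar parameter; the sketch below imitates the idea in Python, with a made-up helper name (encode_checked) and the same caveat that Python's codecs only approximate the Windows code pages.

```python
from typing import Tuple


# Hypothetical helper, loosely modeled on the lpUsedDefaultChar idea:
# try a strict conversion first and fall back to '?' substitution,
# reporting whether the fallback was needed.
def encode_checked(text: str, codec: str) -> Tuple[bytes, bool]:
    """Return (encoded bytes, True if any character was replaced by '?')."""
    try:
        return text.encode(codec, errors="strict"), False
    except UnicodeEncodeError:
        return text.encode(codec, errors="replace"), True


if __name__ == "__main__":
    # A real question mark and a substituted one produce the same byte;
    # only the flag tells them apart.
    print(encode_checked("?", "cp1252"))
    print(encode_checked("\u4dc0", "cp1252"))
```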


# Andrew West on 17 Oct 2005 7:45 AM:

"The code units between U+4dc0 and U+4dff are not currently encoded".

Have you forgotten the Yijing Hexagram Symbols that have occupied U+4DC0..U+4DFF since Unicode 4.0 ? Of course, there will be no round-tripping for these characters as they are not in any of the legacy encodings (at least none that I know of).

# Michael S. Kaplan on 17 Oct 2005 9:52 AM:

Yikes, I did! Thanks for the correction, I will post it shortly....
