Don't look directly at the 951 code page if you can avoid it

by Michael S. Kaplan, published on 2007/09/27 03:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/27/5132462.aspx


K. M. Leung asks over in the Suggestion Box:

Big5 Unicode conversion in .net 2.0.

I have read your article "Kowloon 951" and Ji Cheng's question. I know that there are a couple of ways to twist .net 2.0's encoding, such as changing the EncoderFallback. What if I am using BizTalk 2006's flatfile PipeLine encoding?

From the latest Unicode version, Big5 9563 should be converted to UTF16 8137 (As Cheung said, it is the case with Framework 1.1 and HKSCS). As you have said, .net 2.0 came with its own encoding tables. Can we say that it is a bug for .net 2.0 to convert Big5 9563 to UTF16 E77F and Microsoft should provide a fix? Below is the code that you can run under (1.1 + HKSCS) and (2.0)

           using System;
           using System.Globalization;
           using System.Text;

           class Big5Test {
               static void Main() {
                   // Big5 lead byte and trail byte, written as hex
                   string tsource = "95 63";
                   string[] ahex = tsource.Split(' ');
                   byte[] source = new byte[ahex.Length];

                   for (int i = 0; i < ahex.Length; i++) {
                       source[i] = byte.Parse(ahex[i], NumberStyles.HexNumber);
                   }

                   // Decode the bytes as Big5, then re-encode them as big-endian UTF-16
                   byte[] target = Encoding.Convert(Encoding.GetEncoding("Big5"), Encoding.BigEndianUnicode, source);
                   StringBuilder tTarget = new StringBuilder();

                   foreach (byte bb in target) {
                       // "X2" keeps the leading zero on bytes below 0x10
                       tTarget.Append(bb.ToString("X2")).Append(' ');
                   }

                   Console.WriteLine("Big5: " + tsource + " is converted to UTF16: " + tTarget);
                   Console.ReadLine();
               }
           }

(The Kowloon 951 post can be found here.) 

This is actually by design.

The whole "code page 951" approach to HKSCS was just a hack, and not at all intended to be the way that HKSCS should be supported in Windows. The real HKSCS support needs (and has) its own solution, one that does not involve returning results that are not in the original code page 950. And code page 950 treats the code point in question as part of the private use areas reserved for EUDC: 0x9563 on the Big5 side and U+E77F on the Unicode side.
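If you want to see whether text you decoded landed in that private use territory, something like the following would do. It is only a sketch: the class and method names are made up for this example, and it only looks at the BMP private use area (U+E000 to U+F8FF), which is where U+E77F lives.

```csharp
using System;
using System.Globalization;

static class PrivateUseCheck
{
    // Hypothetical helper: true if any char in the string is a BMP
    // private use character (Unicode general category Co).
    public static bool ContainsPrivateUse(string text)
    {
        foreach (char c in text)
        {
            if (char.GetUnicodeCategory(c) == UnicodeCategory.PrivateUse)
                return true;
        }
        return false;
    }

    static void Main()
    {
        // U+E77F is the result the question reports for .NET 2.0
        Console.WriteLine(ContainsPrivateUse("\uE77F")); // True
        // U+8137 is the HKSCS mapping the question expected
        Console.WriteLine(ContainsPrivateUse("\u8137")); // False
    }
}
```

A true result after decoding Big5 input is a hint that the bytes fell in the user-defined range of code page 950 rather than on a character the code page actually knows about.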

It was really a short-term mistake in Windows, one that .NET 1.0 and 1.1 inherited by accident because all of their code page support came from Windows.

I would definitely recommend against trying to use the Encoding support in .NET >= 2.0 to munge any code page....
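One alternative that leaves the Encoding alone is to post-process the decoded string with an explicit table. To be clear, this is an illustration and not the HKSCS support that Windows ships: the names are hypothetical, and the table holds only the single pair from the question (U+E77F to U+8137). A real table would have to come from the published HKSCS mapping data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class EudcRemap
{
    // Hypothetical table. Only the one pair mentioned in the question
    // is filled in; everything else passes through unchanged.
    static readonly Dictionary<char, char> Table = new Dictionary<char, char>
    {
        { '\uE77F', '\u8137' },
    };

    // Replaces private use characters that have a table entry and
    // copies every other character as-is.
    public static string Remap(string decoded)
    {
        StringBuilder sb = new StringBuilder(decoded.Length);
        foreach (char c in decoded)
        {
            char mapped;
            sb.Append(Table.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        string fixedUp = Remap("\uE77F");
        Console.WriteLine(((int)fixedUp[0]).ToString("X4")); // 8137
    }
}
```

The point of doing it this way is that the code page conversion stays exactly what code page 950 says it is, and the HKSCS-specific knowledge sits in a separate, visible step.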

On a not entirely unrelated note (and perhaps partially in recognition of issues I pointed out in posts like this one!), just this last month the Chinese Language Interface Advisory Committee (CLIAC) made it clear that in the future they intend to assign code points to HKSCS only when the characters have been assigned in Unicode (you can read about this in the NOTICE entitled Revised Principles for the Inclusion of Characters in the HKSCS, or you can read the 通告 in Chinese entitled 修訂《香港增補字符集》字符增收原則).

 

This post brought to you by 脷 (U+8137, a CJK Unified Ideograph that appears to be in neither the Taiwan CNS-11643 standard nor Microsoft's Big5 code page)


