...from the Microsoft point of view

by Michael S. Kaplan, published on 2010/08/22 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/08/22/10052904.aspx

We have a question related to the System.Text.Encoding.GetEncoding() API. Encoding.GetEncoding().WebName returns the following values:

Encoding.GetEncoding(1200).WebName // 1200 represents UTF16 Little Endian
"utf-16"
Encoding.GetEncoding(1201).WebName // 1201 represents UTF16 Big Endian
"utf-16BE"

Thus the unmarked representation (“utf-16”) is interpreted as Little Endian.
But according to RFC 2781, in the excerpt pasted below, it looks like the unmarked representation should be interpreted as Big Endian in the absence of a BOM.
4.3 Interpreting text labelled as UTF-16
   Text labelled with the "UTF-16" charset might be serialized in either
   big-endian or little-endian order. If the first two octets of the
   text is 0xFE followed by 0xFF, then the text can be interpreted as
   being big-endian. If the first two octets of the text is 0xFF
   followed by 0xFE, then the text can be interpreted as being little-
   endian. If the first two octets of the text is not 0xFE followed by
   0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
   interpreted as being big-endian.
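
As a rough illustration of the rule in section 4.3, here is a small sketch in Python (not .NET, and not a description of how System.Text.Encoding is implemented). It checks for a byte order mark and otherwise falls back to an assumed byte order. The function name decode_utf16_labelled and its default parameter are made up for this example; the RFC's SHOULD corresponds to the "utf-16-be" default, while passing "utf-16-le" mimics a little-endian assumption.

```python
def decode_utf16_labelled(data: bytes, default: str = "utf-16-be") -> str:
    """Decode bytes labelled "UTF-16" along the lines of RFC 2781 section 4.3.

    `default` is the byte order assumed when no BOM is present. It is a
    hypothetical knob for this sketch; the RFC's SHOULD is "utf-16-be".
    """
    if data[:2] == b"\xfe\xff":
        # 0xFE 0xFF: big-endian BOM. Skip it and decode the rest.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # 0xFF 0xFE: little-endian BOM. Skip it and decode the rest.
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to the assumed byte order.
    return data.decode(default)


if __name__ == "__main__":
    unmarked = b"\x00A"  # the two octets 0x00 0x41, with no BOM
    print(repr(decode_utf16_labelled(unmarked)))               # RFC default
    print(repr(decode_utf16_labelled(unmarked, "utf-16-le")))  # LE assumption
```

The same two unmarked octets decode to "A" under the RFC's big-endian default and to the single character U+4100 under a little-endian assumption, which is exactly why the default matters when no BOM is present.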

In truth, the implied text when you see SHOULD is SHOULD, UNLESS YOU HAVE A REALLY, REALLY, REALLY GOOD REASON NOT TO.

Now consider the case of Microsoft, which really is a Little Endian UTF-16 shop through and through. So much so that you get other weird stuff happening, like what I mentioned in unicodeFFFE... is Microsoft off its rocker?

In fact, it only gets truly weird in the cases where .Net is running on a platform where UTF-16 little endian may not be a sensible assumption, but thankfully such cases are few.

"Microsoft, which really is a Little Endian UTF-16 shop through and through."

Yes, and that's fine if you assume that Microsoft products only talk to other Microsoft products. Wow, it's a good job that Microsoft products don't have to talk to any other products, from any other vendors, on any other OSs, on any other architectures, on some kind of vast, global, heterogeneous internetwork of computers, isn't it.

Wait, what? Is MS still living in 1990 or something? Seriously?

Microsoft includes ways to talk to and interoperate with those other products and platforms, and it also has a huge user base that has neither the interest in nor the need for the additional complication....

Btw, on platforms such as Linux, the Unicode encoding usually used is UTF-8 (which doesn't have the LE/BE issue), so just convert your "text to send to interop" to UTF-8 first and you'll be fine.

I'd think the same goes for BSD, Mac or other *nix systems.
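
To make the suggestion above concrete, here is a small Python sketch (Python is used for illustration only, and the helper name to_interop_bytes is hypothetical). It shows that the two UTF-16 serializations of the same text differ, while UTF-8 has a single byte sequence regardless of the machine's byte order.

```python
def to_interop_bytes(text: str) -> bytes:
    """Hypothetical helper: serialize text as UTF-8 before sending it out."""
    return text.encode("utf-8")


if __name__ == "__main__":
    text = "Hi \u20ac"  # "Hi " followed by the euro sign

    utf16_le = text.encode("utf-16-le")
    utf16_be = text.encode("utf-16-be")
    # Same text, two different byte sequences depending on byte order.
    print(utf16_le.hex())
    print(utf16_be.hex())

    # Reading little-endian bytes as big-endian does not give the text back.
    print(utf16_le.decode("utf-16-be") == text)

    # UTF-8 is defined as a byte sequence, so there is only one form to send.
    payload = to_interop_bytes(text)
    print(payload.hex())
    print(payload.decode("utf-8") == text)
```

The receiving side still has to know the payload is UTF-8, but it no longer has to guess a byte order.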

Why not just name cp1200 "utf-16LE" and remove any doubt about the byte order?

You mean change it and break anyone relying on the old result? I think the answer is in the question.
