by Michael S. Kaplan, published on 2010/08/22 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/08/22/10052904.aspx
Sometimes it depends on your point of view.
Gaurav's question was:
Hi,
We have a question related to the System.Text.Encoding.GetEncoding() API. Encoding.GetEncoding().WebName returns the following values:
Encoding.GetEncoding(1200).WebName // 1200 represents UTF16 Little Endian
"utf-16"
Encoding.GetEncoding(1201).WebName // 1201 represents UTF16 Big Endian
"utf-16BE"
Thus the unmarked representation (“utf-16”) is interpreted as Little Endian.
But according to RFC 2781, in the excerpt pasted below, it looks like the unmarked representation should be interpreted as Big Endian in the absence of a BOM.
4.3 Interpreting text labelled as UTF-16
Text labelled with the "UTF-16" charset might be serialized in either
big-endian or little-endian order. If the first two octets of the
text is 0xFE followed by 0xFF, then the text can be interpreted as
being big-endian. If the first two octets of the text is 0xFF
followed by 0xFE, then the text can be interpreted as being little-
endian. If the first two octets of the text is not 0xFE followed by
0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
interpreted as being big-endian.
Is there a bug in the API, or is our understanding wrong?
Thanks,
Gaurav
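The rule in the quoted section 4.3 is small enough to sketch in code. The Python below is only an illustration of the RFC text, not of how .NET behaves. The function name `decode_rfc2781_utf16` is hypothetical, and stripping the BOM before decoding is an assumption of this sketch; it relies on Python's standard `utf-16-be` and `utf-16-le` codecs.

```python
def decode_rfc2781_utf16(data: bytes) -> str:
    """Decode bytes labelled "UTF-16" per the rule quoted from RFC 2781, section 4.3.

    Hypothetical helper for illustration only.
    """
    if data[:2] == b"\xfe\xff":
        # 0xFE 0xFF: big-endian. Dropping the BOM is this sketch's assumption.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # 0xFF 0xFE: little-endian.
        return data[2:].decode("utf-16-le")
    # No BOM: the text SHOULD be interpreted as big-endian.
    return data.decode("utf-16-be")


# "Hi" serialized little-endian with no BOM: bytes 48 00 69 00.
unmarked_le = "Hi".encode("utf-16-le")

# Read under the RFC's big-endian default, those bytes become U+4800 U+6900.
misread = decode_rfc2781_utf16(unmarked_le)
```

Under this rule, a little-endian stream written without a BOM is misread rather than rejected, which is why the choice of default matters to anyone exchanging unmarked "utf-16" data.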
As I said, it really depends on your point of view.
The word SHOULD in standards is an interesting one though.
In truth, the implied text when you see SHOULD is SHOULD, UNLESS YOU HAVE A REALLY, REALLY, REALLY GOOD REASON NOT TO.
Now consider the case of Microsoft, which really is a Little Endian UTF-16 shop through and through, and that is its really good reason. So much so that you get other weird stuff happening, like what I mentioned in "unicodeFFFE... is Microsoft off its rocker?".
In fact, it only gets truly weird when .NET is running on a platform where UTF-16 little endian may not be a sensible assumption, but such cases are mercifully few.
Or maybe you have such a case in front of you? :-)
Karellen on 22 Aug 2010 8:37 AM:
"Microsoft, which really is a Little Endian UTF-16 shop through and through."
Yes, and that's fine if you assume that Microsoft products only talk to other Microsoft products. Wow, it's a good job that Microsoft products don't have to talk to any other products, from any other vendors, on any other OSs, on any other architectures, on some kind of vast, global, heterogeneous internetwork of computers, isn't it.
Wait, what? Is MS still living in 1990 or something? Seriously?
Michael S. Kaplan on 22 Aug 2010 4:51 PM:
Microsoft includes ways to talk and interoperate with those other products and platforms -- and it also has a huge user base that has neither the interest nor the need for the additional complication....
Cheong on 22 Aug 2010 6:38 PM:
Btw, on platforms such as Linux, the Unicode encoding usually used is UTF-8 (which doesn't have the LE/BE issue), so just translate your "text to send to interop" to UTF-8 first and you'll be fine.
I'd think the same goes for BSD, Mac or other *nix systems.
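The point about UTF-8 can be checked in a few lines. This is an illustrative Python sketch with arbitrary variable names: UTF-16 has two possible byte serializations for the same text, while UTF-8 has only one.

```python
text = "Hi"

# UTF-16 serializes each 16-bit code unit in one of two byte orders.
utf16_le = text.encode("utf-16-le")  # 48 00 69 00
utf16_be = text.encode("utf-16-be")  # 00 48 00 69

# UTF-8 is defined as a sequence of single bytes, so there is no order to choose.
utf8 = text.encode("utf-8")          # 48 69
```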
Dan Bishop on 23 Aug 2010 5:27 PM:
Why not just name cp1200 "utf-16LE" and remove any doubt about the byte order?
Michael S. Kaplan on 23 Aug 2010 11:40 PM:
You mean change it and break anyone relying on the old result? I think the answer is in the question.