unicodeFFFE... is Microsoft off its rocker?
by Michael S. Kaplan, published on 2005/09/11 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/11/463444.aspx
This is an issue that has been around for a long time.
Back in February (geez, I really have been blogging almost a year now, haven't I?), I explained the difference between Big Endian and Little Endian Unicode. In January I also talked about the Byte Order Mark.
Neither of them is what this post is about.
This post is about the Preferred Charset Label for web pages that are encoded with Big Endian Unicode (or 'Unicode big endian' as Notepad likes to call it).
It is indeed unicodeFFFE.
"But Michael, that is not a valid Unicode code point!" cry some.
"But Michael, that is not what the big endian BOM looks like in memory if one is looking at the bytes!" cry others.
"But Michael, that is not what the big endian BOM looks like on Big Endian systems!" cry some of those remaining.
"Michael, is Microsoft off its rocker?" exclaim a few of the rest (their language is at times less polite, but one email used this language so I decided to go with it).
And believe it or not, there are actually bugs raised by people on several different product teams over the years, who are unhappy with one or more of the following:
And then some of the words from people at Microsoft....
"The byte-order mark for big-endian unicode is FEFF, so this should be UnicodeFEFF. This seems like a valid complaint, but I was wondering if it'd break something else to change it." explains Shawn Steele, the development owner of encodings in Windows and the .NET Framework.
"I think this a mistake in the original MLang data, but we have to keep it for compatibility." explains the developer who used to own MLang, now the MUI Development Lead.
"Yes, it was a misnomer that we inherited from MLang. It’s too late to change that." explains the NLS Development Lead.
"Yes, this was wrong in the initial implementation. But now that apps are coded to it, we cannot change anymore." explains Software Architect Chris Lovett on the SQL Server team.
But the original truth about why it was in MLang in the first place is not quite this insidious. Basically, Windows (and Microsoft) are predominantly Little Endian shops (even when Windows ran on platforms that supported BE, like Alpha, the installs used LE). And when someone on a little endian system reads the big endian BOM in as if it were a WCHAR (thinking it to be a UTF-16 LE code unit), they see 0xFFFE, which is of course not a valid Unicode character. Thus it is easy to see it as a big endian file.
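As a minimal sketch of that little endian view (in Python; the helper name `sniff_utf16_byte_order` is hypothetical and is not taken from MLang or any Windows API), the first two bytes can be read as one little endian WORD and compared against the two possible values:

```python
import struct


def sniff_utf16_byte_order(data: bytes):
    """Guess UTF-16 byte order from a leading BOM, as a little endian reader sees it.

    Returns "little", "big", or None when no UTF-16 BOM is present.
    Assumption: the data is UTF-16 if it has a BOM at all; UTF-32 is ignored.
    """
    if len(data) < 2:
        return None
    # "<H" reads the first two bytes as a single little endian 16-bit WORD.
    (word,) = struct.unpack_from("<H", data)
    if word == 0xFEFF:
        return "little"  # bytes were ff fe
    if word == 0xFFFE:
        return "big"     # bytes were fe ff, so the WORD looks byte-swapped
    return None


print(sniff_utf16_byte_order("\ufeffhi".encode("utf-16-be")))  # big
print(sniff_utf16_byte_order("\ufeffhi".encode("utf-16-le")))  # little
```

Seeing 0xFFFE here is exactly the "this must be big endian" signal that the charset label is named after.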
The BOM is always U+FEFF. Always. ALWAYS. But that means that in memory it is (in BYTEs):
- 0xff 0xfe when it is the little endian BOM on any system;
- 0xfe 0xff when it is the big endian BOM on any system.
This is because big endian systems store the high (big) byte first, whereas little endian systems store the low byte first. Which means that in memory it is (in WORDs):
- 0xfeff when it is the little endian BOM on little endian systems;
- 0xfffe when it is the big endian BOM on little endian systems;
- 0xfffe when it is the little endian BOM on big endian systems;
- 0xfeff when it is the big endian BOM on big endian systems.
Try it yourself on any platform you happen to have handy if you don't believe me. :-)
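For those who would rather not hunt down a big endian machine, here is a small Python sketch that simulates both kinds of host. The helper `as_word` and its `host` parameter are illustrative names only; `int.from_bytes` stands in for what a native 16-bit load would do on each platform.

```python
BOM = "\ufeff"

# The bytes are fixed by the encoding, not by the CPU that wrote them.
le_bytes = BOM.encode("utf-16-le")  # ff fe
be_bytes = BOM.encode("utf-16-be")  # fe ff


def as_word(two_bytes: bytes, host: str) -> int:
    """Read two bytes as a 16-bit WORD on a host of the given byte order.

    host is "little" or "big" (a simulated platform, not the real one).
    """
    return int.from_bytes(two_bytes, host)


for host in ("little", "big"):
    for name, raw in (("little endian BOM", le_bytes),
                      ("big endian BOM", be_bytes)):
        print(f"{name} on {host} endian system: 0x{as_word(raw, host):04x}")
```

The four printed lines should match the four WORD cases listed above: 0xfeff whenever the BOM matches the platform, and 0xfffe whenever it does not.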
The semantic is clear and unambiguous, just not documented very well, and perhaps some would call it a rather silly way to think of it. The name is just acting as a somewhat sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.
And as people already pointed out, it is a bit late to be talking about changing it....
This post brought to you by U+fffe, a permanently reserved noncharacter in Unicode so that BOM determination can remain easier....
# Steven on 11 Sep 2005 9:22 AM:
> 0xffff when it is the little endian BOM on big endian systems;
Is that right? 0xfffe would make more sense.
# Michael S. Kaplan on 11 Sep 2005 9:26 AM:
It would again look like 0xFFFE -- that is always what you will see on any system if you look at it as a WORD and it is the wrong endian for the platform.
Good catch though -- typo now gone. :-)
# CornedBee on 11 Sep 2005 9:49 AM:
Just what is wrong with calling it "bigendian"? That's absolutely unambiguous, and thus superior to "fffe" or "feff", as the confusion over the name has shown.
# Michael S. Kaplan on 11 Sep 2005 9:54 AM:
Nothing is wrong with it -- but it is too late. The fact is that the old name is now out there -- changing it would break a ton of existing clients.
# Nick Lamb on 11 Sep 2005 3:58 PM:
I guess I should explain the source of the complaint from us Unix folks, since as usual Microsoft employees have tried to dodge the issue and make us look like idiots (idiots who somehow had Unicode working properly, with backwards compat. and everything while they fumbled around corrupting people's data...).
Unix has a lot of formatted text files. That is, they're not just a bunch of text which someone happens to be keeping in a file, they have meaning. That meaning is changed (usually to its detriment) if you insert or remove arbitrary data.
Now codepoint 0xfeff may not /look/ like anything, but that doesn't avoid it having meaning. So when a parser is looking for character 0x23 (#) and it finds 0xfeff that's a non-match. So a file which meant "please execute this non-interactive shell" before is nothing but gibberish after Microsoft's tools insert this 0xfeff "marker" character. If instead we find part of the file where arbitrary text is permitted (e.g. a comment, or a filename) and we insert Unicode text, we see that it works fine.
We can do a Raymond Chen thought experiment here. What happens if our popular spreadsheet software decides to pull the same dirty "that character is just a marker" trick on Microsoft's Excel XLS files ? There are all those nice character strings in them which we can mark.
We'll insert the Unicode / ASCII character 0x0 NUL into some of Excel's data strings as a "marker" for our software, and assume that Excel can ignore that (after all, it's a null character, right?). Now we try to load the file back into Excel and... surprise, Excel says it is corrupt and we lose our data.
Note that all of the above post (about trying to figure out whether you're looking at UTF-16 in big endian or little endian form) would be irrelevant if Microsoft had, like everyone else, simply accepted UTF8 as an on the wire/ on disk storage format. Once again Microsoft's desire to be anything-but-Unix costs their customers a lot of time and money for no gain.
# Michael S. Kaplan on 11 Sep 2005 4:15 PM:
Hi Nick -- Well, as I mention in the BOM post referenced above, this is not just about Notepad. And it is about dealing with things as they are, not just as we want them to be.
What Notepad does was not done to specifically break Unix, any more than the Unix refusal to handle something that is legal in text is done specifically to thumb a nose at Microsoft.
Each platform and application has reasons for how they behave, and maybe instead of taking the opportunity to slam Microsoft in predictable fashion, I have a suggestion for the affected UNIX people:
DON'T USE NOTEPAD ON WINDOWS TO EDIT YOUR UNIX SHELL SCRIPTS!
Then the problem is solved. :-)
# Mihai on 11 Sep 2005 5:41 PM:
Hitting a bit left and a bit right, to make everyone angry :-)
I am also unhappy with Notepad adding the BOM on UTF-8. Although I can understand why (sometimes) this is a good thing, I don’t agree with it opening a no-BOM UTF-8 file, correctly identifying it, then adding a BOM when saving.
How I would want it:
- if the original had no BOM, then save should add no BOM
- if the original had BOM, then save should keep the BOM
- "Save As" should give me the option to save with or without BOM
Ok, everyone happy?
Now, Unix and Linux: my problem is with these systems being 100% agnostic about encodings. Although convenient for legacy, it is 100% bad for a lot of things.
Let’s say I get a UTF-8 shell script, no BOM, but my LANG is set to ja_JP.EUC-JP (or Russian with KOI8-R if you like :-) What happens? The script is not identified as UTF-8 and I get junk. If the script does a set LANG=ja_JP.UTF-8, does this mean the full script has to be reloaded as UTF-8 (if already cached)? Or are lines converted one by one? This means I can set the LANG to KOI8-R, execute a KOI8-R line, then EUC-JP and execute an EUC-JP line, then UTF-8 and execute a UTF-8 line. A multi-code script, with no marker to tell me which is what!
Same with the file system: set your locale to ja_JP.EUC-JP, create a file with a Japanese name, then set your locale to UTF-8. Now, you cannot open the file, because the EUC-JP sequence of bytes is an invalid UTF-8 sequence! Ouch!
"My way is the only way" vs. "You can do whatever you want, even hang yourself"
Is one better than the other? Depends. I like it my way. Maybe sometimes I want to hang myself. But usually not :-)
# Jonathan on 12 Sep 2005 3:09 AM:
I think using a byte-order-dependent name for an encoding that's different from another encoding only by byte order is kind of strange. UnicodeBigEndian would be much clearer, as Notepad correctly uses.
And who uses UTF16 on web pages anyways? I thought everyone just uses UTF8 or legacy codepages...
# Doncho on 10 Mar 2008 1:54 PM:
It's quite simple, Microsoft messed it up and did absolutely nothing to fix it (not even a deeply hidden option in Notepad).
So, most users use Windows, so... if we break text files this way (not help fix the problem in any way) all Unix users will get problems... which is good for us. Sounds convincing, doesn't it?
# Michael S. Kaplan on 10 Mar 2008 2:15 PM:
Hmmmm. Not sure what on earth that has to do with Big Endian UTF-16, which UNIX chokes on just as often as little endian UTF-16.
If UNIX wants to claim to have UTF-8 support except for that one character (the BOM) then I have a simple solution: STOP WRITING YOUR UNIX SHELL SCRIPTS IN WINDOWS NOTEPAD AND SAVING THEM AS UTF-8!!!!
# Yuhong Bao on 11 Mar 2009 5:59 PM:
"Note that all of the above post (about trying to figure out whether you're looking at UTF-16 in big endian or little endian form) would be irrelevant if Microsoft had, like everyone else, simply accepted UTF8 as an on the wire/ on disk storage format. Once again Microsoft's desire to be anything-but-Unix costs their customers a lot of time and money for no gain."
MS was an early Unicode adopter and UTF-8 was created in 1992; by then it was too late: the first NT betas had already come out, and NT was released in 1993.