by Michael S. Kaplan, published on 2005/01/20 02:07 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx
(The alternate title should be spoken with either a circa-1982 Jeff Spicoli or circa-1989 Theodore "Ted" Logan mannerism and accent)
U+feff has two jobs in the Unicode standard:
Job #1, and its namesake, is as a ZERO WIDTH NO-BREAK SPACE. The name kind of says it all. After all, we have U+00a0 (NO-BREAK SPACE) and U+200b (ZERO WIDTH SPACE). Yet we somehow needed to combine the two to create a character that has no width yet you should not break a line between what is on one side of it and what is on the other. Later they decided to add a different character is the preferred one for this job (U+200d, ZERO WIDTH JOINER), in part due to U+feff's violation of the moonlighting agreement and its apparent lack of focus (see job #2). But at its heart a conformant Unicode application does not have to do anything special with U+feff because this is a character that has no width. If it is between two characters and you completely ignore it, then you will get identical results to it not being there at all.
Update 21 January 2005 -- TLKH pointed out that the actual character that took over job #1 after U+feff was fired (well, depracated) from this job is U+2060 (WORD JOINER) and not U+200d (ZERO WIDTH JOINER).
Job #2 is to act as a BOM, a Byte Order Mark. A signature at the beginning of a file with no "wrapper" to indicate its encoding -- someone could look at the byte stream and know by the pattern of bytes what the encoding might be:
00 00 fe ff UTF-32, Big Endian
fe ff 00 00 UTF-32, Little Endian
fe ff ## ## UTF-16, Big Endian
ff fe ## ## UTF-16, Little Endian
ef bb bf UTF-8
And that last line is where it starts to get weird for people. Because lots of the folks who support UTF-8 in other standards like XML note that you do not need the BOM when you have other means to document the encoding and which use UTF-8 as their default encoding when none is specified anyway. And lots of other folks who support Unix tools that did not have to be completely changed to support Unicode by using UTF-8 do not like these extra three bytes at the front of the file. Sometimes that is because they really only support ASCII or ISO-8859-1, other times it is because they just can't handle those three bytes right in front but later on would not matter.
(Yes, I know -- boo, hiss, etc.)
Microsoft has an application called Notepad. Application is perhaps an overstatement; its just an uber-wrapper for a Win32 EDIT control. It stores the text in the edit control, which means if you open a file it must literally load all of the data to stick it into the control (which is by the way why it takes forever to open huge files). Over the years minor features and tweaks have been added. But in its soul it is just a plain old edit control.
When it was ported to NT the option of saving Unicode files had to be there since Unicode was there, so they added it. It was Little Endian UTF-16 (that was all the platfom really supported back then) but they just called it Unicode since it was vaguely more likely that someone might have heard of that. And that is what the rest of the platform was doing.
Then in Windows 2000 they added the ability to save a file as Big Endian UTF-16 and since they were already calling Little Endian UTF-16 Unicode they decided to call the other form Unicode (Big Endian). I do not think this is so bad, certainly less controversial than calling it unicodeFFFE1, but it definitely did irk some people who did not like one format being called Unicode as if others were not.
Incidentally, I those people are probably right. But the number of people who don't really care what it is called since they will never use it does outnumber all of the people who do care, so I kind of understand the logic behind the lack of detail that would confuse...
But then the worst sin of all was committed -- Notepad also added UTF-8 support. And of course the issue with the BOM had to come up.
The folks on the Shell team who did this recognized that if the file only had ASCII characters that it could be called UTF-8 or it could just be using the default system code page. So if a user intentionally saved it as UTF-8 then they would be confused if opening it again would not appear to remember that it had been saved in such a way. So they add a BOM when it is UTF-8, to tag it as UTF-8 in a way that is 100% conformant with Unicode.
This is completely legal and since Notepad is just a simple "Hans & Franz" wrapper around an EDIT control, it has no other means of understanding "envelope" information to tell anyone what the encoding is. What else could they do? The bug is in the people who use Notepad to edit HTML and XML, because they do not require a BOM. People still use it as a convenient editor of files, but the caveats are pretty clear....
People like Raymond Chen have been posted about how Some files come up strange in Notepad but generally people do not have complaints about the way Notepad behaves.
But every 4-6 months another huge thread on the Unicode List gets started about how bad the BOM is for UTF-8 and how it breaks UNIX tools that have been around and able to support UTF-8 without change for decades2 and about how Microsoft is evil for shipping Notepad that causes all of these problems and how neither the W3C nor Unicode would have ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, and so on, and so on.
We are about 30+ messages into such a thread right now, believe it or not. That did not inspire this post so much as the image of Sean and Keanu talking about it like surfer dudes did, though. :-)
No one ever has answers about the fact that if someone really supports Unicode, they should be able to handle a ZERO WIDTH NO-BREAK SPACE without breaking a sweat. If they can't, their tool or utility or application or whatever they have is broken, and it's their bug, not Microsoft's. At least, if they claim to support UTF-8, that is. Tools that support "all of UTF-8 as long as it starts with ASCII" and tools that cannot handle these three bytes at all are not really supporting UTF-8.
And by the way, that includes Microsoft applications, too. In my opinion Frontpage 2000 kinda stunk (all things considered) because of this problem. Even though they added the cool "don't screw with my HTML setting that I liked so much. I was very happy when Frontpage 2002 and 2003 fixed this problem. Just like I'm sure most others would be happy if people fixed their tools, as well....
I thought I'd briefly quote one of the posts to the Unicode List that was just done by Peter Constable:
As for whether plain text files can have a BOM, that is one of the few unending debates that arise with certain (fortunately not too freguent) regularity, each time with vociferous expressions of deeply-held beliefs but never any resolution. I'll just observe that the formal grammar for XML does not make reference to a BOM, yet the XML spec certainly assumes that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in any Unicode encoding form/scheme). Rather than have a philosophical debate about the definition of "plain text file", I suggest a more pragmatic approach: for better or worse, plain text processes that support UTF-8 are going to encounter UTF-8 data beginning with a BOM: learn to live with it!
I agree 100% with his words and wish I coulsd summarize the issues as cleary and as effectively as he can. :-)
For the record, it has occurred to me in the past that it would not be a bad idea to add an option to save files without the BOM. Of course that would mean having to document it for people who probably struggle with the difference between Unicode and Unicef3. That does make this something of an uphill battle (doc. changes are the hardest and most resource intensive in changes like this), but perhaps worthy of a try. Maybe they could take out some of that "UTF-8 is for legacy" stuff that is in Notepad help now while they are there. What do you think? :-)
1 - Believe it or not, unicodeFFFE is actually documented as Internet Explorer's Preferred Charset Label for Unicode (Big-Endian). Periodically people report the name as a bug, since there is no such code point in Unicode as U+fffe. But the reason for the name is that if you look at a BOM of UTF-16 big endian on a system that is little endian, it will look like FFFE. Since that is not a valid character, it is easy to tell on a Little Endian system that the file must be Big Endian Unicode. The name is just acting as a sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.
2 - Never mind that Unicode has not existed for that long, let alone UTF-8!
3 - Someone once asked me at a conference how saving a file is able to contribute to a charity, and was it like one of those fake email chain letter things on her machine? And I did not laugh, though I admit I smiled pretty broadly as I explained to her about how Unicode was not Unicef. And I did laugh a bit afterward.
This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, of course)
Though he was a little bitter about the lack of visible representation here, I was unable to find the little guy to spray paint him so that you could all see him here today. He is between those quotes, I can promise you that.
# Mike Dunn on 20 Jan 2005 10:32 AM:
# Centaur on 20 Jan 2005 1:07 PM:
# Dean Harding on 20 Jan 2005 2:45 PM:
# Michael Kaplan on 20 Jan 2005 5:04 PM:
# Michael Kaplan on 20 Jan 2005 5:43 PM:
# Michael Grier [MSFT] on 20 Jan 2005 8:17 PM:
# Michael Kaplan on 20 Jan 2005 10:29 PM:
# Mo on 20 Jan 2005 11:42 PM:
# Serge Wautier on 21 Jan 2005 12:38 AM:
# TLKH on 21 Jan 2005 1:39 AM:
# Michael Kaplan on 21 Jan 2005 6:38 AM:
# Michael Kaplan on 21 Jan 2005 6:45 AM:
# Robert on 21 Jan 2005 11:51 PM:
# Michael Kaplan on 22 Jan 2005 12:40 AM:
# Michael Grier [MSFT] on 23 Jan 2005 3:22 PM:
# Michael Kaplan on 23 Jan 2005 3:42 PM:
# Chris Walker on 26 Jan 2005 8:26 AM:
# Michael Kaplan on 26 Jan 2005 8:39 AM:
# Anonymous on 11 Sep 2005 4:19 AM:
Yuhong Bao on 11 Mar 2009 11:34 PM:
"NT 3.1 shipped with an ASCII only Notepad. In the fall of 1993, several applications were converted to use Unicode. At this time, Notepad started using the BOM. I can't tell you how this decision was made; my memory isn't that good. We also converted other applications like Cardfile and Paintbrush. These first shipped on NT 3.5."
BTW, look at how hard it was to actually insert a Unicode-only character back then. For comparison, today, I use a English Windows PC with Chinese IMEs installed and I use Notepad to create Chinese text files, and the code page of English Windows is 437, which meant that I have to save my Chinese text files as Unicode.
Yuhong Bao on 12 Mar 2009 10:53 PM:
"NT 3.1 shipped with an ASCII only Notepad."
Do you mean ANSI-only?
colin on 30 Jan 2010 4:44 AM:
Your article says:
"fe ff 00 00 UTF-32, Little Endian"
I hope you didn't ship software that relies on that!
It's "FF FE 00 00"!
2010/08/14 (It wasn't me)
2007/01/13 RichTextBox breaking ranks?
2006/10/28 You can just byte me
2006/05/26 7-bit UTF-8?
2006/01/21 Return of the Mark
2005/09/11 unicodeFFFE... is Microsoft off its rocker?
2005/09/11 Working hard to detect code pages
go to newer or older post, or back to index or month or day