Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)

by Michael S. Kaplan, published on 2005/01/20 02:07 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx

(The alternate title should be spoken with either a circa-1982 Jeff Spicoli or circa-1989 Theodore "Ted" Logan mannerism and accent)

U+feff has two jobs in the Unicode standard:

Job #1, and its namesake, is as a ZERO WIDTH NO-BREAK SPACE. The name kind of says it all. After all, we have U+00a0 (NO-BREAK SPACE) and U+200b (ZERO WIDTH SPACE). Yet we somehow needed to combine the two to create a character that has no width yet you should not break a line between what is on one side of it and what is on the other. Later they decided to add a different character as the preferred one for this job (U+200d, ZERO WIDTH JOINER), in part due to U+feff's violation of the moonlighting agreement and its apparent lack of focus (see job #2). But at its heart a conformant Unicode application does not have to do anything special with U+feff because this is a character that has no width. If it is between two characters and you completely ignore it, then you will get identical results to it not being there at all.

Update 21 January 2005 -- TLKH pointed out that the actual character that took over job #1 after U+feff was fired (well, deprecated) from this job is U+2060 (WORD JOINER) and not U+200d (ZERO WIDTH JOINER).
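For readers who want to check the cast of characters themselves, here is a small sketch that looks up the code points named above in Python's bundled Unicode database. The helper name describe is just for illustration, and the exact category values printed depend on the Unicode version that ships with the interpreter.

```python
import unicodedata

# Code points mentioned in the post: the two "parents", the two candidate
# replacements, and U+FEFF itself.
CODE_POINTS = [0x00A0, 0x200B, 0x200D, 0x2060, 0xFEFF]


def describe(cp: int) -> str:
    """Format a code point with its formal Unicode name and general category."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"


for cp in CODE_POINTS:
    print(describe(cp))
```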

Job #2 is to act as a BOM, a Byte Order Mark. A signature at the beginning of a file with no "wrapper" to indicate its encoding -- someone could look at the byte stream and know by the pattern of bytes what the encoding might be:

00 00 fe ff      UTF-32, Big Endian

ff fe 00 00      UTF-32, Little Endian

fe ff ## ##      UTF-16, Big Endian

ff fe ## ##      UTF-16, Little Endian

ef bb bf         UTF-8

And that last line is where it starts to get weird for people. Lots of the folks who support UTF-8 in other standards like XML note that you do not need the BOM when you have other means to document the encoding, and that those standards use UTF-8 as their default encoding when none is specified anyway. And lots of other folks who support Unix tools that did not have to be completely changed to support Unicode, thanks to UTF-8, do not like these extra three bytes at the front of the file. Sometimes that is because they really only support ASCII or ISO-8859-1; other times it is because they just can't handle those three bytes right at the front, even though the same bytes later in the file would not matter.
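The signature check described above can be sketched in a few lines of Python. This is only an illustration: the function name sniff_bom and the encoding labels it returns are assumptions, and the byte patterns come from the constants in Python's standard codecs module. The one subtlety is ordering, since the UTF-32 little endian signature begins with the same two bytes as the UTF-16 little endian one, so the longer patterns have to be tried first.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (ff fe 00 00) starts with
# the UTF-16 LE BOM (ff fe), so checking UTF-16 first would misreport it.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that a BOM only says what the encoding might be: UTF-16 little endian text whose first real character is U+0000 would be indistinguishable from UTF-32 little endian by this test.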

Enter Microsoft.

(Yes, I know -- boo, hiss, etc.)

Microsoft has an application called Notepad. Application is perhaps an overstatement; it's just an uber-wrapper for a Win32 EDIT control. It stores the text in the edit control, which means if you open a file it must literally load all of the data to stick it into the control (which is by the way why it takes forever to open huge files). Over the years minor features and tweaks have been added. But in its soul it is just a plain old edit control.

When it was ported to NT the option of saving Unicode files had to be there since Unicode was there, so they added it. It was Little Endian UTF-16 (that was all the platform really supported back then) but they just called it Unicode since it was vaguely more likely that someone might have heard of that. And that is what the rest of the platform was doing.

Then in Windows 2000 they added the ability to save a file as Big Endian UTF-16 and since they were already calling Little Endian UTF-16 Unicode they decided to call the other form Unicode (Big Endian). I do not think this is so bad, certainly less controversial than calling it unicodeFFFE¹, but it definitely did irk some people who did not like one format being called Unicode as if others were not.

Incidentally, I think those people are probably right. But the number of people who don't really care what it is called since they will never use it does outnumber all of the people who do care, so I kind of understand the logic behind the lack of detail that would confuse...

But then the worst sin of all was committed -- Notepad also added UTF-8 support. And of course the issue with the BOM had to come up.

The folks on the Shell team who did this recognized that if the file only had ASCII characters that it could be called UTF-8 or it could just be using the default system code page. So if a user intentionally saved it as UTF-8 then they would be confused if opening it again would not appear to remember that it had been saved in such a way. So they add a BOM when it is UTF-8, to tag it as UTF-8 in a way that is 100% conformant with Unicode.
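The ambiguity is easy to see in a short Python sketch. Here cp1252 stands in for "the default system code page", which is an assumption since the real value varies by machine; utf-8-sig is Python's name for UTF-8 with a leading BOM.

```python
text = "plain ASCII only"

as_utf8 = text.encode("utf-8")
as_code_page = text.encode("cp1252")  # assumed stand-in for the system code page

# For ASCII-only text the two byte streams are identical, so nothing in the
# file itself records which encoding the user asked for.
assert as_utf8 == as_code_page

# Saving with the BOM adds three bytes that tag the file as UTF-8.
with_bom = text.encode("utf-8-sig")
assert with_bom == b"\xef\xbb\xbf" + as_utf8
```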

This is completely legal and since Notepad is just a simple "Hans & Franz" wrapper around an EDIT control, it has no other means of understanding "envelope" information to tell anyone what the encoding is. What else could they do? The bug is in the people who use Notepad to edit HTML and XML, because those formats do not require a BOM. People still use it as a convenient editor of files, but the caveats are pretty clear....

People like Raymond Chen have posted about how some files come up strange in Notepad, but generally people do not have complaints about the way Notepad behaves.

But every 4-6 months another huge thread on the Unicode List gets started about how bad the BOM is for UTF-8 and how it breaks UNIX tools that have been around and able to support UTF-8 without change for decades² and about how Microsoft is evil for shipping Notepad that causes all of these problems and how neither the W3C nor Unicode would have ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, and so on, and so on.

We are about 30+ messages into such a thread right now, believe it or not. That did not inspire this post so much as the image of Sean and Keanu talking about it like surfer dudes did, though. :-)

No one ever has answers about the fact that if someone really supports Unicode, they should be able to handle a ZERO WIDTH NO-BREAK SPACE without breaking a sweat. If they can't, their tool or utility or application or whatever they have is broken, and it's their bug, not Microsoft's. At least, if they claim to support UTF-8, that is. Tools that support "all of UTF-8 as long as it starts with ASCII" and tools that cannot handle these three bytes at all are not really supporting UTF-8.
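On the consuming side, tolerating the signature can be very little work. A minimal sketch in Python, where the function name read_utf8_text is hypothetical: the standard utf-8-sig codec drops one leading BOM if it is there, accepts input without one, and leaves any U+feff later in the text alone.

```python
def read_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, skipping a single leading BOM when present."""
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbfabc"
without_bom = b"abc"
interior = "a\ufeffb".encode("utf-8")  # U+FEFF used mid-text, not as a signature

print(repr(read_utf8_text(with_bom)))
print(repr(read_utf8_text(without_bom)))
print(repr(read_utf8_text(interior)))
```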

And by the way, that includes Microsoft applications, too. In my opinion Frontpage 2000 kinda stunk (all things considered) because of this problem, even though they added the cool "don't screw with my HTML" setting that I liked so much. I was very happy when Frontpage 2002 and 2003 fixed this problem. Just like I'm sure most others would be happy if people fixed their tools, as well....

I thought I'd briefly quote one of the posts to the Unicode List that was just done by Peter Constable:

As for whether plain text files can have a BOM, that is one of the few unending debates that arise with certain (fortunately not too frequent) regularity, each time with vociferous expressions of deeply-held beliefs but never any resolution. I'll just observe that the formal grammar for XML does not make reference to a BOM, yet the XML spec certainly assumes that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in any Unicode encoding form/scheme). Rather than have a philosophical debate about the definition of "plain text file", I suggest a more pragmatic approach: for better or worse, plain text processes that support UTF-8 are going to encounter UTF-8 data beginning with a BOM: learn to live with it!

I agree 100% with his words and wish I could summarize the issues as clearly and as effectively as he can. :-)

For the record, it has occurred to me in the past that it would not be a bad idea to add an option to save files without the BOM. Of course that would mean having to document it for people who probably struggle with the difference between Unicode and Unicef³. That does make this something of an uphill battle (doc. changes are the hardest and most resource intensive in changes like this), but perhaps worthy of a try. Maybe they could take out some of that "UTF-8 is for legacy" stuff that is in Notepad help now while they are there. What do you think? :-)

1 - Believe it or not, unicodeFFFE is actually documented as Internet Explorer's Preferred Charset Label for Unicode (Big-Endian). Periodically people report the name as a bug, since there is no such code point in Unicode as U+fffe. But the reason for the name is that if you look at a BOM of UTF-16 big endian on a system that is little endian, it will look like FFFE. Since that is not a valid character, it is easy to tell on a Little Endian system that the file must be Big Endian Unicode. The name is just acting as a sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.
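The byte swap behind that name can be checked with a couple of lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
import struct

# The UTF-16 big endian BOM is the two bytes fe ff.
bom_be = "\ufeff".encode("utf-16-be")

# Read those same two bytes as one 16-bit unit in each byte order.
(value_le,) = struct.unpack("<H", bom_be)  # as a little endian machine sees it
(value_be,) = struct.unpack(">H", bom_be)  # as a big endian machine sees it

print(f"little endian view: U+{value_le:04X}")
print(f"big endian view:    U+{value_be:04X}")
```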

2 - Never mind that Unicode has not existed for that long, let alone UTF-8!

3 - Someone once asked me at a conference how saving a file is able to contribute to a charity, and was it like one of those fake email chain letter things on her machine? And I did not laugh, though I admit I smiled pretty broadly as I explained to her about how Unicode was not Unicef. And I did laugh a bit afterward.

This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, of course)
Though he was a little bitter about the lack of visible representation here, I was unable to find the little guy to spray paint him so that you could all see him here today. He is between those quotes, I can promise you that.

# Mike Dunn on 20 Jan 2005 10:32 AM:

Strange things are afoot at the U+004B U+20DD ;)

# Centaur on 20 Jan 2005 1:07 PM:

The XML specification explicitly permits a UTF-16 BOM at the beginning of the file or stream. Otherwise, it must start with the XML declaration (<?xml version=…>), no whitespace or other characters allowed. At least that’s how I’d interpret sections 4.3.3 and 2.1.

# Dean Harding on 20 Jan 2005 2:45 PM:

Heh, I used to work for Unisys. I always felt bad correcting people when they thought I said "Unicef", cause suddenly I'm not such the good samaritan that they thought I was...

# Michael Kaplan on 20 Jan 2005 5:04 PM:

Mike Dunn -- something not Kosher? :-)

Centaur -- the XML spec allows the BOM; it even describes it. So anyone who does not allow it does so at their peril....
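As one data point, the XML parser in Python's standard library accepts a document that starts with the UTF-8 signature. A minimal sketch, with a made-up document:

```python
import codecs
import xml.etree.ElementTree as ET

# A small, made-up XML document preceded by the three-byte UTF-8 BOM.
doc = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><greeting>hello</greeting>'

root = ET.fromstring(doc)
print(root.tag, root.text)
```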

# Michael Kaplan on 20 Jan 2005 5:43 PM:

The Unicode FAQ talks about this issue a bit, also.

http://www.unicode.org/faq/utf_bom.html#BOM

With the number of bytes wasted in web/email communication over a character that takes up only 2-4 bytes in storage and no visible space, it is no wonder that people find Unicode to be complicated!

# Michael Grier [MSFT] on 20 Jan 2005 8:17 PM:

Back in visual studio, we had a few people who were really focussed on getting the editors to be really good Unicode citizens. My (possibly revisionist) history is that we actually introduced use of the utf-8 BOM over there around the time of win98 (vs 6). NT caught up when visual studio users were creating "text files" (whatever the heck /that/ means... :-) that other people couldn't open in notepad.

Re: so much attention:

My 1st dev mgr at Microsoft always noted that it was the little picayune issues that drew the most heated debates because everyone felt they understood /all/ the issues.

to quote Kosh: the avalanche has started, it is too late for the pebbles to vote.

UTF-8 has a BOM and people just need to learn to love it. (The tricky question is when to preserve/not preserve a BOM found in a byte stream...) I think you're right; just because something is 8-bit clean doesn't make it a good utf-8 citizen. It has to be very careful not to split an encoding (just like a good UTF-16 citizen has to know not to split high/low surrogates...)
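The point about not splitting an encoding can be illustrated with a Python sketch. The helper name decode_chunks and the chunk boundaries are assumptions chosen so that both a two-byte and a four-byte UTF-8 sequence are cut in the middle. An incremental decoder holds on to the partial sequence until the rest arrives, where decoding each chunk on its own would fail.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks without breaking multi-byte sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the stream ends in the middle of a sequence.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "caf\u00e9 \U0001F600".encode("utf-8")
# Cut inside the 2-byte sequence for U+00E9 and inside the 4-byte
# sequence for U+1F600.
chunks = [data[:4], data[4:7], data[7:]]

print(decode_chunks(chunks) == data.decode("utf-8"))
```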

# Michael Kaplan on 20 Jan 2005 10:29 PM:

Interesting! I had not heard this before... but I guess the timing is right. I never remember trying UTF-8 in VS6, did it really work?

# Mo on 20 Jan 2005 11:42 PM:

I think the confusion reigns because people expect saving a file as UTF-8 to mean "Save it as UTF-8 if it contains non-ASCII characters, and ASCII otherwise", so they expect the BOM to be only present if characters with values greater than 127 are contained within the file.

# Serge Wautier on 21 Jan 2005 12:38 AM:

What is supposed to be the caret behaviour when encountering such a character ?

I pasted the sponsor message into Notepad and I noticed that even though you don't see the BOM, you can definitely 'feel' it when moving the caret : You need to press the arrow key twice between the 2 ".

Does it mean that it's not completely true to say that apps may safely ignore it, especially at the beginning of a doc? If the app provides editing of the contents, users will have a weird experience and bug reports will flood in!

Also, how does text rendering work ? The BOM is not in the font I use in Notepad.

# TLKH on 21 Jan 2005 1:39 AM:

As far as I remember from the time when I implemented unicode line breaking algorithm for my editor, U+200d allows breaking before/after it.
The real zero-width-non-breaking-space character (except for BOM) is U+2060, not mentioned in this article.

# Michael Kaplan on 21 Jan 2005 6:38 AM:

TLKH is right -- U+2060 (WORD JOINER) is the preferred character that took on the job formerly occupied by Job #1 of the ZWNBSP. I will put a correction in on the page.

# Michael Kaplan on 21 Jan 2005 6:45 AM:

Serge -- hard to say what the caret behavior should be here -- after all it *is* a space, even though it is zero width. The fact that it is deprecated makes it even less likely that implementations will do much more than ignore it....

# Robert on 21 Jan 2005 11:51 PM:

"For the record, it has occurred to me in the past that it would not be a bad idea to add an option to save files without the BOM."

It would be convenient if UTF-8 could be selected as the "ANSI" codepage in the control panel's advanced regional and language options. Then Notepad and many applications designed for ANSI would automatically support UTF-8 (without BOM). I would prefer this because nowadays I rarely create text files with legacy ANSI encoding.

For those few applications that make specific assumptions about the ANSI codepage (hard-coded strings with character codes >= 128 etc.), AppLocale provides a good solution:

http://www.microsoft.com/globaldev/tools/apploc.mspx

(A UTF-8 "ANSI" codepage may cause problems if the API implementation depends on assumptions like "ANSI character <= double-byte").

# Michael Kaplan on 22 Jan 2005 12:40 AM:

Unfortunately, this is not possible -- there are too many bugs in Windows and in apps for components that will not work with UTF-8 here....

# Michael Grier [MSFT] on 23 Jan 2005 3:22 PM:

Re: vs6 and utf-8:

It did in the new shell ("vegas" as I recall). Only Visual Interdev and Visual J++ used the new shell.

# Michael Kaplan on 23 Jan 2005 3:42 PM:

Ah yes, that part is true, and I have actually used that before to look at some of the collation source files back before the VS.NET shell was solid enough for daily use.

I never knew it drove the Notepad feature, though -- that's cool. Fascinating how one piece of the company drives another, sometimes....

# Chris Walker on 26 Jan 2005 8:26 AM:

I've been maintaining Notepad for Windows NT/2000/XP/Server 2003 for more than 10 years, so I know some of the history of it.

First off, any additional complexity to the interface has always had heavy pushback from management and users. Second, it has had to respond to various changes in commonly used character sets over the years.

Notepad has to guess the character encoding if it does not know what it is. It uses the IsTextUnicode() API to help it, but in the end, it is still a guess. It may be worth a blog entry to discuss just this API and its (mis)usage.

Notepad only edits Unicode, so all other formats are converted to Unicode when the file is opened and converted back if possible to its original format when saved. If the saved format is some form of Unicode, it will also output the BOM at the beginning of the file. If the file has a BOM, then there is no need to call the unreliable IsTextUnicode() API the next time it is opened.

Notepad remembers what the format of the file was when it was read in and uses this format as the default to save. If the edited file can not be saved in the same format without data loss, the user is warned when saving. Otherwise, no UI is thrown. The net result is ASCII files stay ASCII and Unicode files stay Unicode.

Notepad will never send the Edit Control the BOM. It will skip the BOM if it exists.
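The load and save behavior described above might be sketched like this in Python. It is an illustration, not Notepad's code: the function names, the BOMS table, and the fixed cp1252 fallback are all assumptions, and the fallback stands in for the IsTextUnicode() guess, which is not reproduced here.

```python
import codecs

# The Unicode formats in this sketch, keyed by Python codec name.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def load(data: bytes, fallback: str = "cp1252"):
    """Return (text, format). The BOM is skipped and never reaches the text."""
    for encoding, bom in BOMS.items():
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    # No BOM: a real editor has to guess here. This sketch just assumes a
    # code page, and strict decoding may raise for bytes it does not define.
    return data.decode(fallback), fallback


def save(text: str, encoding: str):
    """Return (bytes, True), or (None, False) when saving would lose data."""
    try:
        body = text.encode(encoding)
    except UnicodeEncodeError:
        return None, False  # the caller would warn the user at this point
    # Unicode formats get their BOM back; code page formats get none.
    return BOMS.get(encoding, b"") + body, True
```

Because the format found at load time is passed back to save, a file opened without a BOM is written without one, and a file opened with one keeps it.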

History:

NT 3.1 shipped with an ASCII only Notepad. In the fall of 1993, several applications were converted to use Unicode. At this time, Notepad started using the BOM. I can't tell you how this decision was made; my memory isn't that good. We also converted other applications like Cardfile and Paintbrush. These first shipped on NT 3.5.

With the advent of the popularity of the Internet, other character formats needed to be supported. This is where support for Big Endian Unicode came from. I'm sorry for the name that was used, but that was the best we could do at the time. The BOM helped a lot for this. I believe the first time Big Endian Unicode support was shipped was NT 5.0, er, Windows 2000.

It would be easy enough to have Notepad not output a BOM. It would *just* be a UI change in the SaveFile dialog.

The performance of Notepad on large files is not completely related to the fact that it reads the whole file into memory. Remember that in the ASCII file case, it has to convert the whole file to Unicode before it can start. As it turns out, this is very fast compared to the CPU bound work that the Multiline Edit Control does to build up some internal data structures. Of course, if reading the file into memory requires the OS to page to the pagefile, you are going to be hurting. Even after you get a big file swallowed, your experience editing this file will not be pretty. Just try adding or deleting a character at a time. The Edit Control exposes the memory to the application and is required to be ready to save at any time. So you can imagine that adding one character will shuffle all the characters above it down one character. Fine for small files, but a killer algorithm for a large file.

One could add complexity to Notepad and solve some of these problems. One could add a "BOM" checkbox to the save dialog complete with an explanation as to what a BOM is and why the end user should care. One could add options to save ASCII files in various code pages complete with code page documentation. One could add a preview on the Open Dialog and allow the user to pick the proper character encoding. One could scan files looking for encoding text and if found use that as the default.

Notepad is not just a wrapper for the Edit control. In addition to the file encoding problems, it also has reasonable printing and a text search capability which has some interesting International issues. It hosts just about every Common Dialog (Find, ReplaceText, Print, PageSetup, Open, Save). As far as I can figure, it is only missing ChooseColor. I know this because bugs in Common Dialogs are often reported to me first.

An interesting blog entry would discuss the different End of Line sequence standards. Windows uses carriage-return/linefeed pairs as the legal EOL, while Un*x implementations tend to use newline. You can see this in Notepad when you load a file that just uses the newline as the EOL since the Edit Control uses the Windows standard and bare newline characters are not considered EOLs.
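A small Python sketch of that end-of-line mismatch. The helper names are hypothetical, and split_crlf_only only models a consumer that treats the CR/LF pair as the sole line break; it is not the Edit Control's actual logic.

```python
unix_text = "one\ntwo\nthree"
windows_text = "one\r\ntwo\r\nthree"


def split_crlf_only(text: str):
    """Split the way a CRLF-only consumer would: a bare LF is not a break."""
    return text.split("\r\n")


def to_crlf(text: str) -> str:
    """Normalize line ends to CRLF without doubling any existing CRLF pairs."""
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


print(split_crlf_only(unix_text))           # one long "line"
print(split_crlf_only(to_crlf(unix_text)))  # three lines after normalizing
```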

# Michael Kaplan on 26 Jan 2005 8:39 AM:

Awesome info, Chris! And I agree that some of the entries you mention would make good future posts.

Especially IsTextUnicode (which is kind of a pain for us since we do not own it but everyone assumes we do), and the EOL issue in *nix-created files (which gets some complaints but much less often than the BOM issue).

Sorry if any of the backtalk about "just an edit control" caused grief -- it is a shorthand we sometimes use when talking about rendering issues to distinguish it from WordPad (which is "just" a wrapper around RichEdit, in the same sense).

Obviously both are more than "just" a wrapper so I will choose those words with more care. As a team that finds most of its bugs reported through applications that use us even if the bug is not ours, I can understand bugs being reported that actually are not ours....

But thanks for the info, it makes the post itself markedly better to have the full story from one who knows rather than after-the-fact guesses and suppositions.

I don't suppose you know who we should contact now if we wanted to push the BOM issue? (feel free to send it to me offline if you'd prefer). :-)

# Anonymous on 11 Sep 2005 4:19 AM:

Yesterday, Buck Hodges was talking about how TFS Version Control determines a file's encoding: ...

# Yuhong Bao on 11 Mar 2009 11:34 PM:

"NT 3.1 shipped with an ASCII only Notepad. In the fall of 1993, several applications were converted to use Unicode. At this time, Notepad started using the BOM. I can't tell you how this decision was made; my memory isn't that good. We also converted other applications like Cardfile and Paintbrush. These first shipped on NT 3.5."