by Michael S. Kaplan, published on 2010/02/23 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/02/23/9967789.aspx
NOTEPAD adds a BOM (Byte Order Mark) when you save a file in the UTF-8 encoding.
You'd think that since Windows Notepad has been doing this for over 319680000 seconds[2], and that the combined usage of Windows 2000[3], Windows XP, Windows Server 2003, Windows XP 64-bit, Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2 is so high that it may well blow your mind to calculate the number, that people would have gotten over this by now.
As recently as yesterday[4], people were making comments again in that "Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?" blog, the one where I officially suggested that the people who don't like Notepad's behavior of inserting a BOM in front of UTF-8 files had a simple remedy:
STOP USING WINDOWS NOTEPAD!
Yet for some reason people are still arguing it.
Please give up, it is over. If you were in a contest or duel for this[5], then you have lost the contest, been bested in the duel. The game is over[6].
A long time ago, someone decided that:
you should not be prompted[8] in a way like this:
and so that was the way the feature was coded.
There is probably an alt.i.hate.microsoft newsgroup somewhere on USENET that would be happy to hear your complaint on the matter.
But the world has moved on.
And Notepad (the apparent premier tool of UNIX shell script authors throughout the world) has let down a segment of customers who could have updated whatever is reading the scripts in less than a day, rather than complaining about this on and off for the last ~3700[9] plus days.
Your sacrifice is appreciated.
But please, go home now.
P.S. Isn't there some tool on UNIX that does this correctly[10]?
P.P.S. I will not include a screenshot of my private Notepad; I'm not trying to tease you here that badly....
1 - Well, not on the private Notepad I build from time to time from the Windows source, but that one is not one that is released to the public.
2 - Over ten years, give or take
3 - Where this first started happening.
4 - The day before today.
5 - Which none of you were, who are you kidding?
6 - Even more over than the Canadians in that game last night.
7 - Which ironically, most UNIX shell scripts are.
8 - This is a cool feature too, by the way.
9 - Over ten years, give or take.
10 - By your definition of "correctness", at least - a BOM-less UTF-8 save.
# hair beauty on 23 Feb 2010 7:20 AM:
STOP USING WINDOWS NOTEPAD!
# Paul Clapham on 23 Feb 2010 12:44 PM:
But that's not a remedy at all. The usual problem is that some other donkey (not me) thought that Notepad was a suitable editor for producing an XML document.
However I'm not one of the people complaining. It's quite obvious to me that even if somebody at Microsoft did decide to fix the problem (or however you want to look at it) it would take years for the problem to actually go away. So it's easier to just deal with the situation.
# Michael S. Kaplan on 23 Feb 2010 12:49 PM:
For the XML case, you can blame the parser. From the spec [Emphasis mine]:
4.3.3 Character Encoding in Entities
Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
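A minimal sketch of the signature check that passage describes, in Python. The function name `sniff_signature` is made up for illustration. It looks only at the leading bytes, so it is not a complete XML encoding detector: it ignores the encoding declaration and does not handle UTF-32.

```python
import codecs

def sniff_signature(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0) if absent."""
    # UTF-8 signature: EF BB BF
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    # UTF-16 little-endian signature: FF FE
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    # UTF-16 big-endian signature: FE FF
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    # No signature: the caller has to fall back to some other rule.
    return None, 0
```

A parser that follows the spec would skip `bom_length` bytes before handing the rest to its tokenizer, so the signature never shows up as character data.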
# William Reading on 23 Feb 2010 1:11 PM:
Notepad++ supports syntax highlighting and multiple encodings. It seems silly to edit unix shell scripts with vanilla notepad, especially since it doesn't represent linebreaks in the same way.
# Seth on 23 Feb 2010 1:12 PM:
Sweet, a blog post just for me!
I'm not entirely sure what you mean by 'game over,' but it seems to be something along the lines of 'this is what Notepad does and that's final.' I have to agree that that's probably never going to be fixed (though developers for new apps probably have other options today, like using a file attribute, say). The thing we can forever lament (loudly, at every opportunity) is that people don't take your advice and we end up with people complaining about problems like http://support.microsoft.com/kb/301623
And then there's another kind of problem:
Anyone who agrees with Peter Constable's reasoning: "for better or worse, plain text processes that support UTF-8 are going to encounter UTF-8 data beginning with a BOM: learn to live with it!" should also agree that processes that support UTF-8 are going to encounter UTF-8 data that doesn't begin with U+FEFF, and that they should 'learn to live with it.' For what reason wouldn't cl.exe support a switch for encoding?
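As a rough illustration of a reader that lives with both kinds of input, here is a Python sketch. The helper name `read_utf8_text` is hypothetical; the `utf-8-sig` codec it relies on is part of the Python standard library and accepts UTF-8 with or without a leading signature.

```python
def read_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if it is present."""
    # "utf-8-sig" skips a leading EF BB BF on decode and otherwise
    # behaves like plain UTF-8, so both kinds of file decode the same way.
    return data.decode("utf-8-sig")
```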
# John Cowan on 23 Feb 2010 2:17 PM:
To be fair, that language about UTF-8 BOMs wasn't added by the XML Core WG until the Third Edition of 2004, six years after XML first became a W3C Recommendation.
Not feeling as belligerent today as when your previous post came out, I guess.
# Mihai on 23 Feb 2010 3:30 PM:
"Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?"
Ok, I am not happy about Notepad doing that, but I got over it.
I would ask something else though: if UNIX/Linux claims to be Unicode aware at least a little bit (after 19 years of Unicode), why doesn't the shell honor the BOM, using it to detect UTF-8 even if (for some crazy reason) my locale is set to en_US.iso88591?
Any reason to choke on it?
# Cheong on 23 Feb 2010 5:25 PM:
But... for such an old question, people have developed bash/tcsh scripts that perform "replacement of \r\n with \n" and "trim BOM from the beginning of files" in the specified folders. *nix fans should be familiar with the usage of such tool(s).
Why should people complain when handy workarounds are readily available? :O
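The kind of cleanup tool described in this comment can be sketched in Python rather than bash/tcsh. The function names and the default `*.sh` pattern are assumptions made for illustration, not a reference to any existing script.

```python
import codecs
from pathlib import Path

def normalize_script_bytes(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM and convert CRLF line endings to LF."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.replace(b"\r\n", b"\n")

def normalize_folder(folder: str, pattern: str = "*.sh") -> int:
    """Rewrite matching files in `folder` in place; return how many changed."""
    changed = 0
    for path in Path(folder).glob(pattern):
        if not path.is_file():
            continue
        original = path.read_bytes()
        cleaned = normalize_script_bytes(original)
        # Only touch files that actually needed cleaning.
        if cleaned != original:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Working on raw bytes keeps the sketch independent of whatever other characters the script contains.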
# Dranorter on 23 Feb 2010 9:27 PM:
Just happened upon your blog because of some posts about Khmer. (I've always been interested in Unicode and now I'm trying to teach myself a thing or two about how foreign writing systems actually work.) Just thought I'd say, I like what you have to say here and might keep reading. :)
# Doug Ewell on 24 Feb 2010 5:48 AM:
@Seth: "I have to agree that that's probably never going to be fixed"
How would you propose "fixing" it? With the message box in Michael's post?
# Yuhong Bao on 25 Feb 2010 11:00 AM:
What about the Subsystem for UNIX-based Applications?
# Michael S. Kaplan on 25 Feb 2010 6:04 PM:
I have been waiting for a long time to suggest that for a reason -- and no one has, yet!
# Seth on 25 Feb 2010 10:45 PM:
One way to maintain that feature of Notepad without prohibiting the storage of UTF-8 text that doesn't begin with a certain character would be to stick the encoding metadata in a file attribute. Someone else may be able to think of a better way though.
Another possibility I hope to see someday is Windows using a UTF-8 codepage. Michael has said that's not possible, and if that's the case then it's a real shame that we'll never be able to move away from using legacy encodings as the default.
# Michael S. Kaplan on 26 Feb 2010 2:38 AM:
Seth, there is no file attribute that can be supported on every file system that Windows can support as not all of them have such mechanisms.
And a UTF-8 "code page" (which exists now -- 65001) would not change the nature of the problem.
The signature fixes the problem, and since the user who might hit that bug has no workaround while the user of Unix shell scripts has many, the current resolution, determined over a decade ago, can and will stand.
# Seth on 26 Feb 2010 11:48 AM:
I guess I should have specified 'using a UTF-8 codepage _by default_', i.e. set the CP_ACP to UTF-8.
Currently Notepad has the option to save files using the CP_ACP, UTF-16 and UTF-8. As the UTF-8 signature is intended to disambiguate between files specified as UTF-8 and the ACP, if the ACP and UTF-8 options were the same then that dialogue you showed would never be needed even without a disambiguating mark. It seems to me that that does change the nature of the problem.
"The signature fixes the problem, and since the user who might hit that bug has no workaround[1] while the user of Unix shell scripts has many, the current resolution, determined over a decade ago, can and will stand."
Again, I think the solution you offered (STOP USING WINDOWS NOTEPAD!) is perfectly reasonable. As long as that solution is always preferred over, say, changing standards to require parsers to accept a ZWNBS that doesn't fit into their grammar at the beginning of files, then there's little to complain about here. (Though we can still complain about other programs failing to handle UTF-8 that doesn't begin with ZWNBS.)
Re: file attributes
Well, perhaps it's enough if just NTFS supports it. Or maybe someday the file system support could be extended to support attributes on any file system the way some other systems support storing their arbitrary attributes even on e.g. FAT32.
I don't care too much about this, since really I'd like to just get away from legacy encodings, but something like this is necessary to support legacy encodings. Without it you have to put a special signature at the beginning of all text files to identify which encoding, or you have to guess at encoding, or you just have to put up with potential data corruption when a file created on a system with one code page is edited on a system with a different code page.
1. I would suggest that the dialogue box you showed, or better yet, a dialogue that presents the encoding options directly, _is_ a workaround.
# Michael S. Kaplan on 26 Feb 2010 1:27 PM:
The current design of over a decade is the design. There are easily thousands of alternate tools that can be used here if one has other requirements; my advice is to look into one of those (I can't imagine many people holding out for hope of change after waiting over 10 years!)....
# Yuhong Bao on 9 Mar 2010 12:01 AM:
Or you can write a program to truncate the BOM from the beginning of a file. The hard part is that while the Windows API makes it easy to truncate from the *end* of a file, it does not make it easy to truncate from the *beginning* of a file.
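One way around the "no truncate from the beginning" limitation is to shift the file's contents down over the BOM and then truncate from the end. The following Python sketch assumes a seekable regular file; the function name `strip_bom_in_place` and the chunk size are illustrative choices, and the sketch is not crash-safe (an interrupted run can leave a partly shifted file).

```python
import codecs

def strip_bom_in_place(path: str, chunk_size: int = 64 * 1024) -> bool:
    """Remove a leading UTF-8 BOM from `path`; return True if one was removed."""
    bom = codecs.BOM_UTF8
    with open(path, "r+b") as f:
        if f.read(len(bom)) != bom:
            return False  # nothing to do
        read_pos, write_pos = len(bom), 0
        while True:
            # Read the next chunk from after the BOM...
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # ...and write it back len(bom) bytes earlier.
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        # The tail now holds stale bytes; cut it off from the end.
        f.truncate(write_pos)
    return True
```

For small files it is simpler to read everything, drop the first three bytes, and write the file back out.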
# Nemo on 31 Mar 2010 2:30 PM:
Just because you've done it wrong for a decade is no reason not to fix it now... also, while fixing that, how about having notepad recognize Unix-style line endings, like every other editor on the planet?
The correct solution: save UTF-8 as default, use it everywhere in Windows, strongly deprecate legacy encodings, and stop adding a stupid BOM where it doesn't belong.
Personally, of course, I DON'T use notepad. But it's the default text editor in Windows, which means that a lot of other people do use it, which means that I have to deal with what notepad does, and/or patiently explain over and over again why notepad is crap. So telling me "stop using notepad" is not helpful.
# malcontent on 31 Mar 2010 3:40 PM:
Ms has programmers right?
Why can't they just improve it. You know.. Fix it.
Or replace it with something better. Maybe something open sourced.
Or buy any one of ten thousand better editors and rename it to "notepad.exe".
I mean Ms has money right? They have programmers right?
Why can't they do that?
# Michael S. Kaplan on 31 Mar 2010 4:48 PM:
I suppose people could do what I did/do -- work for Microsoft, get access to the source, and build your own after making the changes you want.... :-)
# Kale on 31 Mar 2010 6:11 PM:
One word for everyone: vim
# Cheong on 31 Mar 2010 6:34 PM:
@malcontent: Because Notepad is designed to be a lightweight tool, it can't afford to be too complex.
Your suggestion would be fine if you proposed it as a replacement for WordPad; it just doesn't make sense for Notepad.
# Paul Betts on 31 Mar 2010 6:49 PM:
Re: footnote 9, Vim is of course the tool of choice for editing shell scripts, even though it *does* understand the BOM.
# Robert on 31 Mar 2010 6:54 PM:
If you are creating a script for Unix on Windows using Notepad you are an idiot to the nth degree.
# Miral on 31 Mar 2010 7:12 PM:
I'm probably coming in on this argument late, but why don't people like the BOM?
Personally, I think it should be absolutely mandated that every UTF-8 file must start with a BOM, just like UTF-16 files. Otherwise there isn't any useful way to tell whether any given file is in UTF-8 or ANSI, short of scanning the file and hoping to encounter useful characters you can use to guess correctly (which seems undesirable).
And for the people arguing "just ignore the ANSI files, they're old and crusty" -- do you *know* how many plain ANSI files there are lying around? Do you *really* think it'd be a good idea to render a significant number of them unreadable?
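The guessing described above can be sketched in Python. The function name `guess_text_encoding` and the choice of cp1252 as the "ANSI" fallback are assumptions made for the example; a real system's ANSI code page varies by locale.

```python
import codecs

def guess_text_encoding(data: bytes, ansi_codec: str = "cp1252") -> str:
    """Guess a codec name for `data`: signature first, then a trial decode."""
    # A signature settles the question without reading further.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Without a signature the whole file has to be scanned.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return ansi_codec
    # Valid UTF-8, but this is only a guess: pure ASCII looks the same in
    # both encodings, and some ANSI byte runs happen to be valid UTF-8.
    return "utf-8"
```

With the signature the answer comes from three bytes; without it the function has to read the entire file and can still guess wrong.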
# Stephen Harrison on 31 Mar 2010 7:46 PM:
I recently made the mistake of making a robots.txt file in Visual Studio. Google tools and a robots.txt validator then complained that my first line wasn't valid.
Guess what, the Visual Studio text editor had added the BOM marker, not great for adding something so simple as the robots.txt to a website!
Maybe it's time for me to stop using Visual Studio?
# Michael S. Kaplan on 31 Mar 2010 9:56 PM:
Kind of funny that Google tools can't handle a BOM. I mean, with people like the President of Unicode working for them and all... ;-)
# Michael S. Kaplan on 31 Mar 2010 10:02 PM:
But with that said, Visual Studio gives one the option to save UTF-8 without a BOM, and to change the line endings:
# Sachin on 31 Mar 2010 10:30 PM:
I use VIM...it is the best
# Yuhong Bao on 31 Mar 2010 10:41 PM:
"I suppose people could do what I did/do -- work for Microsoft, get access to the source, and build your own after making the changes you want.... :-)"
Or disassemble Notepad in something like IDA Pro and do binary patching; after all, the code to add the BOM is in Notepad itself.
# Yuhong Bao on 31 Mar 2010 10:51 PM:
In fact, I just did this, and it turned out it would be very easy. I found the SaveFile function in the Vista version of Notepad, and found the WriteFile call that inserts the UTF-8 BOM. You can patch this out with NOPs using a debugger or by binary file editing. Yep, the curse of non open-source software.
# David on 31 Mar 2010 10:55 PM:
To paraphrase the manual, EDIT is the standard editor.
# Michael S. Kaplan on 31 Mar 2010 11:41 PM:
That's okay, Yuhong -- though working for Microsoft gives you better benefits. :-)
# Yuhong Bao on 31 Mar 2010 11:55 PM:
"and found the WriteFile call that inserts the UTF-8 BOM"
BTW, there are several similar WriteFile calls, so you have to patch out the right one. A key clue here is that in the x86 Vista SP2 version it pushes a pointer to the symbol _BOM_UTF8 as a second parameter. If you follow that pointer, you should see the bytes of the UTF-8 BOM.
Yeah, it is an unfortunate curse of non open-source software and the Microsoft development process that a simple option like this would be so hard to add.
# Michael S. Kaplan on 31 Mar 2010 11:57 PM:
Or you can just use WordPad. Or a zillion editors that you don't risk breaking....
# Yuhong Bao on 31 Mar 2010 11:59 PM:
Of course. Not my point, though, I already know this.
# Yuhong Bao on 1 Apr 2010 12:04 AM:
In fact, I am now off to WinDbg disassembling notepad!SaveFile and trying this patch myself.
# Yuhong Bao on 1 Apr 2010 12:08 AM:
Found the right WriteFile call and just used the WinDbg E command to enter NOP (0x90) instructions in place of it. I had to take some time to ensure I entered exactly the right number of NOPs, no fewer and no more.
# Jan on 1 Apr 2010 12:10 AM:
STOP WRITING BLOG POSTS ABOUT PEOPLE COMPLAINING ABOUT NOTEPAD!
# Yuhong Bao on 1 Apr 2010 12:12 AM:
Just opened the test.txt file that was saved as UTF-8 using the in-memory patched Notepad in HexEdit and confirmed that no BOM was inserted.
# maht on 1 Apr 2010 2:55 AM:
When we invented UTF-8 at Bell Labs it didn't have a BOM. It was intentional. As usual, you fools messed it up.
# Davin on 1 Apr 2010 4:09 AM:
My guess? They're using the notepad in Wine for a bit of nostalgia :).
# ReallyEvilCanine on 1 Apr 2010 6:26 AM:
> P.S. Isn't there some tool on UNIX that does this correctly
pico, which is as narrowly focused and limited in scope as Notepad.
There's also MinEd (http://towo.net/mined/features.html) if you need flexibility.
# ScriptErrorRetrying on 1 Apr 2010 9:17 AM:
> P.S. Isn't there some tool on UNIX that does this correctly
Any decent text editor on unix or linux will let you specify the encoding to use when writing files, and will likely default to either ISO8859-1, or UTF-8 WITHOUT stupid BOM.
E.g. GNU Emacs: set-buffer-file-coding-system (C-x RET f) will let you choose from a large range of utf-8 variants (and many other encoding schemes, including various ebcdic variants...). The -with-signature ones write a BOM. The -unix/-dos/-mac ones use LF / CRLF/ CR line endings.
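The same two choices (signature or not, and which line ending) can be shown outside Emacs with a small Python sketch. The helper name `encode_text` and its parameters are invented for this example and are not part of Emacs or of any editor's API.

```python
def encode_text(text: str, signature: bool = False, eol: str = "\n") -> bytes:
    """Encode `text` as UTF-8 with an optional BOM and the chosen line ending."""
    # Normalize any existing line endings to LF first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # "utf-8-sig" prepends EF BB BF on encode; plain "utf-8" does not.
    codec = "utf-8-sig" if signature else "utf-8"
    return normalized.replace("\n", eol).encode(codec)
```

Calling `encode_text(script)` gives the BOM-less, LF-terminated form a Unix shell expects, while `encode_text(script, signature=True, eol="\r\n")` matches what Notepad writes for UTF-8.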
# Shameer on 3 Apr 2010 10:01 AM:
50 years back, these people would have been complaining about 3:2 pulldown.
When something becomes a de-facto standard, you live with it and try to make it work.
2010/08/14 (It wasn't me)