Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?

by Michael S. Kaplan, published on 2008/03/11 09:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/03/11/8152232.aspx


Please read the disclaimer; the content of Michael Kaplan's blog is not approved by Microsoft!

Everybody hates Microsoft.

Well, not everybody.

But hating Microsoft seems awfully popular....

It seems that to be the best at anything you have to make choices that lots of people won't like. And then, before you know it, people are hating you.

Everyone hates what Microsoft does with the BOM (Byte Order Mark). That thing I talked about in Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!).

Lots of people hate it so much that they will complain about it when it is not completely on topic, like in that other post (unicodeFFFE... is Microsoft off its rocker?).

But I feel I must ask one question.

Why are people writing their UNIX Shell scripts in Notepad such that the issue of Notepad saving the BOM in UTF-8 is such an issue?

I mean, people who are writing UNIX shell scripts are not guaranteed to be among the Microsoft haters, but all things being equal they are probably more likely to be than the people who pay their own fees to go to TechEd or PDC.

So why are they writing their UNIX shell scripts in Windows Notepad, exactly?

I'd just like it if someone could explain this one. It just makes no sense to me....

 

This post brought to you by U+fffe, a permanently reserved code unit in Unicode so that BOM determination can remain easier....


# DJ on 11 Mar 2008 9:51 AM:

This one is simple... a whole lot of folks writing UNIX shell scripts are interfacing with UNIX servers via MS Windows workstations. Rather than use VI/Brief/EMacs etc. they write the scripts using notepad and upload them to the server.  Been there, done that...

DJ

# Cesar Eduardo Barros on 11 Mar 2008 9:58 AM:

It's not just shell scripts, and it's not always notepad.

I recently had to deal with PHP files which had been edited in a text editor which added the so-called UTF-8 BOM to them. PHP is quite transparent, so it happily output the BOM -- and later the script tried to set a header, which was not possible since the output had already started (you can only set the headers before the first output byte, unless you enable output buffering). That particular PHP script needed a specific header value to be output, so it stopped working.
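Cesar's failure mode is easy to screen for before deploying: the so-called UTF-8 BOM is just the three bytes ef bb bf at the start of the file, and PHP will emit them as output before any header() call can run. A quick shell check (file names here are hypothetical):

```shell
# Hypothetical files: one saved by a BOM-adding editor, one clean.
printf '\357\273\277<?php header("X-Example: 1"); ?>\n' > bom.php
printf '<?php header("X-Example: 1"); ?>\n' > clean.php

# A file whose first three bytes are ef bb bf carries the UTF-8 BOM.
for f in bom.php clean.php; do
  lead=$(head -c 3 "$f" | od -An -tx1 | tr -d ' ')
  if [ "$lead" = "efbbbf" ]; then
    echo "$f: starts with a BOM"
  else
    echo "$f: clean"
  fi
done
```

Running a loop like this over a web tree before upload catches exactly the "headers already sent" surprise described above.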

Shell files are just a particularly troublesome instance of the problem (not only does the kernel look for the magic value in the first two *bytes* of the file, but a stray \r before the final \n on the line is also included in the command line -- often causing strange error messages).
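The kernel's check is literal: an executable script must begin with the two bytes 23 21 ("#!") at offset zero, and a BOM pushes them out of position. A sketch (file names hypothetical):

```shell
# The same two-line script, with and without a leading UTF-8 BOM.
printf '#!/bin/sh\necho ok\n' > good.sh
printf '\357\273\277#!/bin/sh\necho ok\n' > bad.sh

head -c 2 good.sh | od -An -tx1   # 23 21 -- the exec loader finds "#!"
head -c 2 bad.sh  | od -An -tx1   # ef bb -- no interpreter line at offset zero
```

With the BOM in front, the kernel never sees a shebang, so execve() refuses the file and whatever fallback runs it next chokes on the stray bytes.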

The main cause of the problem is that the so-called UTF-8 BOM breaks the very useful property of UTF-8 that, if your text can be represented as pure 7-bit ASCII, its UTF-8 representation will be bytewise identical. A lot of Unix tools depend on that property (which is also true for several other character encodings).
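That ASCII-identity property can be verified directly: re-encoding pure 7-bit text as UTF-8 is a byte-for-byte no-op (charset names as accepted by GNU iconv):

```shell
# Pure 7-bit ASCII text...
printf 'hello, world\n' > ascii.txt
# ...re-encoded as UTF-8 produces the identical byte sequence.
iconv -f ASCII -t UTF-8 ascii.txt > utf8.txt
cmp -s ascii.txt utf8.txt && echo "bytewise identical"
```

A BOM-prefixed "UTF-8" file breaks this: its first three bytes differ from the ASCII original even when every character in it is ASCII.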

# Michael S. Kaplan on 11 Mar 2008 10:12 AM:

The reality is that the UTF-8 BOM (which is not "so-called" -- it IS one, and is described in the standard) exists and is the only way to distinguish UTF-8 from ASCII -- so if one does not like it, one should use another editor?

Or another OS that deals with things as they are instead of things as they were? :-)

# Cesar Eduardo Barros on 11 Mar 2008 10:44 AM:

Unless the text file only has 7-bit characters (in which case it makes no difference), it's very easy to distinguish UTF-8 from ASCII: UTF-8 has bytes with the eighth bit set :-)

Most Unix tools are "eight-bit transparent": they don't care about which character encoding you are using, they simply pass most bytes unchanged (the exception is the byte values they care about, which are almost always in the ASCII range; for a trivial example, the filesystem and filesystem-related tools only care about '\0', '/', and '.'). This is how they were able to use an encoding (UTF-8) which didn't exist when they were designed, as long as the encoding is ASCII-compatible (UTF-8 was designed to be ASCII-compatible from the beginning).
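That eighth-bit test can be performed with any eight-bit-clean tool; for instance, deleting every 7-bit byte with tr and counting what remains:

```shell
printf 'cafe\n' > ascii.txt
printf 'caf\303\251\n' > utf8.txt   # "café": U+00E9 encodes as c3 a9

count_high_bytes() {
  # Delete all 7-bit bytes; anything remaining has the eighth bit set.
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c
}
count_high_bytes ascii.txt   # 0: indistinguishable from its UTF-8 form
count_high_bytes utf8.txt    # 2: the two bytes of the multibyte sequence
```

A count of zero means the file is simultaneously valid ASCII and valid UTF-8, which is exactly why no marker is needed to tell the two apart.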

To these tools, the UTF-8 BOM is just another charset-specific sequence of characters to be passed unchanged. This breaks when the context doesn't accept any extra character (like on the first line of a shell script), or when the presence of a non-whitespace (ASCII whitespace, that is) character makes a difference (the PHP example is one case where any character, even an ASCII whitespace character, would break). The UTF-8 BOM, being invisible "junk", is not noticed by the person editing the file, but is noticed by these programs, causing the breakage.

Unix tools don't add a UTF-8 BOM, and either are charset-agnostic (this is the case with the shell and AFAIK also with PHP), or use the current encoding (LC_CTYPE, nowadays UTF-8 by default on most distributions), or are able to autodetect defaulting to the current encoding (this is the case mostly for text editors, like vim). It's the charset-agnostic ones who break with the UTF-8 BOM (they interpret it as valid data, not some odd sort of embedded metadata).

# Michael S. Kaplan on 11 Mar 2008 11:07 AM:

    Unless the text file only has 7-bit characters (in which case it makes no difference)

Actually, it does.

If the user says that they want to save a file as UTF-8, then it makes sense to remember that fact. It is much friendlier than forgetting what they did!

If you want to keep things in ASCII, that works too - just don't save it as UTF-8 and it works just fine. :-)

It takes an overt act to break the shell script -- you have to explicitly choose an encoding that will do so....

# Andrew Cook on 11 Mar 2008 11:17 AM:

Do the Microsoft Interix (Services for UNIX before Vista, something else for Vista and later) tools play nice with the BOM?

# Michael S. Kaplan on 11 Mar 2008 11:21 AM:

Good question -- I am not sure (I have only ever had them installed briefly, for the irony of the name and to look at the code pages added thereby).

# John Cowan on 11 Mar 2008 11:53 AM:

I couldn't disagree with you more on this one, Michael.

It's true that the 8-BOM is now part of the standard, and it was even necessary for the XML Core WG to backpatch XML 1.0 to accept 8-BOMs after it became clear that they *would* appear in XML files, will we, nill we. But it's still a gratuitous incompatibility between full UTF-8 applications and applications that can simply be ASCII-aware as long as they are 8-bit clean.

In the research OS "Plan 9 from Bell Labs", for which UTF-8 was actually designed, there were 170 command-line programs packaged with it at the time of the conversion to UTF-8, from simple utilities to compilers and interpreters.  Only 23 of these needed to be made UTF-8 aware in the sense above; the rest could treat all their string inputs and outputs as 8-bit vectors or streams, assuming only ASCII.

Here's a nice quote from Rob Pike and Ken Thompson's paper on the UTF-8 conversion:

The Unicode Standard [as it then was] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines [that is, using diverse operating systems] by different manufacturers, it is impossible.

Cesar: actually, Posix filesystems don't care about '.' at all.  The specific *names* "." and ".." are reserved, that's all.  In fact, although we all think of filenames as strings, use them as strings, refer to them with strings, the truth is that Posix filenames are byte vectors with content restrictions, and Windows filenames are 16-bit-code-unit vectors with different content restrictions.

# Maurits [MSFT] on 11 Mar 2008 12:00 PM:

It shouldn't be /that/ hard to get (UNIX shell of your choice) to be BOM-aware.

And I must admit that I've written my share of UNIX-intended text files in Notepad.

# Michael S. Kaplan on 11 Mar 2008 12:22 PM:

    The Unicode Consortium was thinking of files, not pipes.

Actually, with the UTF-8 BOM, they are still clearly thinking about files. As is Notepad -- so they all have something in common!

Though perhaps a note could be added to the help file to explain that Notepad is not "pipe-safe" ? :-)

# John Cowan on 11 Mar 2008 12:43 PM:

Maurits: The point is that plain text in Unix (using the term "Unix" generically, of course) is a universal representation: absent a compelling reason to do otherwise, everything is represented as text.  So it's not about fixing one particular shell: it's about making *every* program encoding-aware even when it's completely unnecessary.

Disclaimer: When I'm stuck with using Windows, I install Cygwin first thing and then live in it as much as possible.  I also added BOM-stripping to the text conversion utility (dos2unix) that changes CR+LF pairs to just LFs.
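A minimal sketch of that dos2unix-style cleanup (BOM stripping plus CRLF conversion), building the BOM and CR bytes with printf so the sed expression stays plain; file names are hypothetical:

```shell
# A Notepad-style file: UTF-8 BOM up front, CRLF line endings throughout.
printf '\357\273\277#!/bin/sh\r\necho hello\r\n' > windows.txt

BOM=$(printf '\357\273\277')
CR=$(printf '\r')
# Strip a BOM at the start of line 1 and a trailing CR on every line.
LC_ALL=C sed -e "1s/^$BOM//" -e "s/$CR\$//" windows.txt > unix.txt

head -c 2 unix.txt | od -An -tx1   # 23 21 -- "#!" is back at offset zero
```

After the rewrite the shebang is the first thing in the file again, so the kernel's interpreter check succeeds.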

# Michael S. Kaplan on 11 Mar 2008 3:41 PM:

Well, we take this all another way -- if you are moving to another platform, then there are a bunch of things you have to change, to fit in well with that platform -- from CRLF -> LF conversions to UTF-8 BOM prefix stripping to Unicode normalization for tools that do not understand canonical equivalence, and so on. If you are not willing to do these things, then it is a self-imposed bug in the process of the person working cross-platform without being willing to understand the full requirements of doing so. :-)

With that said, I have coded the change to add a "BOM-less UTF-8" save option three times over the last seven years, and the option for "CR-less new lines" twice, each time forwarding it to the owners of Notepad at the time. In every case the code was not integrated into the product, as neither change targets a core scenario for NOTEPAD.EXE of significant enough importance to merit the test, UA/UE, localization, and servicing costs thereof....

# Maurits [MSFT] on 11 Mar 2008 4:37 PM:

    > everything is represented as text

Yeah, but BOM is precisely intended to disambiguate text.

    > it's about making *every* program encoding-aware even when it's completely unnecessary.

The task of making *every* program encoding-aware is not as complicated as you imply.  It probably suffices to make a few file-reading libraries encoding-aware, at least in the 99% case of ASCII-only text files.

# Cesar Eduardo Barros on 11 Mar 2008 7:54 PM:

John Cowan:

I mentioned filesystem and filesystem-related tools; ls, for instance, hides files starting with a '.' by default, and several programs use file extensions (separated by a '.') to guess the file type when not told otherwise. This together with the two special directory entries is enough to make '.' also a significant character (together with '/' and '\0').

Maurits:

> It probably suffices to make a few file-reading libraries encoding-aware, at least in the 99% case of ASCII-only text files.

I'd say in 99% of the problematic cases the file-reading library is either the C library's stdio or the POSIX lower-level functions (open(), read(), write(), close(), ...). They are used both for text files and for binary files (which must not be converted). For fopen(), there's a mode flag, but SUSv3 says it "shall have no effect". For open(), there's no mode flag at all (O_BINARY and O_TEXT seem to be a Microsoft extension).

After you add a mode flag to open() (and either change the kernel or make open() no longer be a thin wrapper around the system call), you still have to choose either text or binary mode on each file-opening call of each program which reads or writes a file. Sometimes the program cannot determine this by itself, and would need new command line switches (for instance, consider cat(1) being used to concatenate two files: if the second file is a text file, it should strip the UTF-8 BOM from it; if the second file is not a text file, it must pass the data unchanged, even if it looks like a BOM). For backwards compatibility and to avoid accidental data corruption, these switches would all have to default to "binary" mode. Since fopen() defaults to text mode, every program must be audited to add the binary-mode flag unless it really wants text mode (the flag isn't required by SUSv3, since it makes no difference there). In the end, it's as much work as making every program fully encoding-aware.

There's also the question of what to do if a file is *both* a text file and a binary file. These do exist; the Sun Java self-extracting installer is an example (a shell script concatenated with an ELF binary). For these, opening as a text file would be wrong, but they cannot easily be identified (and if you open all shell scripts as text and try to convert the text, scripts like that Java installer will stop working).
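Those dual-nature files are simple to construct; the pattern (a sketch, not the actual Sun layout) is a fixed-length shell header that pipes its own binary tail into tar:

```shell
# Some payload, packed into a tarball.
echo "payload data" > data.txt
tar czf payload.tar.gz data.txt

# A four-line shell header: everything after its 4th newline is the archive.
cat > header.sh <<'EOF'
#!/bin/sh
# Self-extractor: the compressed payload starts at line 5 of this file.
tail -n +5 "$0" | tar xzf -
exit 0
EOF

cat header.sh payload.tar.gz > installer.sh   # text head, binary tail

rm data.txt
sh installer.sh    # re-extracts data.txt from the binary tail
```

Any text-mode rewrite of installer.sh (BOM insertion, CRLF conversion) would corrupt the gzip bytes, which is exactly why such a file must be treated as binary even though it begins as a shell script.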

All this for a problem which wouldn't exist if the so-called UTF-8 BOM didn't exist (why is it called a "byte order mark" if it isn't marking any byte order?). Differences in Unicode normalization (or the lack of it), or even extraneous CR characters (which most programs pass unchanged or discard; the kernel script loader is one exception), don't cause so many problems.

Michael:

Back to the original question, the reason shell scripts are ever edited in Notepad is that Windows is too common (meaning even hardcore Unix users end up having to use it sometimes), and the only editors guaranteed to be on a Windows machine are Notepad and WordPad, and I'm not sure about the latter.

This is not about hating Microsoft; it's about hating one particular bad technical decision (the UTF-8 BOM, in this case), and the way it spills over into unrelated systems.

# Michael S. Kaplan on 12 Mar 2008 2:06 AM:

Sorry, I can't agree here. Notepad is supporting the scenario it needs, and not supporting a scenario it never agreed to -- so it behaves as designed to the betterment of those it was designed for.

The UNIX shell script scenario on Windows? The "text file and a binary file" being created in Notepad? Way out of scope here....

# Dave Amer on 12 Mar 2008 7:24 AM:

I don't understand where the problem comes from.

If a program is Unicode aware, then the only time it takes any notice of the BOM is when it is trying to work out how the string was stored on disk. If the program is not Unicode aware, why did you save the file as UTF-8?

Any program that doesn't care about encoding will not even notice the BOM anyway, since it is simply passing the contents of a file around.

If you wanted a plain text file, why did you ask for a UTF-8 encoded one? To me it sounds more like a PICNIC (Problem In Chair Not In Computer) type problem than anything else.

Like I said at the start, I may have missed something, but it wouldn't surprise me if it's just people bitching at Microsoft for their own mistakes.

Saying that something is UTF-8 compatible so long as the first few bytes aren't UTF-8 is basically lying. What should be said is "it works, but it's a bit of a bodge, and breaks unless the data has been specifically crafted in such a way that the bodge doesn't fail. So remember to keep the bodge in mind.... or else!"

# Michael S. Kaplan on 12 Mar 2008 7:35 AM:

Actually, Unix folks do complain about this all the time; it isn't Microsoft people.

But I think you are right, and it is still requiring an overt act (on the part of the person changing the encoding in the Save As dialog), and it is therefore pilot error....

# Dave Amer on 12 Mar 2008 7:56 AM:

I think in summary my point could be expressed as: it is not UTF-8 compatible if it can't handle all legal UTF-8 strings.

If it can't, then fix the app rather than blaming the perfectly legal data :). If that means you need to maintain something that has been hobbling along for decades, then so be it.

Also, when I said "bitching at Microsoft" I probably should have said "bitching to Microsoft" or "bitching about Microsoft". I see where the confusion came from; I was trying to say that people seem all too ready to blame Microsoft for their own mistakes (it appears this is especially the case in the open source and Apple communities). My fault, and for that I apologise :).

# Michael S. Kaplan on 12 Mar 2008 8:14 AM:

Agree++ and no worries -- it is in fact the reason I wrote this [very direct] post. :-)

# Mihai on 12 Mar 2008 10:04 AM:

>>Actually, Unix folks do complain about this all the time; it isn't Microsoft people.

I am not a Unix guy, but it still bothers me that there is no plain text editor in Windows. Notepad comes close, but then it tries to play smart by adding a BOM to something that did not have one.

I hate applications that think they know better than me and I would prefer a "pipe safe" behavior.

# Dave Amer on 12 Mar 2008 11:52 AM:

If you wanted plain text you should not have asked for UTF-8. UTF-8 is not plain text, simple as that.

What you ended up with is perfectly valid, explicitly marked UTF-8. Exactly what you asked for. Unfortunately, or probably fortunately*, computers can't read minds at this stage of their development.

*Presumably if they could read minds they would probably be capable of independent thought. I have a few ideas about what a computer might think about people who ask for one thing but want another and blame the computer for what they end up with. I suggest you try asking for something in a fast food restaurant and then complaining that deep down you really wanted something else even though you didn't say so. I also suggest the results of such an experiment be posted online somewhere for everyone to see (with photos of the contents of your replacement order, of course). :)

# Günther on 12 Mar 2008 3:40 PM:

One of the most compelling features of UTF-8 is backwards compatibility with old programs. The additional character at the start throws that away. Suddenly, UTF-8 is no better than UTF-16 or UTF-32 or something. So why have the option at all? To save some space? I don't think Notepad is designed for really big files for which the difference matters.

That said, back when I still used Windows enough to end up having to edit a UTF-8 file, I had SciTE installed and did not use Notepad. But I will probably have to add BOM-stripping to a game I work on, because a lot of the game's content authors use Notepad...

# Mihai on 13 Mar 2008 10:56 AM:

<<If you wanted plain text you should not have asked for UTF8. UTF8 is not plain text, simple as that.>>

Unicode (and utf-8) is all about plain text (http://unicode.org/glossary/#plain_text)

Or, to quote someone else <<All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs.>>

(http://www.joelonsoftware.com/articles/Unicode.html)

# Scott Atwood on 13 Mar 2008 2:36 PM:

I agree with Mihai. UTF-8 is plain text. Microsoft's use of the BOM in Notepad is just a very clever hack to deal with the problem that, without external metadata, it can be difficult or impossible to identify which plain text encoding you mean. In the case of Notepad, they simply put the encoding metadata in the file itself if it is any flavor of Unicode.

# Mihai on 14 Mar 2008 4:41 AM:

I agree the BOM in Notepad is OK sometimes.

If I start from scratch, or from UTF-16LE/BE, or from an ANSI file, and I say "Save as UTF-8" -- fine, do whatever.

But if I open a file with no BOM, which Notepad detects as UTF-8, change one character and save, I don't want Notepad adding the BOM.

The BOM was not there, Notepad was able to detect UTF-8, so why mess with it?

# Seth on 22 Feb 2010 12:26 PM:

This is an old post, but I still want to comment:

"the UTF-8 BOM [...] is the only way to distinguish UTF-8 from ASCII"

Distinguishing between the two is pointless. If a program understands UTF-8 then it doesn't need any special ASCII mode and therefore doesn't need to distinguish. If a program does not understand UTF-8 then it's not able to understand the BOM and therefore isn't able to distinguish. The program would have to be taught at least enough about UTF-8 to read the BOM. However, most of such programs are legacy and won't be updated to take advantage of a distinguishing mark.

# Michael S. Kaplan on 22 Feb 2010 12:40 PM:

Um, as a Notepad feature, it is NOT pointless. Perhaps people could do less UNIX shell script authoring in Notepad? :-)

# Seth on 22 Feb 2010 7:14 PM:

Even just as a Notepad feature the UTF-8 BOM seems dubious at best. The argument seems to be that the UTF-8 BOM helps prevent users from being confused when they open files previously saved as UTF-8, but which contain only ASCII characters, and see the file marked as being in a legacy encoding. This reasoning seems specious, since in retrospect it doesn't seem to have held on any of the other platforms. It'd be interesting to know if this reasoning originated in some programmer's gut or if Microsoft actually had significant real-world data at the time they were implementing UTF-8 in Notepad.

Then there are the downsides. The feature inevitably escapes the domain where it acts as a Notepad feature. It manufactures tricky questions, like when to preserve or not preserve a BOM found in a byte stream. Standards have to be redefined, as John Cowan mentions of XML. Far more time has been spent arguing about UTF-8 BOMs than ever would have been spent by users confused over why they have to re-select the encoding for their files. Notepad fails to meet the needs of what is apparently, according to your comments, one of its major customer demographics. ; )

# Michael S. Kaplan on 22 Feb 2010 7:55 PM:

Actually, it is used by Visual Studio, the C/C++ compiler, and many other apps that want to make the encoding easier to detect without having to look ahead in the file.

There are even apps like FrontPage which can handle either.

Almost all apps outside of the niche Unix scenario work just fine, so they should STOP USING NOTEPAD!!!

# Seth on 23 Feb 2010 9:03 AM:

I think the same reasoning I applied to Notepad also applies to other applications. And even if it didn't hold, and UTF-8 BOM was actually useful in some circumstances, that doesn't answer any of the other criticisms.

"Almost all apps [...] work just fine,"

Arguing that UTF-8 BOM breaks almost no apps doesn't inspire confidence that this was the right decision back when it was made.

# Michael S. Kaplan on 23 Feb 2010 9:12 AM:

I don't have to answer every point because it is a 10-year-old argument that you have lost. :-)

See the latest blog put there today. It is time to move on.

AND STOP USING NOTEPAD.

I do not understand why anyone would use an application with so many potential replacements that they believe has been broken for 10 years.

It IS the right solution....just not for you.

Move on. Please....

# Apphacker on 31 Mar 2010 7:32 PM:

Wow, someone really messed up when they decided to have Notepad add a BOM to the beginning of a file...

# Watches on 31 Mar 2010 9:56 PM:

Notepad should be phased out of Windows.


referenced by

2010/08/14 (It wasn't me)

2010/02/23 The game is over, people!
