(It wasn't me)

by Michael S. Kaplan, published on 2010/08/14 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/08/14/10050078.aspx


(Excuse the Shaggy reference!)

It wasn't me.

Well, this time it wasn't me.

I mean, yes, it was me in Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!), back in 2005.

And yes again, it was me in Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?, back in 2008.

And yes, it was me yet again in The game is over, people!, earlier this year.

It is all about that UTF-8 BOM thing that so many people hate.

But Murray (this Murray) told me, near the end of the Unicode Technical Committee meeting, that I had my facts wrong about the UTF-8 detection algorithm added in Windows 2000 that is at the heart of Behind 'How to break Windows Notepad'.

According to him, RichEdit was first to do that sort of thing. Back in 1998.

And then Notepad followed suit soon after.

Now the Notepad code has at least one advantage over the RichEdit code -- as The Notepad encoding detection issues keep coming up mentions in its Step 3, Notepad attempts to detect BOM-less UTF-8, which RichEdit does not do. Though unlike Notepad's, RichEdit's detection algorithm is not based on the 1998 RFC, so RichEdit may get to avoid turning supplementary character-based CESU-8 into a big bunch of U+fffd cottage cheese, instead simply not detecting such text as UTF-8 when it has no BOM.
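To make the distinction concrete, here is a minimal Python sketch of the two ideas: checking for the UTF-8 BOM versus sniffing BOM-less text by strict validation. It is not the Notepad or RichEdit code; the function name sniff_utf8 and the sample bytes are hypothetical, chosen only for illustration. The sample is the CESU-8 form of one supplementary character (each UTF-16 surrogate encoded as its own three-byte sequence), which a strict UTF-8 decoder rejects and a lenient one turns into U+fffd.

```python
import codecs

# Hypothetical sample: U+1F600 in CESU-8, i.e. the surrogates D83D and DE00
# each encoded as a separate three-byte sequence.
CESU8_SAMPLE = b"\xed\xa0\xbd\xed\xb8\x80"


def sniff_utf8(data: bytes) -> str:
    """Classify bytes as 'utf-8-bom', 'utf-8' (no BOM, strictly valid), or 'unknown'.

    Illustrative only: this is not how Notepad or RichEdit is implemented.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    try:
        data.decode("utf-8")  # strict decoding rejects encoded surrogates
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


if __name__ == "__main__":
    print(sniff_utf8(codecs.BOM_UTF8 + b"hello"))   # BOM present
    print(sniff_utf8("caf\u00e9".encode("utf-8")))  # valid BOM-less UTF-8
    print(sniff_utf8(CESU8_SAMPLE))                 # CESU-8 is not valid UTF-8
    # A lenient decode yields nothing but replacement characters.
    print(ascii(CESU8_SAMPLE.decode("utf-8", errors="replace")))
```

A detector that refuses the text (the "unknown" branch) leaves the bytes alone, while one that decodes leniently produces the cottage cheese.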

So there is good and bad in there; one can pick one's poison.

Either way, Murray was quite clear on one point: he believed the UTF-8 BOM was not a controversial issue and that it was a documented, expected, and reasonable way to tag UTF-8 text.

So if anyone has not given up and still is unhappy about this, they know it isn't me -- I have never owned Notepad and never changed its UTF-8 detection or conversion code, while Murray has been in the RichEdit code longer than I have been at Microsoft!

It wasn't me. :-)


Mihai on 16 Aug 2010 12:21 PM:

First: I totally understand the "The game is over, people!" :-)

And I think at this point very few people (if any) argue if BOM is a "reasonable way to tag UTF-8 text."

The main complaint is this: if you open a UTF-8 file without BOM, then don't add it.

But I also think that as a developer (my mom does not edit Linux scripts) one should be mature enough to choose the tools he likes. If you don't like Notepad, use something else and stop the BMW (as in "Bitching, Moaning, and Whining" :-)


referenced by

2012/01/23 You can do CESU-8 if you need to; we went in a slightly different direction....
