The Notepad encoding detection issues keep coming up

by Michael S. Kaplan, published on 2007/04/22 21:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/22/2239345.aspx


A few days ago, Raymond was talking about the Notepad file encoding problem, again. And the comments were pretty funny, like watching a traffic accident as people started going off the rails in all kinds of directions.

For the record, here is the official, UNDOCUMENTED, Notepad encoding detection story, only mildly changed from Windows 2000 Beta 2 through now (into Longhorn Server, which hasn't shipped yet):

  1. Check the first two bytes;
    1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
    2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
    3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;
  2. Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE; if so, then treat it (and load it) as a "Unicode" file;
  3. Check to see if it is UTF-8 using the original RFC 2279 definition from 1998, and if it is, then treat it (and load it) as a "UTF-8" file;
  4. Assume an ANSI file using the default system code page of the machine.
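
The order of those checks can be sketched in a few lines of Python. This is only an illustration of the sequence listed above, not Notepad's actual code. The names `detect_encoding`, `crude_utf16le_guess` and `looks_like_rfc2279_utf8` are made up for the example. The UTF-16 guess is a crude stand-in for IsTextUnicode (which is a Windows API and does considerably more), and the choice to fall through to ANSI for pure ASCII is an assumption.

```python
import codecs


def crude_utf16le_guess(data: bytes) -> bool:
    """Hypothetical stand-in for IsTextUnicode, NOT its real logic:
    even length, and at least half of the high-order bytes are zero."""
    if len(data) < 2 or len(data) % 2:
        return False
    high_bytes = data[1::2]
    return high_bytes.count(0) * 2 >= len(high_bytes)


def looks_like_rfc2279_utf8(data: bytes) -> bool:
    """Structural check in the spirit of RFC 2279: a lead byte (up to the
    six-byte form) followed by the right number of 10xxxxxx bytes."""
    i, saw_multibyte = 0, False
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            trail = 0
        elif 0xC0 <= lead <= 0xDF:
            trail = 1
        elif 0xE0 <= lead <= 0xEF:
            trail = 2
        elif 0xF0 <= lead <= 0xF7:
            trail = 3
        elif 0xF8 <= lead <= 0xFB:
            trail = 4
        elif 0xFC <= lead <= 0xFD:
            trail = 5
        else:
            return False  # stray continuation byte, or 0xFE / 0xFF
        tail = data[i + 1 : i + 1 + trail]
        if len(tail) != trail or any((b & 0xC0) != 0x80 for b in tail):
            return False
        saw_multibyte = saw_multibyte or trail > 0
        i += 1 + trail
    # Assumption: pure ASCII is left to the ANSI fallback.
    return saw_multibyte


def detect_encoding(data: bytes, is_text_unicode=crude_utf16le_guess) -> str:
    """Return the Notepad-style label chosen by the four steps, in order."""
    # Step 1: look at the first two bytes (and a third for UTF-8).
    if data[:2] == codecs.BOM_UTF16_LE:
        return "Unicode"
    if data[:2] == codecs.BOM_UTF16_BE:
        return "Unicode (Big Endian)"
    if data[:2] == codecs.BOM_UTF8[:2] and data[2:3] == codecs.BOM_UTF8[2:]:
        return "UTF-8"
    # Step 2: ask the heuristic about BOM-less UTF-16 LE.
    if is_text_unicode(data):
        return "Unicode"
    # Step 3: old-style UTF-8 check.
    if looks_like_rfc2279_utf8(data):
        return "UTF-8"
    # Step 4: give up and call it ANSI (default system code page).
    return "ANSI"
```

On Windows, the `is_text_unicode` parameter is where a real call to IsTextUnicode could be plugged in; everywhere else the stand-in just keeps the sketch runnable.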

Now note that there are some holes here, like the fact that step 2 does not do quite as well with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so it's a bug in Notepad beyond any bug in IsTextUnicode).

And frankly, if people were happy with the IsTextUnicode behavior in general, or with small files in particular, then the big hubbub I mentioned here and here wouldn't have been such a mini-phenomenon (as if people needed Notepad to comment on whether Bush hid the facts or not!).
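
For anyone who missed the "Bush hid the facts" episode, here is a small Python illustration of why a short file like that is ambiguous. It does not call IsTextUnicode or reproduce its statistics; it only shows that the same 18 bytes decode cleanly both ways. The variable names are just for the example, and `cp1252` stands in for whatever the default system code page happens to be.

```python
data = b"Bush hid the facts"  # 18 ASCII bytes, no BOM

# Read as ANSI, the bytes are the sentence itself.
as_ansi = data.decode("cp1252")

# Read as UTF-16 LE, each pair of bytes becomes one code unit,
# and every pair here happens to land in the CJK ideograph range.
as_utf16 = data.decode("utf-16-le")

print(as_ansi)
print([f"U+{ord(ch):04X}" for ch in as_utf16])
```

Both decodes succeed without error, so any detector has to guess, and with this little data a guess is all it is.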

But then again I already mentioned I don't like IsTextUnicode, for roughly some of the same reasons that the whole Notepad "detection" thing is a pain.

I also don't like step 3 above, either -- the code may be fast but it also is way behind the current algorithm used by MultiByteToWideChar, which has done a pretty good job keeping up with the ever-changing Unicode conformance guidelines. I still haven't gotten my head around what it means for a file to meet the 1998 guidelines but not the latest UTF-8 conformance rules. Probably a lot of U+FFFD characters in the future, UTF-8 style (EF BF BD).
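
To make the gap concrete, here is a small Python sketch. RFC 2279 still described sequences of up to six bytes, so the five-byte form below was a legitimate encoding of U+200000 under the 1998 definition. Python's built-in decoder is used only as a stand-in for "a decoder that follows the current rules"; it is not MultiByteToWideChar, and the helper name `decodes_strictly` is made up for the example.

```python
# U+200000 in the five-byte form that RFC 2279 still described.
legacy = b"\xf8\x88\x80\x80\x80"


def decodes_strictly(data: bytes) -> bool:
    """True if a current-rules UTF-8 decoder accepts the bytes as-is."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A strict decoder refuses the sequence outright...
print(decodes_strictly(legacy))

# ...and a lenient one substitutes U+FFFD for what it cannot decode,
# which shows up as EF BF BD once the text is saved as UTF-8 again.
replaced = legacy.decode("utf-8", errors="replace")
print(replaced.encode("utf-8").hex(" "))
```

So a file can pass an old-style structural check as "UTF-8" and still come out of a modern conversion as a run of replacement characters.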

But in the end I think it is unfair to pick on Notepad here. IsTextUnicode needs to be updated, as I said over two years ago here, and then after that is done someone needs to go update Notepad to use the new detection stuff that is added.

In the meantime folks should not be so busy complaining about stuff before they understand it; as the above makes clear there is plenty of material to complain about accurately, later. :-)

 

This post brought to you by (U+fffd, a.k.a. REPLACEMENT CHARACTER)


# Brian on 23 Apr 2007 4:54 AM:

I would really like to see the Unicode names that Notepad uses clarified in a future version. Particularly confusing is "Unicode". At the very least, I suggest renaming this to "Unicode (Little Endian)"; however, "UTF-16 LE" would be better. Same for "Unicode big endian". UTF-16 BE or UTF-32 BE? I know the answer, but not from looking at that dialog. It already lists "UTF-8", so it would make things more consistent as well. These are very minor changes and should not affect any code.

# Ben Bryant on 23 Apr 2007 7:27 AM:

Do you know if step 3 is done against the whole file or the first 256 chars or something like that?

# Michael S. Kaplan on 23 Apr 2007 9:17 AM:

Hi Ben,

I believe step #3 is done against the entire file, which is not what happens with step #2 since the function itself limits how many bytes it looks at....

CORRECTION -- I looked at the code a little closer -- it reads the first 1024 bytes to try to make the determination.

# Michael S. Kaplan on 23 Apr 2007 9:21 AM:

Hi Brian,

Although this would be better for people who understand things about Unicode, it is actually worse, and less understandable, for the majority of people. And it's not like the first group can't easily see what each one means just by process of elimination (or by looking in help!), which makes change even less likely.

# Shailesh on 7 Jun 2010 7:41 AM:

I couldn't understand your statement related to EF BF BD.

Post .NET 2.0, I have seen that StreamReader converts all invalid characters into the EF BF BD sequence. (I couldn't find any official documentation about this though.)

Are you talking about this?


referenced by

2010/08/20 The song^H^H^H^Hbug remains the same

2010/08/14 (It wasn't me)

2008/03/25 Bush might've still hid the facts, but he can't hide them from Vista SP1/Server 2008 Notepad!
