Behind 'How to break Windows Notepad'

by Michael S. Kaplan, published on 2006/06/14 11:47 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/14/631016.aspx


Larry Osterman pointed me at an article entitled How to break Windows Notepad that makes for an interesting experiment:

Here's how to do it:
1. Open up Notepad (not Wordpad, not Word or any other word processor)
2. Type in this sentence exactly (without quotes): "this app can break"
3. Save the file to your hard drive.
4. Close Notepad
5. Open the saved file by double clicking it.

Instead of seeing your sentence, you should see a series of squares. For whatever reason, Notepad can't figure out what to do with that series of characters and breaks.

Now if you have East Asian language support installed, instead of seeing squares (NULL glyphs), you will see:

桴獩愠灰挠湡戠敲歡

And if you look at the code points under those characters, you will likely see what happened:

6874 7369 6120 7070 6320 6e61 6220 6572 6b61

Ah, each pair of bytes is two ASCII letters that, when read together as one little-endian UTF-16 code unit, just so happens to line up with a CJK ideograph!
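To see the coincidence without Notepad, here is a minimal Python sketch. It assumes the file was saved as plain single-byte text with no BOM, and that the misreading is equivalent to decoding those same bytes as little-endian UTF-16; the variable names are just for illustration.

```python
# The bytes Notepad writes for the sentence when saved as plain (ANSI) text.
ascii_bytes = "this app can break".encode("ascii")

# Reinterpret the same 18 bytes as nine little-endian UTF-16 code units.
misread = ascii_bytes.decode("utf-16-le")

# Hex code points of the misread text, one per pair of original letters.
code_points = [f"{ord(ch):04x}" for ch in misread]

# True when every code point falls in the CJK Unified Ideographs block.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)

print(misread)
print(" ".join(code_points))  # 6874 7369 6120 7070 6320 6e61 6220 6572 6b61
print(all_cjk)                # True
```

Each code unit puts the second letter of the pair in the high byte, which is why "th" (0x74, 0x68) becomes U+6874.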

I have talked about the encoding detection mechanisms that Notepad uses recently, and this is another example of the problem, one that is more fun since the repro steps are so much fun (in fact the only improvement would be text insulting Microsoft or one of its rivals, which Notepad appears to censor in an example of a big bad monopoly, etc.!).

Now I have pointed out in the past that I do not like the IsTextUnicode function, and I suppose this could be considered a good reason (IsTextUnicode returns TRUE here, which is why Notepad guesses as it does).

 

This post brought to you by 桴 (U+6874, a CJK ideograph)


# TP on 14 Jun 2006 12:07 PM:

Cool post man. I like it...

# Lionel Fourquaux on 14 Jun 2006 12:47 PM:

Maybe Notepad should offer an option to bypass encoding detection?

# Maurits on 14 Jun 2006 3:30 PM:

select cast(0x687473696120707063206e61622065726b61 as varchar)

htsia ppc nab erka

# Maurits on 14 Jun 2006 3:34 PM:

Note that the lengths of the words are reversed:

Original: 4 3 3 5
Changed: 5 3 3 4

Ignoring spaces, note that the characters within each word are scrambled, but the order of the words themselves remains the same

thisappcanbreak
htsiappcnaberka

A question to ponder... did the two "p"s switch places?

# Michael S. Kaplan on 14 Jun 2006 3:47 PM:

That's just endian-ness, I have talked about that before. :-)

# Michael S. Kaplan on 14 Jun 2006 3:49 PM:

Hi Lionel -- well, I'd personally prefer if they added "BOM-free UTF-8 support" as a save option prior to "no detection" as a load option. :-)

# Maurits [MSFT] on 14 Jun 2006 3:52 PM:

> did the two "ps" switch places?

Yup, I think they did.  The scrambling is simple:

Scramble each word individually

To scramble a word, start at the end.
Switch the last two letters of the word.
Switch the previous two letters of the word.
Keep switching letter pairs until you have either scrambled the whole word (if there were an even number of letters) or there's a single letter left.

# Maurits [MSFT] on 14 Jun 2006 3:54 PM:

> That's just endian-ness

Oh, duh.

Switch byte-pairs:

select cast(0x74686973206170702063616e20627265616b as varchar(20))

this app can break
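The same byte-pair switch can be sketched in Python. The helper name swap_byte_pairs is made up for this illustration; the hex string is the one from the first query above.

```python
def swap_byte_pairs(data: bytes) -> bytes:
    """Exchange the two bytes of every 16-bit unit (an endianness flip)."""
    if len(data) % 2:
        raise ValueError("need an even number of bytes")
    swapped = bytearray(data)
    # Even positions take the old odd bytes and vice versa.
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)

# The big-endian dump of the code points, as fed to the first SQL query.
scrambled = bytes.fromhex("687473696120707063206e61622065726b61")

print(scrambled.decode("ascii"))                   # htsia ppc nab erka
print(swap_byte_pairs(scrambled).decode("ascii"))  # this app can break
```

Applying the swap twice returns the original bytes, and it also answers the earlier question: the two "p"s do trade places, they just look the same afterwards.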

# Maurits [MSFT] on 14 Jun 2006 4:04 PM:

This still leaves open the question of why IsTextUnicode("this app can break") == TRUE -- looks like ASCII to me.  Maybe some of the component tests will reveal a clue.

# Maurits [MSFT] on 14 Jun 2006 5:12 PM:

Huh... I don't have an IS_TEXT_UNICODE_BUFFER_TOO_SMALL test on my W2K system.  Do I have to include something special to get it?  I've got all the others.

Barring that, the only tests that fire for that string are:
IS_TEXT_UNICODE_STATISTICS
IS_TEXT_UNICODE_UNICODE_MASK

# Maurits [MSFT] on 14 Jun 2006 6:30 PM:

Looks like wine couldn't find the IS_TEXT_UNICODE_BUFFER_TOO_SMALL test either!

http://cvs.winehq.org/patch.py?id=17837
/* FIXME: MSDN documents IS_TEXT_UNICODE_BUFFER_TOO_SMALL but there is no such thing... */

Those wacky docs guys, always making up flags ;)

# Dean Harding on 14 Jun 2006 7:33 PM:

This will never be a fixable problem...

If you made IsTextUnicode("this app can break") return FALSE, then you'd just have some Chinese guy saying "when I type '桴獩愠灰挠湡戠敲歡' into notepad, save it and reopen it, it just displays some funny English characters!"

Actually, I think it would save with a BOM in that case, so it probably wouldn't do that. But you get the idea :-)

# Maurits [MSFT] on 14 Jun 2006 8:02 PM:

I think I know why IS_TEXT_UNICODE_BUFFER_TOO_SMALL is missing.

Looking at winnt.h, the four "masks" are defined as 0x000f, 0x00f0, 0x0f00, and 0xf000.  There seems to be an unwillingness to use a 17th bit for some reason; and IS_TEXT_UNICODE_BUFFER_TOO_SMALL doesn't fit any of the masks; so it was dropped.  But it's still in the documentation, which is a documentation error.
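For anyone who wants to check the arithmetic, here is a small Python sketch of how the individual tests sit under those four masks. The mask values are the ones quoted above; the per-flag values are an assumption, written down from memory of winnt.h, so verify them against your own SDK headers. The helper name describe is hypothetical.

```python
# Flag values as recalled from winnt.h (assumption: check your SDK copy).
FLAGS = {
    "IS_TEXT_UNICODE_ASCII16": 0x0001,
    "IS_TEXT_UNICODE_STATISTICS": 0x0002,
    "IS_TEXT_UNICODE_CONTROLS": 0x0004,
    "IS_TEXT_UNICODE_SIGNATURE": 0x0008,
    "IS_TEXT_UNICODE_REVERSE_ASCII16": 0x0010,
    "IS_TEXT_UNICODE_REVERSE_STATISTICS": 0x0020,
    "IS_TEXT_UNICODE_REVERSE_CONTROLS": 0x0040,
    "IS_TEXT_UNICODE_REVERSE_SIGNATURE": 0x0080,
    "IS_TEXT_UNICODE_ILLEGAL_CHARS": 0x0100,
    "IS_TEXT_UNICODE_ODD_LENGTH": 0x0200,
    "IS_TEXT_UNICODE_DBCS_LEADBYTE": 0x0400,
    "IS_TEXT_UNICODE_NULL_BYTES": 0x1000,
}

# The four masks quoted in the comment above.
MASKS = {
    "IS_TEXT_UNICODE_UNICODE_MASK": 0x000F,
    "IS_TEXT_UNICODE_REVERSE_MASK": 0x00F0,
    "IS_TEXT_UNICODE_NOT_UNICODE_MASK": 0x0F00,
    "IS_TEXT_UNICODE_NOT_ASCII_MASK": 0xF000,
}

def describe(result):
    """List (flag, mask) name pairs for every test bit set in result."""
    pairs = []
    for flag_name, bit in FLAGS.items():
        if result & bit:
            mask_name = next(m for m, v in MASKS.items() if v & bit)
            pairs.append((flag_name, mask_name))
    return pairs

# Together the masks cover exactly 16 bits, leaving no room for a 17th.
combined = 0
for value in MASKS.values():
    combined |= value

print(hex(combined))
print(describe(0x0002))
```

With only the statistics bit set, describe reports IS_TEXT_UNICODE_STATISTICS under IS_TEXT_UNICODE_UNICODE_MASK, which lines up with the two names that fired in the earlier test.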

# Pavanaja U B on 15 Jun 2006 2:30 AM:

When I save the file as Unicode, the problem disappears, as expected !?

Regards,
Pavanaja

# Michael S. Kaplan on 15 Jun 2006 9:15 AM:

Certainly, Pavanaja -- same results if you save as UTF-8 or UTF-16 (Big Endian) -- all three of those formats are unambiguous when saved in Notepad because the BOM is there, so there is no confusion....
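A rough Python sketch of why the BOM removes the guesswork. It assumes a loader that checks for a byte order mark before trying any heuristic; the function name decode_with_bom is invented for this example and is not how Notepad is known to be written.

```python
import codecs

# The three signatures relevant to Notepad's Unicode save options.
# (UTF-32 LE also starts with FF FE, but it is outside this sketch.)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data):
    """Return the decoded text if data starts with a known BOM, else None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return None  # no signature: the caller is left to guess

text = "this app can break"

# Saved as "Unicode": the FF FE signature settles the encoding.
unicode_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(decode_with_bom(unicode_bytes))  # this app can break

# Saved as plain text: no signature, so detection has to fall back on guessing.
ansi_bytes = text.encode("ascii")
print(decode_with_bom(ansi_bytes))     # None
```

Only the signature-free case ever reaches a heuristic such as IsTextUnicode, which is where the wrong guess comes from.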

# dragonfrog on 15 Jun 2006 11:48 AM:

A possible improvement - it's not insulting a software firm, but "Bush hid the facts" has the same effect as "this app can break"

# borky on 15 Jun 2006 12:55 PM:

Also works for the following string:

Bush hid the facts

# q^-o|o-^p on 17 Jun 2006 8:38 PM:

Here's a good one -- use the following line:

We can blast Microsoft for a new bug

# A fish called blue on 19 Jun 2006 3:28 AM:

A colleague noted that NOTEPAD.EXE is not the only affected application. MORE is also affected (but not TYPE), and possibly other commands and tools in Windows as well.

# Rajesh Shenoy on 19 Jun 2006 10:24 AM:

Any string with 4-3-3-5 letters in the words does it.

# east on 19 Jun 2006 4:18 PM:

ds

# grsws on 19 Jun 2006 4:19 PM:

Bush hid the facts

# Anuj on 22 Jun 2006 10:43 PM:

It happens with any string of characters of the form
aaaa aaa aaa aaaaa

# Mircea on 28 Jun 2006 7:56 PM:

Put this in Notepad: "muie fut cur maine" (without the " ")

# Skews Me on 3 Jul 2006 3:56 PM:

I RECOMMEND NOT TRYING TO REPRODUCE THIS BUG.

Ever since I tried the "bush hid the facts" Easter Egg, Notepad has been having trouble with some of my html files. Some of the problems disappear if I restart Notepad and reload the file, but there are still other problems that appear and can be reproduced.

# Michael S. Kaplan on 3 Jul 2006 4:53 PM:

I recommend not placing too much stock in the recommendation by Skews Me -- there are no long term consequences to this issue....

# Sanjay Vyas on 11 Jul 2006 6:23 AM:

Not all combinations of 4-3-3-5 will produce it. For example, Bush hid the truth does not work while Bush hid the facts does. Any explanation?

# proxy on 2 Aug 2006 12:57 PM:

"We can blast Microsoft for a new bug" is a nice one.
You can find your own.
Read
http://dhilung.blogspot.com/2006/08/technotepad-facts-behind-bush-hid.html

# South Korean Man on 10 Nov 2007 7:10 AM:

Wow I`m from south korean

IT`S SurpRISE

because Internet Chat Can

OKAY?

I`M 15 YEARS OLD

NAME:Kim Dong UK

LIVE:SEOUL

VERY THANK YOU

# Morbo on 3 Dec 2007 10:45 AM:

A different one:

Now if you type a newline, all the CrLFs are rendered as square blocks.

# Erzengel on 24 Mar 2008 7:12 PM:

I can't repro in Vista. Fixed?

# Michael S. Kaplan on 25 Mar 2008 3:01 AM:

Funny you should ask. I'll bet you have Sp1 installed! :-)

# Sameera R. on 19 Apr 2010 2:38 AM:

It doesn't happen in Win7.

Check it.

# Michael S. Kaplan on 19 Apr 2010 6:59 AM:

Check what?

If you read the comment just before yours, it points to a blog I wrote that explains how and where this was "fixed". Perhaps I should suggest you check *that*? :-)


referenced by

2010/08/14 (It wasn't me)

2010/04/01 Is the text in XKCD broken?

2008/03/25 Bush might've still hid the facts, but he can't hide them from Vista SP1/Server 2008 Notepad!

2008/03/24 Unicode not being the default is slower and leads to bugs; maybe it ought to change?

2007/12/11 How to get yourself imprisoned [by/for talking about Unicode]

2007/04/22 The Notepad encoding detection issues keep coming up

2006/12/23 Do not adjust your browser, a.k.a. sometimes two wrongs DO make a right, a.k.a. dumb quotes

2006/08/02 Hang on just a [Hansel]Minute!

2006/07/11 More on that which breaks Windows Notepad

2006/07/04 Behind Norman's 'Who needs Unicode?' post

2006/06/22 Things I [don't] like about blogging
