More on that which breaks Windows Notepad

by Michael S. Kaplan, published on 2006/07/11 10:26 -04:00, original URI:

In the latest comment to the post that keeps on going (Behind 'How to break Windows Notepad'), Sanjay Vyas asks:

Not all combinations of 4-3-3-5 will produce it. For example, Bush hid the truth does not work while Bush hid the facts does. Any explanation?

Well, let us look at the two strings. First:

Bush hid the facts
0042 0075 0073 0068 0020 0068 0069 0064 0020 0074 0068 0065 0020 0066 0061 0063 0074 0073

becomes the following when the same bytes are read as little-endian UTF-16 code units:

7542 6873 6820 6469 7420 6568 6620 6361 7374

The obvious question is why the second string does not do the same thing. Why does:

Bush hid the truth
0042 0075 0073 0068 0020 0068 0069 0064 0020 0074 0068 0065 0020 0074 0072 0075 0074 0068

not become the analogous:

7542 6873 6820 6469 7420 6568 7420 7572 6874
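
The two code-unit lists above can be reproduced by taking the ASCII bytes of each sentence and reading them back as little-endian UTF-16, which is roughly what happens once the detection guesses "Unicode". A minimal Python sketch (the helper name `as_utf16le_units` is made up for illustration and is not part of any API):

```python
def as_utf16le_units(text: str) -> list[str]:
    """Encode text as ASCII, then reread the same bytes as UTF-16LE code units."""
    raw = text.encode("ascii")         # one byte per character
    misread = raw.decode("utf-16-le")  # two bytes per character, low byte first
    return [f"{ord(ch):04X}" for ch in misread]


for sentence in ("Bush hid the facts", "Bush hid the truth"):
    print(sentence, "->", " ".join(as_utf16le_units(sentence)))
```

Both sentences have an even number of bytes (18), which is why each one maps cleanly onto nine 16-bit code units.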


Neither string is very useful from a meaning standpoint, so we can dispel conspiracy theories involving both Japan and China right away (thankfully!).

If you run both bits of text through IsTextUnicode with all tests enabled, the first one returns TRUE and reports only IS_TEXT_UNICODE_STATISTICS, which means it passed only the statistical tests.
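
For anyone who wants to try this themselves, here is a rough sketch of calling the function from Python through `ctypes`. It is Windows-only and returns `None` elsewhere. The wrapper name `is_text_unicode` and the `ALL_TESTS` constant are my own, and setting every bit of the low WORD to request "all tests" is an assumption; results may also differ by Windows version and system locale.

```python
import ctypes
import sys

IS_TEXT_UNICODE_STATISTICS = 0x0002  # value from the Windows SDK headers
ALL_TESTS = 0xFFFF                   # assumption: every bit of the low WORD set


def is_text_unicode(data: bytes, tests: int = ALL_TESTS):
    """Return (result, flags) from IsTextUnicode, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    fn = ctypes.WinDLL("advapi32").IsTextUnicode
    fn.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    fn.restype = ctypes.c_int
    flags = ctypes.c_int(tests)      # in: tests to run; out: tests that passed
    ok = fn(data, len(data), ctypes.byref(flags))
    return bool(ok), flags.value


for sentence in (b"Bush hid the facts", b"Bush hid the truth"):
    print(sentence, is_text_unicode(sentence))
```

Comparing the returned flags against `IS_TEXT_UNICODE_STATISTICS` shows whether the statistical test was the only one that fired.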

In a comment to the digg thread, neko asked (but no one answered):

I wonder if this will give the same result if you run notepad.exe in wine? Does wine emulate the dodgy isTextUnicode() behaviour as well?

I am not sure, though I am inclined to doubt that they are running the same statistical tests. A bit of Google spelunking suggests that the code is here somewhere, but I decided not to look myself. Hopefully it is not a dead link. Someone can tell me later if I was right. :-)

I won't post the actual code for IsTextUnicode -- no sense getting in that kind of trouble, and even if that did not matter it is kind of embarrassing code....

(As a side note, the results on Windows vary a little bit depending on the default system locale, since the function also looks at lead bytes, specifically at whether the ratio of bytes in the string to lead bytes, according to a DBCS default system code page, is 2:1. In theory that means the results could vary some on a Chinese, Japanese, or Korean system....)

The tests (on Windows) are rather arbitrary and consist of a few parts, but the biggest piece compares how much the high bytes fluctuate with how much the low bytes fluctuate, and it is the difference between various high bytes and low bytes that the second string fails on.
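
To make "fluctuation between high bytes and low bytes" concrete, here is a small Python sketch of one way such a measure could be computed. To be clear, this is not the Windows code (which is not shown here), and the function name `byte_fluctuation` and the choice of summing absolute differences between neighbouring bytes are assumptions made purely for illustration.

```python
def byte_fluctuation(data: bytes) -> tuple[int, int]:
    """Return (low, high): summed absolute change between successive low bytes
    and between successive high bytes, treating data as UTF-16LE code units."""
    low_bytes = data[0::2]   # first byte of each little-endian pair
    high_bytes = data[1::2]  # second byte of each pair
    low = sum(abs(b - a) for a, b in zip(low_bytes, low_bytes[1:]))
    high = sum(abs(b - a) for a, b in zip(high_bytes, high_bytes[1:]))
    return low, high


for sentence in ("Bush hid the facts", "Bush hid the truth"):
    low, high = byte_fluctuation(sentence.encode("ascii"))
    print(f"{sentence!r}: low-byte fluctuation {low}, high-byte fluctuation {high}")
```

The intuition is that genuine UTF-16LE English text has high bytes that are all zero, so they do not fluctuate at all while the low bytes do. Since the real tests consist of several parts, these two numbers alone should not be expected to reproduce the function's verdict on either string.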

I honestly see no good reason for it to return TRUE for either string, though I ran into problems even trying to fix this bug with CRLF, because the fix broke a use of the function to detect a Unicode JScript file, so changing the tests that the function does here is probably a no-no.

Though it really is not a conspiracy with Microsoft making critical comments on the president's use of TRUTH or FACTS, I doubt I will have much luck convincing people of that.

To me the more interesting conclusion here is that passing an arbitrary Unicode string like "畂桳栠摩琠敨琠畲桴" to IsTextUnicode returns FALSE due to these statistical tests. Once again, this is just not a function I like (or trust!) very much.

But it is not some kind of easter egg. Truly, it is just a dumb algorithm!


This post brought to you by 桴 (U+6874, a CJK ideograph)

# Dean Harding on 11 Jul 2006 7:53 PM:

Just goes to show... there are lies, damned lies and statistics.

# Michael S. Kaplan on 11 Jul 2006 10:19 PM:

Hmmm... new version:

there are lies, damned lies, statistics, and there is the IsTextUnicode function.


# Michael Dunn_ on 11 Jul 2006 11:21 PM:

Maybe this calls for an IsTextReallyUnicode API, with better algorithms? ;)  (in the tradition of RealDriveType and RealChildWindowFromPoint)

# Michael S. Kaplan on 12 Jul 2006 1:04 AM:

Hi Mike --

Actually, we have a whole WORD left for new flags that could be added (look at the old IsTextUnicode post for a whole bunch of expansion ideas!)....

# Maximilian Haru Raditya on 12 Jul 2006 6:29 AM:

So, I'd like to ask:

has this bug already been fixed?

# Michael S. Kaplan on 12 Jul 2006 8:28 AM:

Hi Maximilian --

No, it has not (in fact the one bug I really tried to fix for Malayalam in IsTextUnicode had to have its fix backed out!).

But although it is a bit annoying, the actual user scenario is important -- how crucial are these text files that repro the problem? How common are they? There is lots that I hate about IsTextUnicode, but this bug is more of an oddity than a "must fix" issue....

# Maximilian Haru Raditya on 13 Jul 2006 1:40 AM:

Hi Michael,

I can see that the actual user scenario is important and that this is not a "must fix" issue. To me, too, it looks like a rare occurrence.

I'm just hoping that future user scenarios will not be broken by this [IsTextUnicode], though I can't quite figure out right now what those scenarios would look like. I hope everything will be fine.

Anyway, thanks for the insight under the hood.


referenced by

2008/03/25 Bush might've still hid the facts, but he can't hide them from Vista SP1/Server 2008 Notepad!

2007/04/22 The Notepad encoding detection issues keep coming up
