by Michael S. Kaplan, published on 2006/07/11 10:26 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/11/662342.aspx
In the latest comment to the post that keeps on going (Behind 'How to break Windows Notepad'), Sanjay Vyas asks:
Not all combinations of 4-3-3-5 will produce it. For example, Bush hid the truth does not work while Bush hid the facts does. Any explanation?
Well, let us look at the two strings. First:
Bush hid the facts
0042 0075 0073 0068 0020 0068 0069 0064 0020 0074 0068 0065 0020 0066 0061 0063 0074 0073
becomes:
畂桳栠摩琠敨映捡獴
7542 6873 6820 6469 7420 6568 6620 6361 7374
The obvious question is why the second string does not do the same thing. Why does:
Bush hid the truth
0042 0075 0073 0068 0020 0068 0069 0064 0020 0074 0068 0065 0020 0074 0072 0075 0074 0068
not become the analogous:
畂桳栠摩琠敨琠畲桴
7542 6873 6820 6469 7420 6568 7420 7572 6874
exactly?
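To see where the CJK text comes from, here is a minimal sketch of the reinterpretation involved: the single-byte text is read two bytes at a time as little-endian UTF-16 code units. The helper name `reinterpret_as_utf16le` is made up for illustration and has nothing to do with what Notepad calls internally.

```python
def reinterpret_as_utf16le(text: str) -> str:
    """Encode text as one byte per character, then decode the same bytes as UTF-16LE."""
    raw = text.encode("ascii")      # 18 bytes for each of the two sample strings
    return raw.decode("utf-16-le")  # each byte pair becomes one 16-bit code unit


for sample in ("Bush hid the facts", "Bush hid the truth"):
    decoded = reinterpret_as_utf16le(sample)
    code_points = " ".join(f"{ord(ch):04X}" for ch in decoded)
    print(f"{sample!r} -> {decoded} ({code_points})")
```

Both strings have an even number of bytes, so each one decodes cleanly to nine code units; an odd-length string would raise a `UnicodeDecodeError` in this sketch.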
Neither string is very useful from a meaning standpoint, so we can dispel conspiracy theories involving both Japan and China right away (thankfully!).
If you run both bits of text through IsTextUnicode with all tests enabled, the first one returns TRUE and sets only IS_TEXT_UNICODE_STATISTICS, which means it passed only the statistical tests.
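For anyone who wants to try this at home, a rough sketch of the call through Python's `ctypes` follows. It assumes the documented signature of `IsTextUnicode` in advapi32.dll and the `winnt.h` value of `IS_TEXT_UNICODE_STATISTICS`; passing `0xFFFF` to request every test is my assumption, and the helper name `probe_is_text_unicode` is hypothetical. On anything other than Windows the helper simply returns `None`.

```python
import ctypes
import sys

IS_TEXT_UNICODE_STATISTICS = 0x0002  # flag value from winnt.h
ALL_TESTS = 0xFFFF                   # assumption: covers all four documented flag masks


def probe_is_text_unicode(data: bytes):
    """Return (result, flags) from IsTextUnicode, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    func = ctypes.windll.advapi32.IsTextUnicode
    func.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    func.restype = ctypes.c_int
    flags = ctypes.c_int(ALL_TESTS)  # in: tests to run, out: tests that passed
    result = func(data, len(data), ctypes.byref(flags))
    return bool(result), flags.value


for sample in (b"Bush hid the facts", b"Bush hid the truth"):
    print(sample, probe_is_text_unicode(sample))
```

The flags come back in the same integer that requested the tests, so comparing `flags` against `IS_TEXT_UNICODE_STATISTICS` shows whether the statistical tests were the only ones that passed.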
In a comment to the digg thread, neko asked (but no one answered):
I wonder if this will give the same result if you run notepad.exe in wine? Does wine emulate the dodgy isTextUnicode() behaviour as well?
I am not sure, though I am inclined to doubt that they are running the same statistical tests. A bit of Google spelunking suggests that the code is here somewhere, but I decided not to look myself. Hopefully it is not a dead link. Someone can tell me later if I was right. :-)
I won't post the actual code for IsTextUnicode -- no sense getting in that kind of trouble, and even if that did not matter it is kind of embarrassing code....
(As a side note, the results on Windows vary a little bit depending on the default system locale, as the function looks a bit at lead bytes -- specifically at whether the ratio of bytes in the string to lead bytes, according to a DBCS default system code page, is 2:1 -- which could in theory mean that the results vary some on a Chinese, Japanese, or Korean system....)
The tests (on Windows) are rather arbitrary and consist of a few parts, but the biggest piece is a comparison of the fluctuation among the high bytes with the fluctuation among the low bytes, and it is the difference between various high bytes and low bytes that the second string is failing on.
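To make "fluctuation between high bytes and low bytes" a bit more concrete, here is a purely illustrative sketch. It is not the Windows code and does not use its thresholds; the function name `byte_fluctuation` and the choice of summing absolute differences between neighboring bytes are my own assumptions about one way such a measure could look.

```python
def byte_fluctuation(data: bytes) -> tuple[int, int]:
    """Return (low-byte change, high-byte change) for data viewed as UTF-16LE.

    Illustrative only: sums absolute differences between consecutive bytes
    at even offsets (low bytes) and at odd offsets (high bytes).
    """
    low = data[0::2]   # low byte of each little-endian 16-bit unit
    high = data[1::2]  # high byte of each little-endian 16-bit unit

    def total_change(stream: bytes) -> int:
        return sum(abs(b - a) for a, b in zip(stream, stream[1:]))

    return total_change(low), total_change(high)


for sample in (b"Bush hid the facts", b"Bush hid the truth"):
    print(sample, byte_fluctuation(sample))
```

Real UTF-16 text in a single script tends to show high bytes that barely move while the low bytes jump around. Note that this simple measure gives nearly the same picture for both sample strings, so it does not by itself reproduce the function's different verdicts.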
I honestly see no good reason for it to return TRUE here in either string, though I ran into problems even trying to fix this bug with CRLF due to it breaking a use of the function in detecting a Unicode JScript file, so changing the tests that the function does here is probably a no-no.
Though it really is not a conspiracy with Microsoft making critical comments on the president's use of TRUTH or FACTS, I doubt I will have much luck convincing people of that.
To me the more interesting conclusion here is that the passing of an arbitrary Unicode String like "畂桳栠摩琠敨琠畲桴" to IsTextUnicode returns FALSE due to these statistical tests. Once again, this is just not a function I like (or trust!) very much.
But it is not some kind of easter egg. Truly, it is just a dumb algorithm!
This post brought to you by 桴 (U+6874, a CJK ideograph)