Behind Norman's 'Who needs Unicode?' post

by Michael S. Kaplan, published on 2006/07/04 08:30 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/07/04/656051.aspx


In his usual charming style, regular newsgroup contributor Norman Diamond posted the following to the microsoft.public.win32.programmer.international and microsoft.public.word.international.features newsgroups:

My hard disk has a file, whose path will likely be wrapped by Outlook Express:
C:\Program Files\Windows CE Tools\wce500\Windows Mobile 5.0 Smartphone SDK\Samples\CPP\Win32\Mapirule\readme.txt

Among useful bits of information are found the following:

> Client痴 transport (SMS, ActiveSync, POP3) arrives.

and:

> where <clsid> represents the COM object痴 class ID GUID

That deserves an award for being self-descriptive.  A maker of such things as an operating system for Windows Mobile 5.0 Smartphones, an SDK for the same, and compilers that can target the same, just knew that they didn't have to use Unicode for documents like this because the ANSI code page would get the message across just fine.  The ANSI code page is of course the one used by Notepad on desktop Windows systems such as XP, which defaults to code page 932 as delivered by Microsoft and preinstalled on PCs.  The word 痴 really does describe the process that led to displaying the word 痴.

One might wonder if a maker of tons of documentation on how to use Unicode might want to learn how to use Notepad to save a .txt file in Unicode encoding so that this documentation file might provide information using some unknown characters other than 痴.  A certain company which is known for being 痴 might have the ability to teach them.  But a certain company which is known for being 痴 might not want to learn from them.

He does have a way with words (see the sponsor tag line for the 痴 (U+75f4) ideograph if you don't know the meaning of 痴 and want a better understanding of Norman's humor!).

It is important to look beyond the words for a moment, to see what we are actually talking about. :-)

Those who read Behind 'How to break Windows Notepad' might have a hint of what is going on here -- we are looking at one of those "misunderstanding the characters in a file" problems. Though in this case it is not an ANSI file being mistaken for a Unicode one; it is an ANSI file in one code page on a machine whose CP_ACP is a different code page....

Now I will be the first to admit that it seems foolish to rely on something as fragile as the default system code page for the readme file of a sample.

If you convert 痴 to cp 932, you get 0x9273, which, had it been treated as cp 1252, would be ’s, the clitic that is used in English to indicate a possessive. Thus the actual strings would be:

> Client’s transport (SMS, ActiveSync, POP3) arrives.

and:

> where <clsid> represents the COM object’s class ID GUID

where ’s is actually 0x92 and 0x73, which becomes U+2019 and U+0073 via cp 1252.
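The round trip is easy to reproduce. Here is a minimal Python sketch (my own illustration, not anything from Norman's post or the SDK; the variable names are made up) that produces the two bytes in question and reads them back under each code page, using Python's built-in cp1252 and cp932 codecs:

```python
# The curly apostrophe plus "s", saved the way a cp 1252 machine would save it.
raw = "\u2019s".encode("cp1252")

# Read the very same bytes back under each code page.
as_1252 = raw.decode("cp1252")  # what the readme's author intended
as_932 = raw.decode("cp932")    # what a machine whose CP_ACP is 932 displays

print(raw.hex(), repr(as_1252), repr(as_932))
```

Nothing in the file changes between the two reads; only the code page assumed by the reader does.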

Now, Microsoft Word will commonly take ' (U+0027) and autocorrect it to ’ (U+2019).

No keyboard that Microsoft releases sticks U+2019 in a file, so it really looks like the problem is that the text was edited in Word at some point. That it became 痴 is just a real bit of irony that helped Norman point out an ironic sort of a bug and helped me provide a good Unicode Lame List story. :-)

 

This post brought to you by 痴 (U+75f4, a.k.a. a CJK Unified Ideograph meaning foolish, stupid, dumb, silly)


# Adam on Tuesday, July 04, 2006 12:32 PM:

Just wondering - is MS planning on making any version of Windows use the UTF-8 codepage (65001) by default (ANSI and/or OEM) at any point in the future?

# Michael S. Kaplan on Tuesday, July 04, 2006 12:43 PM:

Hi Adam,

This is not in the current POR, due to the fact that the various "A" functions are simply not built to handle a stream that can be up to four bytes per character....
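For a sense of the sizes involved, here is a small Python sketch (an illustration only, nothing to do with the actual Windows "A" functions) that compares how many bytes one character takes in cp 932 versus UTF-8. The sample characters and the helper name `byte_len` are arbitrary choices of mine:

```python
# One character each that needs 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "\u75f4", "\U0001F600"]

def byte_len(ch: str, codec: str):
    """Return the encoded length of ch, or None if the codec cannot represent it."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None

utf8_lengths = [byte_len(ch, "utf-8") for ch in samples]
cp932_lengths = [byte_len(ch, "cp932") for ch in samples]

for ch, u, d in zip(samples, utf8_lengths, cp932_lengths):
    print(f"U+{ord(ch):04X}  utf-8: {u}  cp932: {d}")
```

A DBCS code page such as 932 never goes past two bytes per character, which is the assumption that code written for it tends to bake in.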

# Adam on Tuesday, July 04, 2006 1:21 PM:

Why? Don't they use MB_CUR_MAX (or MB_LEN_MAX for static buffers)?

# Michael S. Kaplan on Tuesday, July 04, 2006 1:55 PM:

Adam, are you kidding? These functions were written ten years ago, some even longer. It is lucky that even DBCS is supported here!

Revisiting thousands of legacy functions to make them solidly support UTF-8, working in both user and kernel mode, and asking the test team to run tens of thousands of new test cases in all of them, is simply too huge of an effort to ask of the Windows team.

# Adam on Tuesday, July 04, 2006 3:18 PM:

Yeah, MB_{CUR,LEN}_MAX were only part of the original 1989 ANSI/1990 ISO C standard (and would therefore have been in a number of drafts and included in some implementations some time before that). 16 years is _way_ too short a time to allow for that sort of thing to eke its way into the Windows codebase.

And given that it'll take, what, maybe 5 years(?), between changing MB_LEN_MAX to 4 and a significant portion of Windows + Apps actually supporting it properly, the longer that change is put off, the better. Yes?

# Michael S. Kaplan on Tuesday, July 04, 2006 4:27 PM:

Hmmm... sarcasm ignored. :-)

Coulda, woulda, shoulda -- it is a bit too late to argue how functions should have been written over a decade ago.

If you want to support Unicode, there are thousands of functions to support that -- and functions to convert between UTF-8 and the chosen preferred form of Unicode for the platform.

There really isn't an interest or desire in re-architecting ten thousand functions to support something that was not kept in mind by the hundreds of developers who did the original writing.

Like I said, coulda, woulda, shoulda. But we can't go back, and Unicode is supported (and the old ISO specs had a lot of problems with their Unicode support that were not solved until the 1999 update, while Microsoft was doing a much better job in the meantime!)....

# Ruben on Tuesday, July 04, 2006 5:09 PM:

Somehow the adolescent in me wants to try and find sentences that translate into funny Chinese ones... I think I'm going to have a talk with his mom about that ;-)

But seriously. It's a pity DBCS doesn't include UTF-8 support. Although UTF-8 probably doesn't really fall in the 'DB' category.

Besides, the more file formats are moved to XML or UTF-8 + BOM, the less this matters. Except for Notepad, which doesn't look at <?xml encoding="..."?>

# Maurits on Tuesday, July 04, 2006 7:58 PM:

Sometimes I wish that Group Policy had a "no smart quotes" setting :(

# Adam on Tuesday, July 04, 2006 8:04 PM:

Michael: Excessive sarcasm aside (sorry :), I don't see what you're getting at. So a bunch of old functions weren't necessarily written to best practices all the time. It happens, I'm aware of that. But why can't MS go back?

I'm sure as part of their recent security drive they've been going back over all that old code and fixing up calls to gets() or similar brokenness. Saying "Oh, that bug is _old_, therefore we can't fix it" seems, I dunno, like a bit of a poor excuse. Bugs that have just come to light, but happen to have existed for ages, still get fixed; don't they?

Apologies again - I realise that you didn't write the 10000 legacy functions, and probably don't manage the people who'd be fixing them if that were to happen, so don't take this too seriously. It just pains me when MS doesn't want to put resources to implementing something which would be a technically good idea and which other vendors (e.g. MacOS X and GNU/Linux (and, of course, Plan 9 :-) can use UTF-8 as the default user (and system?) encoding) are implementing, and I tend to go off on one a bit.


Ruben: UTF-8 + BOM? Bleargh. Check RFC 3629 (http://www.faqs.org/rfcs/rfc3629.html) section 6 and the Unicode UTF & BOM FAQ (http://www.unicode.org/unicode/faq/utf_bom.html#BOM)

Yeah, you /can/ do it, but don't unless you /really/ have to.

UTF-8 can be sniffed pretty well anyway (better than a few other MBCSs) as the bit-pattern requirements for bytes 2-n of multibyte sequences are pretty strict.
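A rough Python sketch of that kind of sniffing, assuming a purely structural check is good enough (the function name `looks_like_utf8` is hypothetical, and this deliberately skips the finer rules about overlong forms and surrogates that a strict decoder enforces):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Check that every lead byte is followed by the right number of 10xxxxxx bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            need = 0              # plain ASCII
        elif 0xC2 <= lead <= 0xDF:
            need = 1              # two-byte sequence
        elif 0xE0 <= lead <= 0xEF:
            need = 2              # three-byte sequence
        elif 0xF0 <= lead <= 0xF4:
            need = 3              # four-byte sequence
        else:
            return False          # stray continuation byte or invalid lead
        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += 1 + need
    return True

print(looks_like_utf8("Client\u2019s".encode("utf-8")))   # genuine UTF-8
print(looks_like_utf8("Client\u2019s".encode("cp1252")))  # the readme's bytes
```

The cp 1252 bytes from the readme fail right away, because 0x92 has the bit pattern of a continuation byte but no lead byte in front of it. Note that pure ASCII passes too, so a check like this can rule UTF-8 out far more reliably than it can rule it in.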

# Michael S. Kaplan on Tuesday, July 04, 2006 9:01 PM:

Adam,

This is more than just a bug fix in many cases. This is a literal per-function review, modification, and test cycle for whole new scenarios across thousands and thousands of functions -- it is simply too huge of a project, for NO GAIN. Because all it would do is be a slightly slower version of the same one that has already existed.

We already support Unicode. The non-Unicode version is there for legacy, and is not being updated. It is only a technically good idea for people who don't want to update their byte-based code to support the actual Unicode interface that has been on the platform for over a decade....

# Michael S. Kaplan on Tuesday, July 04, 2006 9:10 PM:

To look at it another way, since you are comparing it to the Security reviews that have happened....

SECURITY -- learning about the latest best practices and then assessing threats and fixing them has a specific benefit -- a more secure code base.

But the update you refer to would also break over a decade of third-party applications that have been written, not to mention earlier versions of MS products that would also be broken. It is intentionally engineering the single largest backcompat break of all time, and spending millions of dollars to do it!

# Maurits on Wednesday, July 05, 2006 12:02 AM:

What about adding a file system metadata field that stores MIME type information?  That would have solved this problem and the "How to break Windows Notepad" problem.

Imagine... a .txt file with a MIME type metadata field of
text/plain;charset=iso-8859-1
or
text/plain;charset=utf-8
etc.

# Michael S. Kaplan on Wednesday, July 05, 2006 12:06 AM:

For a plain old text file? Hmmm.

On the whole, this can usually be thought of as overuse/misuse of Notepad, if you know what I mean....

It is why I laugh every time someone complains about how Notepad's UTF-8 BOM is breaking their UNIX shell scripts. Talk about irony. :-)
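For anyone who has not hit this one, a short Python sketch of the mechanics. It assumes Python's `utf-8-sig` codec is a fair stand-in for what Notepad writes when saving as UTF-8; the sample script and the helper name `strip_utf8_bom` are made up for illustration:

```python
import codecs

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")          # no signature
with_bom = script.encode("utf-8-sig")   # same text with EF BB BF prepended

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if there is one; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(with_bom[:5])                 # the BOM now sits in front of "#!"
print(plain.startswith(b"#!"))
print(with_bom.startswith(b"#!"))
```

The shebang mechanism looks for the two bytes `#!` at the very start of the file, so three extra bytes in front of them are enough to stop the interpreter line from being recognized.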

# Adam on Wednesday, July 05, 2006 8:46 AM:

Urgh! Under what conditions does notepad insert a UTF-8 BOM? If I open a UTF-8 (without BOM) file in notepad as UTF-8, make changes and save it again, does it suddenly get a BOM?

# Michael S. Kaplan on Wednesday, July 05, 2006 10:16 AM:

Yes Adam, that is what happens.

See http://blogs.msdn.com/michkap/archive/2005/01/20/357028.aspx for more info on the argument....

# Adam on Wednesday, July 05, 2006 7:32 PM:

OK, I can get that hint - it's all been said before.

*sigh* :-/

