The one code page that changed recently

by Michael S. Kaplan, published on 2007/08/04 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/04/4218091.aspx

But there is one exception to that rule: a code page that changed just recently (and which, should the circumstances that led to the change happen again, could even change again).

Changes have happened to this code page from time to time in order to make sure that its conversions are conformant to the Unicode definition, which has undergone subtle changes in relation to invalid sequences (such as unpaired surrogates).

When those changes happen, it is hard to know what to do with a code page that is defined as following Unicode -- though in the end the decision was made that the promise of that definition would trump the promise to make no changes in the case of UTF-8, since the changes themselves relate to security issues.

Does this mean that Vista, or some forthcoming version of Windows, claims to actually have a standards-compliant UTF-8 implementation? Does it also mean that someone tested it (you, perhaps)? Through which APIs is this compliant implementation exposed?

It certainly would be nice if in a few years people could expect to use Unicode in Windows without tremendous pain and inconvenience.

Actually, several versions have been compliant (with the ever-changing definition provided by Unicode), including the one in Vista and the one in the latest release of .NET. Both the error-out behavior we give via the flag and the U+FFFD substitution we give without it are conformant.
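To make the two conformant behaviors concrete, here is a minimal sketch of "error out" versus "substitute U+FFFD". It uses Python's built-in UTF-8 codec purely as an illustration; it is not the Windows or .NET converter, and the helper names `decode_strict` and `decode_replacing` are made up for this example.

```python
# Illustration only: Python's built-in codec, not the Windows or .NET converter.
# The bytes 0xC0 0xAF are an overlong (and therefore invalid) encoding of "/".
bad = b"abc\xc0\xafdef"


def decode_strict(data: bytes):
    """Hypothetical helper: return the text, or None if the input is not valid UTF-8."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def decode_replacing(data: bytes) -> str:
    """Hypothetical helper: decode, substituting U+FFFD for invalid input."""
    return data.decode("utf-8", errors="replace")


strict_result = decode_strict(bad)        # the whole conversion fails
replaced_result = decode_replacing(bad)   # invalid bytes become U+FFFD

print(strict_result)
print(replaced_result.encode("unicode_escape").decode("ascii"))
```

Either way, the invalid bytes never turn into a "/" in the output, which is the property that matters for the security cases.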

We know that in the past at least the MultiByteToWideChar family and the Internet Explorer code haven't obeyed the specification. This wasn't because it was "ever changing" but because they either simply weren't tested to see whether they actually did what the standard said or they were tested and no-one cared that they failed.

Let me be absolutely clear, the UTF-8 specification does not now and never has said "Don't worry about the top bits on trailing bytes, they're optional and it's much faster to ignore them" nor did it say "Arbitrary UTF-16 code units are the same thing as Unicode codepoints, so you can just encode them as UTF-8 with no intermediate steps and it'll be fine". The only thing that changed was that in 2001 accepting overlong sequences went from being an optional and (to everyone else except Microsoft apparently) obviously insecure practice to being specifically called out as forbidden in the standard.
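The three failure modes named in that comment (ignoring the top bits of trailing bytes, encoding UTF-16 surrogate code units directly, and accepting overlong sequences) can be sketched with a few byte strings. This is an illustration using Python's strict UTF-8 codec, not a test of any Windows API; `is_valid_utf8` and `naive_decode_two_bytes` are hypothetical helpers, and the naive decoder is deliberately wrong to show why the lenient behavior is a security problem.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def naive_decode_two_bytes(lead: int, trail: int) -> str:
    """Deliberately lenient two-byte decoder (an assumption for illustration):
    it masks off the top bits without checking them or the resulting range."""
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


samples = {
    # Overlong form of "/" (U+002F), which must be encoded as the single byte 0x2F.
    "overlong slash": b"\xc0\xaf",
    # Lead byte 0xC3 followed by 0x28, which lacks the 10xxxxxx top bits.
    "bad trail byte": b"\xc3\x28",
    # U+D83D and U+DE00 each encoded as a three-byte sequence.
    "encoded surrogates": b"\xed\xa0\xbd\xed\xb8\x80",
    # The same character, U+1F600, encoded correctly as four bytes.
    "well-formed U+1F600": b"\xf0\x9f\x98\x80",
}

results = {name: is_valid_utf8(raw) for name, raw in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")

# The lenient decoder turns the overlong bytes into a real "/",
# which is how a path or filter check done on the raw bytes gets bypassed.
smuggled = naive_decode_two_bytes(0xC0, 0xAF)
```

A strict decoder rejects the first three samples and accepts only the last one.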

So hence I'm interested in whether you actually tested it, and which APIs you believe are fixed. Call it cynicism born of doing QA on real software if you like.

Please test with the latest .NET 2.0 and patches or Vista RTM, and then get back to me.

I have made a conformance claim, and would suggest it is now on you to prove me wrong. Complaining about what came before, at a time when MUCH was in flux and the standard was more relaxed about what was allowed than what was admitted, is not really productive.

If you know what I mean (and even if you don't!). :-)