You can do CESU-8 if you need to; we went in a slightly different direction....

by Michael S. Kaplan, published on 2012/01/23 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/01/23/10259523.aspx

We just upgraded our customer desktops from Windows 2000 to Windows 7, and we're seeing a major break in our text processing app.

We've debugged the problem pretty thoroughly, and it doesn't look like it's our app at all. Notepad seems to be breaking our Plane 1 and Plane 2 text. That seems like it must be impossible; isn't support of supplementary characters a Windows 7 feature?

I'll admit I was confused at first, though the problem he described seemed kind of familiar.

The app was basically supporting supplementary characters on Windows before we really were.

But there was a weird time before XP shipped when we were okay with the six-byte form for supplementary characters, before Unicode got more explicit about considering it to be ill-formed and before we started conforming to Unicode's stricter definition....

Dan's Line of Business app was essentially using CESU-8, not UTF-8. Add in the weird difference between how Notepad initially detects UTF-8 and how it converts the data -- described previously in blogs like (It wasn't me) -- and the solution becomes clearer.

Personally, I'd recommend the former option -- the latter is kind of contrary to what Windows, Microsoft, and Unicode are doing these days.

Though if an application has a heavy investment in the 6-byte form, then as long as it is kept internal to the app (or properly marked when communicating with those who understand it), it isn't the end of the world....
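For an app that keeps the 6-byte form internally, the conversion at the boundary could look roughly like the Python sketch below. This is not what Dan's app or Windows does; the function name cesu8_to_utf8 is made up for illustration, and it leans on Python's surrogatepass error handler to let the surrogate halves through before they are joined back up.

```python
def cesu8_to_utf8(data: bytes) -> bytes:
    """Convert CESU-8 bytes to well-formed UTF-8 (hypothetical helper)."""
    # 'surrogatepass' makes the UTF-8 decoder accept the 3-byte
    # encodings of surrogates, yielding a str with separate halves.
    halves = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each high/low pair into one
    # supplementary code point. The strict UTF-16 decode raises
    # UnicodeDecodeError if a surrogate is left unpaired.
    text = halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


# U+10400 in CESU-8 (ED A0 81 ED B0 80) becomes F0 90 90 80 in UTF-8.
converted = cesu8_to_utf8(b"\xed\xa0\x81\xed\xb0\x80")
```

Note that this sketch is lenient in one direction: input that already contains well-formed 4-byte UTF-8 sequences passes through unchanged rather than being flagged.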

Uh-uh. No bueno. The so-called "6-byte form" -- applying the UTF-8 algorithm to UTF-16 code units -- was NEVER valid. UTF-8 has always been defined in terms of "characters" or abstract code points. All Unicode did was insist that apps detect and reject the invalid form. Dan has to convert his data. No defense.
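To make the distinction concrete, here is a rough Python sketch of the two encodings side by side. The helper names cesu8_encode and is_well_formed_utf8 are invented for this example; the first applies the UTF-8 bit layout to each UTF-16 code unit, and the second simply relies on Python's strict UTF-8 decoder, which rejects encoded surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text the '6-byte form' way (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points come out the same as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Split into a UTF-16 surrogate pair, then push each code
            # unit through the UTF-8 algorithm: 3 bytes + 3 bytes.
            v = cp - 0x10000
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def is_well_formed_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ch = "\U00010400"                # a Plane 1 character
as_utf8 = ch.encode("utf-8")     # F0 90 90 80 (4 bytes)
as_cesu8 = cesu8_encode(ch)      # ED A0 81 ED B0 80 (6 bytes)
```

A conformant decoder treats the second byte sequence as ill-formed, so is_well_formed_utf8(as_cesu8) is False while is_well_formed_utf8(as_utf8) is True.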

Personally I'm fine with using whatever bizarre encoding(s) someone wants to use so long as they: don't lose information, are sufficiently specified, and don't have bad performance re-encoding. The only problem* with CESU-8 is when people claim it's UTF-8!

* Other than having to add some otherwise unnecessary code - if you can simplify things, do so!

I mean, which encodings did Windows support prior to UTF-8 anyway? I thought that it was all ASCII then.

We supported a bunch of different ACPs and OEMCPs, each of which had limited coverage....

@Simon: Or when they expect Windows (or other OS, or apps) to support broken UTF-8 as if it were well-formed. Even if it used to.

There's another problem with CESU-8: it's unnecessarily large.
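The size difference only shows up for supplementary characters: each one takes 4 bytes in UTF-8 but 6 in CESU-8, and everything in the BMP is the same size in both. A small Python sketch of that arithmetic, with encoded_sizes being a name made up for this example:

```python
def encoded_sizes(text: str) -> tuple:
    """Return (utf8_bytes, cesu8_bytes) for text (hypothetical helper)."""
    utf8_len = len(text.encode("utf-8"))
    # Each code point above U+FFFF costs 2 extra bytes in CESU-8
    # (two 3-byte surrogates instead of one 4-byte sequence).
    supplementary = sum(1 for ch in text if ord(ch) > 0xFFFF)
    return utf8_len, utf8_len + 2 * supplementary


sizes = encoded_sizes("a\U00010400")   # (5, 7)
```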
