You didn't expect to be able to read any email on any device, did you?

by Michael S. Kaplan, published on 2011/02/25 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/02/25/10133926.aspx


Over in the Suggestion Box, Geoffrey Coram asked:

I'm the "lead developer" for the e-mail application nPOPuk.  I recently added some code to help a Russian user on a Windows CE machine: apparently, the KOI8-RU codepage is not installed on WinCE, so messages with charset="koi8-ru" were thoroughly corrupted.

So now I'm thinking about my app, which comes in Unicode and ANSI versions, and wondering:

1) Is there an easy way for the ANSI version to tell the user, "hey, you typed a Unicode character in the message body"?  I typed some Cyrillic in the window, and when the app sent WM_GETTEXT, it got back a bunch of question marks.

2) The user can select the charset (UTF-8, ISO-8859-1, KOI8-R, etc.) for sending the message; is there an easy way to tell the user, "hey, there are characters in your message that aren't available in the charset you selected"?

I suppose I could (a) stop compiling the ANSI version and (b) force all messages to be UTF-8, but that seems draconian.

This is one to take a piece at a time.

At least in the body of email messages, the ability to carry text in any of the various encodings a mail client supports has been a long-standing principle.

Not every message coming to an email client is limited to a single encoding that it understands.

Of course whether one extracts the content via RTF functions or HTML functions or some other means will largely depend on the client, though generally HTML seems to be the one that all mail clients support to some extent.

Although in theory you could support an email using any encoding via HTML, in practice there is a device-based limit on mobile devices, since the device may not be able to convert, parse, or display every encoding.

If you are using Platform Builder to build an image for a device, you may have more flexibility here, but even that only goes so far. Messages using KOI8-RU and other such code pages can suffer here, and there really isn't a good answer in the platform (though if it is a limited number of code pages one could just ship the tables for a few others)....

As to that second question: to read a message in an app, you can just try to convert it to Unicode one way or another. If that conversion fails, you can definitely warn the user that you are unable to parse the text....
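For example, here is a minimal sketch of that kind of check on Win32 (the function name and parameters are illustrative, not from any particular client): passing MB_ERR_INVALID_CHARS makes the conversion fail outright, rather than silently substituting characters, when the bytes are not valid in the claimed code page.

    #include <windows.h>

    // Hedged sketch: attempt to decode an incoming body with the declared
    // code page. MB_ERR_INVALID_CHARS makes the call fail (rather than
    // silently substituting) when the bytes are invalid in that code page
    // -- the cue to warn the user.
    int TryDecode(UINT codePage, const char *bytes, int cb,
                  WCHAR *out, int cchOut)
    {
        int cch = MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                                      bytes, cb, out, cchOut);
        if (cch == 0 && GetLastError() == ERROR_NO_UNICODE_TRANSLATION)
            return -1;  // bytes do not parse in this code page
        return cch;     // 0 on other failures (e.g. code page not installed)
    }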


Mike Dimmick on 25 Feb 2011 9:05 AM:

This story doesn't ring true - Windows CE has only ever been Unicode. It doesn't *have* ANSI versions of the 'User' APIs, e.g. GetWindowText. WinUser.h lists CreateWindow{Ex}(A/W), but coredll.lib only has CreateWindowExW. ANYTHING you want to do has to be converted via MultiByteToWideChar / WideCharToMultiByte.

And that's really the answer to the second question - use the lpUsedDefaultChar parameter to WideCharToMultiByte to find out whether it had to use the default character, because it didn't have a built-in mapping.
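Something like this rough sketch (the helper name is mine; note that lpUsedDefaultChar must be NULL for CP_UTF7 and CP_UTF8, but UTF-8 can represent everything anyway, so no check is needed there):

    #include <windows.h>
    #include <stdlib.h>

    // Sketch: does every character of 'wide' map into 'codePage' without
    // falling back to the default character? Add WC_NO_BEST_FIT_CHARS to
    // the flags if "best fit" approximations should also count as misses.
    BOOL FitsInCodePage(const WCHAR *wide, UINT codePage)
    {
        BOOL usedDefault = FALSE;
        int cb = WideCharToMultiByte(codePage, 0, wide, -1, NULL, 0, NULL, NULL);
        if (cb == 0)
            return FALSE;                 // conversion not possible at all
        char *buf = (char *)malloc(cb);
        if (buf == NULL)
            return FALSE;
        WideCharToMultiByte(codePage, 0, wide, -1, buf, cb, NULL, &usedDefault);
        free(buf);
        return !usedDefault;
    }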

If the codepage really isn't present - although my copy of the CE 5.0 source has a 21866.txt in \Public\Common\OAK\Files, which looks like some kind of source code - then they'll just have to roll their own. codepage.txt lists 57 code pages: 437, 708, 720, 737, 775, 850, 852, 855, 857, 858, 860-866, 869, 874, 932, 936, 949, 950, 1250-1258, 1361, 20000-20005, 20127, 20261, 20269, 20866, 21027, 21866, 28591-28599, 28603, 28605 and 29001.

Of course there could well be some translation layer - I haven't looked at what nPOPuk is written in - which is doing the conversion with the default ANSI codepage before the app even sees it.

Mike Dimmick on 25 Feb 2011 9:14 AM:

Oh, and KOI8-RU is an alias for KOI8-U, Ukrainian - not Russian, which is KOI8-R, Windows CP 20866. Four code points are different.

If you tried to translate it through Windows 1251 Cyrillic, you would get total gibberish - the codepoints are allocated in a completely different order.

Karellen on 25 Feb 2011 4:16 PM:

"generally HTML seems to be the one that all mail clients support to some extent."

Uhhh, many console-based mail apps, and other mail parsers/responders that prefer the "other" alternative in "multipart/alternative" messages would disagree with you there. In my experience "text/plain; charset=utf-8" is the most universally reliable type for email. Heck, utf-8 is STD 63. It's not "draconian" to prefer or support a universally supported[0], 18-year-old, backwards-compatible-with-ASCII character set/encoding.

[0] All platforms that can parse XML *must* support utf-8 to some degree, and I'm not aware of any that don't...

Geoffrey Coram on 26 Feb 2011 4:38 PM:

Actually, it seems koi8-ru is for Belorussian, not Ukrainian; the guy in question really was Russian.

Mike, there were several issues rolled together.  Indeed, CE is Unicode, but the codepage for koi8-ru isn't available, and pMultiLanguage->ConvertStringToUnicode failed.  So, indeed, I "rolled my own": I took the tables from ftp.unicode.org/.../MAPPINGS and wrote code to load those pages.  Further, when the app writes to disk, it uses CP_ACP to convert Unicode to multi-byte strings.  (A lot of the app was originally written by a guy in Japan.)
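Roughly, the loader looks like this (a simplified sketch, not the actual nPOPuk code; the unicode.org files have lines like "0x93  0x2320  # comment"):

    #include <windows.h>
    #include <stdio.h>

    // Simplified sketch of a loader for a unicode.org MAPPINGS file:
    // build a 256-entry byte -> Unicode table, then decode byte strings.
    static WCHAR g_map[256];

    int LoadMapping(const char *path)
    {
        FILE *f = fopen(path, "r");
        char line[256];
        unsigned byte;
        unsigned long ucs;
        int i;
        if (f == NULL) return 0;
        for (i = 0; i < 256; i++) g_map[i] = 0xFFFD;  // REPLACEMENT CHARACTER
        while (fgets(line, sizeof(line), f) != NULL) {
            // comment lines (starting with '#') simply fail to match
            if (sscanf(line, "0x%x 0x%lx", &byte, &ucs) == 2 && byte < 256)
                g_map[byte] = (WCHAR)ucs;
        }
        fclose(f);
        return 1;
    }

    void BytesToUnicode(const unsigned char *src, WCHAR *dst)
    {
        while (*src) *dst++ = g_map[*src++];
        *dst = 0;
    }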

I'm still struggling to understand the difference between WideCharToMultiByte (which has the lpUsedDefaultChar argument) and ConvertStringToUnicode (which doesn't, but does have a charset argument).  I don't want a multi-byte result string; I want a string of single-byte (but perhaps 8-bit) characters in the chosen character set.

The app itself doesn't do any HTML rendering, which saves me from all sorts of "eye candy" and nasty JavaScript exploits, etc. :)

mpz on 27 Feb 2011 1:44 AM:

"Tell the user "hey, you typed a character".." "The user can select the charset"

What planet are these developers from? Why on earth should the end user be concerned about character sets in the year 2011? Unnecessary choice is BAD.

I had the displeasure of using a webmail system designed exactly like this. It was extremely unintuitive and confusing to the average user.

Force UTF-8 outgoing. 99.999% of the email clients and webmails out there understand UTF-8 now. The sooner we get rid of the rest, the better.

(If you truly do not want to do that for whatever reason, keep the default as KOI8-R but *if* the user types even a single non-KOI8-R character in the email, send it out as UTF-8. No prompting, no manual having to select the character set. Asking the average user to decide something like that is just asking for trouble. I think Gmail works precisely this way - with the default settings it uses whatever code page the message you're replying to utilizes but uses utf-8 automatically when necessary)
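(In code, that fallback policy is basically a one-liner -- sketched here against the FitsInCodePage-style check from Mike's comment above; the names are illustrative:)

    #include <windows.h>

    BOOL FitsInCodePage(const WCHAR *wide, UINT codePage);  // as sketched above

    // Hedged sketch of the auto-fallback policy: use KOI8-R (20866) when
    // everything fits, otherwise switch to UTF-8 -- no prompting.
    UINT ChooseOutgoingCodePage(const WCHAR *body)
    {
        return FitsInCodePage(body, 20866) ? 20866 : CP_UTF8;
    }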

Geoffrey Coram on 27 Feb 2011 4:48 AM:

@mpz - Thanks for the interesting perspective; making things easier for the user is generally a good idea.  I will probably make UTF-8 the default (presently, it seems to choose the charset based on the default codepage).

In my defense, the charset decisions were made by the original developer, and I didn't have any particular reason to take away options.  I recently read a blog comment where someone was equally vociferous about UTF-8 being a bad choice to force upon the user (for efficiency reasons).  Certainly for the Russian user, all those Cyrillic letters fit fine in an 8-bit character from the codepage, versus two bytes each in UTF-8.  The original developer was Japanese and may have had some preferences from that perspective.  On the other hand, I got an e-mail recently (from my company's IS dept.) where more than half the body was formatting gibberish from MS Office (i.e., more formatting than content).  So, efficiency isn't exactly the top priority.

Michael S. Kaplan on 27 Feb 2011 3:28 PM:

Of course the default for outgoing mail has drawbacks for some languages in respect to size. But this issue is almost never about trying to change the default for outgoing mail; it is usually about how to read incoming mail.

Now if you can read the mail, then replying in the same encoding has one huge advantage -- you know the original sender can probably read it. Thus the best encoding for the reply is often the same as the incoming one.

I am hugely in favor of Unicode, but ignoring these issues is ignoring a reality that can really impact users. And as much as I may be a fan of Unicode, I'm a bigger fan of users....

Mike Dimmick on 28 Feb 2011 4:39 AM:

Geoffrey: OK, I didn't realise you were using MLang rather than WideCharToMultiByte/MultiByteToWideChar. MLang has a much smaller set of supported codepages, but again, looking at the CE sources, you should be getting this codepage. The supported codepages in \PUBLIC\IE\OAK\FILES\ie.reg are 852, 866, 932, 1200, 1201, 1250-1254, 1257, 20866, 21866, 28592-5, 28597, 50000, 50220-2, 50932, 51932, 65000 and 65001. I happen to have a Pocket PC 2002 device on my desk at the moment, which has the same data.

Looking at your source code, you're passing MIMECONTF_SAVABLE_MAILNEWS to EnumCodePages, which I suspect is unnecessarily restricting what you get back. I'd just pass 0 to get the maximum possible.

I do note that HKCR\MIME\Database\Charset has koi8-r (mapping it to 20866) and koi8-ru (21866), but not koi8-u. I found this article comparing koi8 encodings: segfault.kiev.ua/cyrillic-encodings - note the differences between koi8-u and koi8-ru. The CE 5.0 version seems to be a blend of the two:

0x93 0x2320 ;Top Half Integral [matches koi8-u from that page, not -ru]

0xae 0x045e ;Cyrillic Small Letter Belorussian Short U [matches -ru, not -u]

So you might need to continue using your own conversion to do it properly.

The term 'MultiByte' in the function names is simply a reflection that some of the 'ANSI' codepages (only one of them, ISO 8859-1, is *actually* an ANSI standard), mostly for the Far East, were too big for 8 bits and needed shift markers to add additional planes. These were sometimes called 'double-byte' character sets, but that's really a misnomer as some characters were represented as one byte, some as two. You will only ever get 8-bit values out of WideCharToMultiByte, but sometimes one character is represented by a pair of values.
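To illustrate with a contrived snippet (code page 932 is Shift-JIS): an ASCII letter converts to one byte, a katakana letter to two.

    #include <windows.h>

    // Illustration of mixed byte widths in a 'double-byte' code page:
    // converting single characters to code page 932 (Shift-JIS).
    void Demo(void)
    {
        WCHAR a  = L'A';        // ASCII letter
        WCHAR ka = 0x30A2;      // KATAKANA LETTER A
        int cb1 = WideCharToMultiByte(932, 0, &a,  1, NULL, 0, NULL, NULL); // 1
        int cb2 = WideCharToMultiByte(932, 0, &ka, 1, NULL, 0, NULL, NULL); // 2
        (void)cb1; (void)cb2;
    }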

I believe the ConvertString{From/To}Unicode functions exist to allow you to do runs of conversion in one buffer, rather than having to allocate a single buffer for the entire conversion. If you use WideCharToMultiByte, you might sometimes get only the lead byte of a double-byte character if the buffer was one byte too short. I think the pdwMode parameter to the function contains (or points to) the context necessary to complete the conversion properly on the next call. At the moment you're using the functions exactly as you would use WideCharToMultiByte and MultiByteToWideChar anyway - calling it once to find out how big a buffer you'd need, then a second time to do the conversion.
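Something like this sketch, assuming an IMultiLanguage pointer already obtained via CoCreateInstance on CLSID_CMultiLanguage (the chunk and buffer sizes here are arbitrary, purely for illustration):

    #include <windows.h>
    #include <mlang.h>

    // Sketch: convert Unicode to KOI8-R (20866) in chunks. dwMode carries
    // the conversion context between calls, so a run that stops mid-way
    // can be resumed cleanly on the next call.
    HRESULT ConvertInChunks(IMultiLanguage *pML, const WCHAR *src, UINT srcLen)
    {
        DWORD dwMode = 0;                             // persists across calls
        while (srcLen > 0) {
            UINT cchSrc = srcLen < 256 ? srcLen : 256;  // arbitrary chunk
            CHAR dst[1024];
            UINT cbDst = sizeof(dst);
            HRESULT hr = pML->ConvertStringFromUnicode(&dwMode, 20866,
                             const_cast<WCHAR *>(src), &cchSrc, dst, &cbDst);
            if (FAILED(hr) || cchSrc == 0)
                return FAILED(hr) ? hr : E_FAIL;
            // ... append dst[0..cbDst) to the output ...
            src += cchSrc;
            srcLen -= cchSrc;
        }
        return S_OK;
    }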

