The MB_PRECOMPOSED flag is stupid, and the MB_COMPOSITE ain't no genius either

by Michael S. Kaplan, published on 2007/06/27 00:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/26/3558548.aspx


The other day, in response to my post "Your VC++ files don't support Unicode identifiers? Drop a BOM on them!", John Bates commented:

RC.exe can properly compile .rc files saved as UTF-16LE (strangely not UTF8-with-BOM though)...

There is no brilliant technical reason why UTF-8 is not supported here, though.

Basically, with the exception of UTF-16, code page support is via a simple command line switch, as described in Using RC (The RC Command Line):

/c

     Defines a code page used by NLS conversion.

Take this doc at its word -- this switch literally tells the Resource Compiler what code page value to feed to a MultiByteToWideChar call.

(FYI, this is also why you cannot pass 1200 or 1201 for UTF-16 LE/BE -- because MultiByteToWideChar does not support these code pages!)

A call which, unfortunately, is made with the MB_PRECOMPOSED flag, the same flag that briefly came up in my post "A few of the gotchas of MultiByteToWideChar".

I say unfortunately because of that note in the MultiByteToWideChar topic:

For the code pages listed below, dwFlags must be set to 0. Otherwise, the function fails with ERROR_INVALID_FLAGS.

50220       50227       57002 through 57011
50221       50229       65000 (UTF-7)
50222       52936       42 (Symbol)
50225       54936

Note: For UTF-8, dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS.

Aha, so UTF-8 is failing in the Resource Compiler because it is always including a flag that is documented as not working with UTF-8 (or a bunch of other code pages).
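The documented rule quoted above is easy to write down as a lookup. The following is only a sketch of that rule, not anything the Resource Compiler actually does: the flag values are the ones from the Windows SDK headers, while `ZERO_FLAGS_ONLY`, `allowed_flags` and `safe_flags` are hypothetical names made up for illustration.

```python
# Flag and code page values as defined in the Windows SDK headers.
MB_PRECOMPOSED = 0x00000001
MB_COMPOSITE = 0x00000002
MB_USEGLYPHCHARS = 0x00000004
MB_ERR_INVALID_CHARS = 0x00000008
CP_UTF8 = 65001

# Code pages for which the quoted documentation says dwFlags must be 0.
# (Hypothetical name; the list itself comes from the quoted note.)
ZERO_FLAGS_ONLY = frozenset(
    [42, 50220, 50221, 50222, 50225, 50227, 50229, 52936, 54936, 65000]
    + list(range(57002, 57012))  # 57002 through 57011
)


def allowed_flags(code_page: int) -> int:
    """Return the mask of dwFlags bits the quoted note permits.

    This is a mask of individual bits only; MB_PRECOMPOSED and
    MB_COMPOSITE are still mutually exclusive with each other.
    """
    if code_page in ZERO_FLAGS_ONLY:
        return 0
    if code_page == CP_UTF8:
        return MB_ERR_INVALID_CHARS
    return (MB_PRECOMPOSED | MB_COMPOSITE
            | MB_USEGLYPHCHARS | MB_ERR_INVALID_CHARS)


def safe_flags(code_page: int, wanted: int) -> int:
    """Drop any requested bit that would cause ERROR_INVALID_FLAGS."""
    return wanted & allowed_flags(code_page)


if __name__ == "__main__":
    # What a caller that always asks for MB_PRECOMPOSED would end up passing.
    for cp in (1252, CP_UTF8, 65000):
        print(cp, hex(safe_flags(cp, MB_PRECOMPOSED)))
```

A caller that masked its flags this way would pass 0 for UTF-8 instead of a value that is documented to fail.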

Now as the title of this post indicates, the MB_PRECOMPOSED flag is stupid.

To do their work, MB_PRECOMPOSED and MB_COMPOSITE actually use the lame tables that FoldString used to support MAP_PRECOMPOSED and MAP_COMPOSITE prior to Vista (when it started using normalization). I call these tables lame since they are incomplete. But no one wanted to slow down MultiByteToWideChar by making it normalize text, and no one wanted to update this lame set of tables, so everything was left as is. They are just dumb flags to use, ever. You should just normalize if you want to get into Form C or Form D, and call it a day.
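To make "just normalize" concrete, here is a small sketch of converting first and normalizing afterward as a separate step. It uses Python's standard `unicodedata` module rather than the Win32 API (on Windows, the Vista-and-later NormalizeString function plays this role); the helper names `to_form_c` and `to_form_d` are made up for this example.

```python
import unicodedata


def to_form_c(text: str) -> str:
    """Normalize to Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def to_form_d(text: str) -> str:
    """Normalize to Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    # Decode the bytes with no composition flags at all, then normalize.
    raw = "e\u0301".encode("utf-8")   # e + COMBINING ACUTE ACCENT
    decoded = raw.decode("utf-8")

    composed = to_form_c(decoded)     # single code point U+00E9
    decomposed = to_form_d(composed)  # back to e + U+0301

    print([hex(ord(ch)) for ch in composed])
    print([hex(ord(ch)) for ch in decomposed])
```

Keeping the two steps separate means the conversion call needs no composition flags at all, and the normalization step works from the full Unicode data rather than an incomplete table.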

Anyway, I am sure that the UTF-8 issue in the Resource Compiler will see itself fixed in some upcoming version, given how easy it is to either (a) never pass a stupid flag or, as the minimal change, (b) never pass a stupid flag when it makes the function fail on a code page you would like to support.

(In an ideal world it would also recognize the UTF-8 BOM just like it recognizes the UTF-16 BOM, but again the whole minimal change thing would probably have a lot of influence here!)
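For what it is worth, recognizing a BOM is not much code. The sketch below is a guess at what such detection could look like, not a description of how RC.exe handles its input; `sniff_bom` is a hypothetical name, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 LE one.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if there is no BOM.

    Only UTF-8 and UTF-16 are considered; UTF-32 is out of scope here.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be", 2
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if present, else the (assumed) fallback code page."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "STRINGTABLE".encode("utf-8")
    print(sniff_bom(sample))
    print(decode_with_bom(sample))
```

The `cp1252` fallback is purely an assumption for the example; a real tool would use whatever code page it was given on the command line.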

 

This post is sponsored by U+feff (ZERO WIDTH NO-BREAK SPACE)


