Your VC++ files don't support Unicode identifiers? Drop a BOM on them!

by Michael S. Kaplan, published on 2007/06/22 15:10 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/22/3466480.aspx

What with Unicode being the default way that things are compiled in Visual Studio, the fact that identifiers were limited to ANSI has really been a sore point for a lot of people.

Thankfully, VC++ Architect Mark Hall helped set me straight on a not-entirely-well documented feature in the latest version of Visual C++.

If you don't like the lack of Unicode identifiers, then all you have to do is drop a BOM on your UTF-8 source!
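For anyone who wants to script that step, here is a minimal sketch of checking for the UTF-8 BOM (the bytes EF BB BF, which is U+FEFF encoded as UTF-8) and prepending it when it is missing. The names `kUtf8Bom`, `has_utf8_bom`, and `with_utf8_bom` are made up for this illustration and are not part of any Visual C++ API. The sketch works on an in-memory byte buffer and assumes the caller reads and writes the source file in binary mode.

```cpp
#include <cassert>
#include <string>

// The UTF-8 encoding of U+FEFF, the byte order mark.
// (Hypothetical helper names, for illustration only.)
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Returns true if the buffer already starts with the UTF-8 BOM.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns the buffer with a BOM prepended, unless one is already there,
// so applying it twice does not stack up two BOMs.
std::string with_utf8_bom(const std::string& bytes) {
    std::string out = has_utf8_bom(bytes) ? bytes : kUtf8Bom + bytes;
    assert(has_utf8_bom(out));
    return out;
}
```

Most editors can do the same thing from a "save with encoding" option; the helper only matters if you need to convert a batch of files.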

And you do need the BOM for UTF-8 in this case (no matter how controversial that requirement may be). As one of the cool test leads over there pointed out:

In theory the BOM probably shouldn't be needed for UTF-8 given how easy UTF-8 detection is, but since NLS didn't provide a function I can hardly blame them for not wanting to go off and write their own (note that it *should* in my opinion be a part of IsTextUnicode, as I have pointed out before, but I don't know when (or even if) that might be happening).
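To give a sense of what "easy" means here, the following is a rough sketch of a UTF-8 well-formedness check. It is not what the compiler does and it is not IsTextUnicode; `is_valid_utf8` is a hypothetical name used only for this example. It walks the bytes, checks that each lead byte is followed by the right number of continuation bytes, and rejects overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if every byte sequence in `s` is well-formed UTF-8.
// (Hypothetical helper, for illustration only.)
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }               // plain ASCII byte
        std::size_t len;
        unsigned long cp;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;            // stray continuation or invalid lead byte
        if (n - i < len) return false;                 // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;      // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;            // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;               // beyond Unicode range
        i += len;
    }
    return true;
}
```

A detector built on this would still be a heuristic: a pure-ASCII file passes the check but says nothing about the author's intent, and a short file in a legacy code page can pass by accident. That ambiguity is one argument for an explicit signal like the BOM.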

The other problem would be that all of the important docs that refer to things like identifiers are still making ASCII assumptions in 7.1, 8.0, 8.5, and even current (preliminary) 9.0 docs.

I'll say more on this soon, both as soon as I find out what the answers are and as soon as I get the doc story to be updated (I assume the former will happen prior to the latter).

This post is sponsored by U+feff (ZERO WIDTH NO-BREAK SPACE, a.k.a. "Da BOM")

Maybe fixing rc handling would be higher priority?

https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=276954

I mean, resources are created to store international text.

re: C/C++ identifiers - I think the C and C++ standards probably bear some of the responsibility for ASCII identifiers - don't be *too* hard on the C++ compiler team :-)

Actually, the compiler team did a bunch of work here -- but somewhere between them and MSDN the communication of that work hit a snag or six.

But I am being given the rules and will be able to blog about them this weekend.... :-)

I agree that the RC and RichEdit stuff needs a high priority. Luckily the folks who have to do work here are all on different teams....

"I agree that the RC and RichEdit stuff needs a high priority.

Luckily the folks who have to do work here are all on different teams...."

In order to fix it, the bug should first be accepted as a bug.

Which it is not, in this case.

You have to have more faith in me, Mihai. :-)

RC.EXE/RCDLL.DLL are actually OS-provided components that VC/VS pick up from the OS. I'm on it....

"RC.EXE/RCDLL.DLL are actually OS-provided components that VC/VS pick up from the OS. I'm on it...."

Good to know.

I just did not have any idea that RCDLL.DLL also saves the files from the resource editor. And I did not imagine that RCDLL.DLL is the one that puts up a warning message dialog if characters are lost. Maybe RCDLL.DLL returns an error and the UI part (Res Editor?) ignores it.

OK, I will wait and I will test the next release :-)

On another note, it would be really, really nice to have some public interface to RCDLL.DLL. It would save so much grief for all the localization/validation tool vendors that have to create proprietary RC parsers (buggy most of the time).

It might alleviate the problem with the Resource Editor not exposing any kind of automation, events, extensibility, like the rest of the IDE.

(although the right thing would still be to become part of the IDE Automation Model)

Mihai,

RC.exe can properly compile .rc files saved as UTF-16LE (strangely not UTF8-with-BOM though), but VS just can't edit them. I found this out while writing a database-to-.rc localization tool.

Remember to take out the "#pragma code_page(..)" bit.

@John: I know, rc can compile utf-16 files since (at least) VS6.

But the bug reported is about the VS 2005 (and Orca) resource editor, which can edit utf-16 files.

Read the bug description.

Perhaps it is just me, but I rate the bug that Mihai is referring to as being a much lower priority given that for most developers the localization process happens outside of Visual Studio.

In the scheme of things, I give UTF-8 support with rc.exe/rcdll.dll a *much* higher priority since it is much more likely to impact real users.

The "not automatically noticing the file format is insufficient" is a minor issue, and the "choice of default file format" is also much less crucial.

Just my opinion, of course....

About priority: we have workarounds for both of them.

It depends how you look at it.

Not supporting UTF-8 is just ugly. Saving as ANSI a file that contains text outside ANSI (without a warning) results in data loss. Notepad does better.

Personally I don't really care if a file is UTF-16 or UTF-8, as long as my data does not get corrupted and the final application works (I have a much worse bug filed against DLGINIT).

And providing workarounds for VS bugs keeps my web site interesting ;-)

Yes, but using VS as the localization TOOL is much less common. So yes it can corrupt, but no it is not corrupting in the most common scenarios....
