Your VC++ files don't support Unicode identifiers? Drop a BOM on them!

by Michael S. Kaplan, published on 2007/06/22 15:10 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/22/3466480.aspx


What with Unicode being the default way that things are compiled in Visual Studio, the fact that identifiers were limited to ANSI has really been a sore point for a lot of people.

Including me. :-)

Thankfully, VC++ Architect Mark Hall helped set me straight on a not entirely well-documented feature in the latest version of Visual C++.

If you don't like the lack of Unicode identifiers, then all you have to do is drop a BOM on your UTF-8 source!

You can read more about it in the topic Unicode Support in the Compiler and Linker.

And you do need the BOM for UTF-8 in this case (no matter how controversial that requirement may be). As one of the cool test leads over there pointed out:

...we don’t support UTF8 w/o BOM, since it’s pretty much indistinguishable from ANSI

Note that you do not need a BOM for UTF-16LE or UTF-16BE.

In theory the BOM probably shouldn't be needed for UTF-8 given how easy UTF-8 detection is, but since NLS didn't provide a function I can hardly blame them for not wanting to go off and write their own (note that it *should* in my opinion be a part of IsTextUnicode, as I have pointed out before, but I don't know when, or even if, that might be happening).

Maybe I'll just post a little IsTextUtf8 function in the meantime. :-)
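
To give a feel for why UTF-8 detection is considered easy, here is one possible sketch of such a check. It is not the function promised above and not anything NLS provides; the name `LooksLikeUtf8` is hypothetical. It simply validates that every byte sequence is well-formed UTF-8, rejecting overlong forms, surrogates, and values above U+10FFFF. Text in an ANSI code page that contains accented characters will rarely pass, though pure ASCII passes trivially, so a caller would still have to decide what to do with an all-ASCII file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the buffer is well-formed UTF-8.
bool LooksLikeUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t trail = 0;     // continuation bytes expected after the lead
        unsigned long cp = 0;      // code point being decoded
        unsigned long min_cp = 0;  // smallest code point allowed at this length

        if (lead < 0x80) { ++i; continue; }  // plain ASCII byte
        else if ((lead & 0xE0) == 0xC0) { trail = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { trail = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { trail = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < trail) return false;  // sequence cut off at end of buffer
        for (std::size_t k = 1; k <= trail; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += trail + 1;
    }
    return true;
}
```

For example, the bytes `63 61 66 C3 A9` ("café" in UTF-8) pass, while `63 61 66 E9` (the same word in Windows-1252) fail because `E9` announces a three-byte sequence that never arrives.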

The other problem would be that all of the important docs that refer to things like identifiers are still making ASCII assumptions in 7.1, 8.0, 8.5, and even current (preliminary) 9.0 docs.

So one should not assume that the full support of Unicode Standard Annex #31 (Identifier and Pattern Syntax) is being implemented, but hopefully some not entirely incompatible subset is what would turn out to be available....

I'll say more on this soon, both as soon as I find out what the answers are and as soon as I get the doc story to be updated (I assume the former will happen prior to the latter).

 

This post is sponsored by U+feff (ZERO WIDTH NO-BREAK SPACE, a.k.a. "Da BOM")


# Mihai on 22 Jun 2007 5:16 PM:

Maybe fixing rc handling would be higher priority?

https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=276954

I mean, resources are created to store international text.

# Stuart Dootson on 22 Jun 2007 5:46 PM:

re: C/C++ identifiers - I think the C and C++ standards probably bear some of the responsibility for ASCII identifiers - don't be *too* hard on the C++ compiler team :-)

# Michael S. Kaplan on 22 Jun 2007 5:50 PM:

Actually, the compiler team did a bunch of work here -- but somewhere between them and MSDN the communication of that work hit a snag or six.

But I am being given the rules and will be able to blog about them this weekend.... :-)

I agree that the RC and RichEdit stuff needs a high priority. Luckily the folks who have to do work here are all on different teams....

# Mihai on 22 Jun 2007 7:52 PM:

"I agree that the RC and RichEdit stuff needs a high priority.

Luckily the folks who have to do work here are all on different teams...."

In order to fix it, the bug should first be accepted as a bug.

Which it is not, in this case.

# Michael S. Kaplan on 22 Jun 2007 7:57 PM:

You have to have more faith in me, Mihai. :-)

RC.EXE/RCDLL.DLL are actually OS-provided components that VC/VS pick up from the OS. I'm on it....

# Mihai on 23 Jun 2007 2:45 PM:

"RC.EXE/RCDLL.DLL are actually OS-provided components that VC/VS pick up from the OS. I'm on it...."

Good to know.

I just did not have any idea that RCDLL.DLL also saves the files from the resource editor. And I did not imagine that RCDLL.DLL is the one that puts up a warning message dialog if characters are lost. Maybe RCDLL.DLL returns an error and the UI part (Res Editor?) ignores it.

OK, I will wait and I will test the next release :-)

# Mihai on 23 Jun 2007 2:50 PM:

On another note, it would be really-really nice to have some public interface to RCDLL.DLL. It would save so much grief for all the localization/validation tool vendors that have to create proprietary RC parsers (most of the time buggy).

It might alleviate the problem with the Resource Editor not exposing any kind of automation, events, extensibility, like the rest of the IDE.

(although the right thing would still be to become part of the IDE Automation Model)

# John Bates on 24 Jun 2007 10:20 PM:

Mihai,

RC.exe can properly compile .rc files saved as UTF-16LE (strangely not UTF8-with-BOM though), but VS just can't edit them. I found this out while writing a database-to-.rc localization tool.

Remember to take out the "#pragma code_page(..)" bit.

# Mihai on 25 Jun 2007 12:35 PM:

@John: I know, rc can compile utf-16 files since (at least) VS6.

But the bug reported is about the VS 2005 (and Orca) resource editor, which can edit utf-16 files.

Read the bug description.

# Michael S. Kaplan on 25 Jun 2007 1:35 PM:

Perhaps it is just me, but I rate the bug that Mihai is referring to as being a much lower priority given that for most developers the localization process happens outside of Visual Studio.

In the scheme of things, I give UTF-8 support with rc.exe/rcdll.dll a *much* higher priority since it is much more likely to impact real users.

The "not automatically noticing the file format is insufficient" is a minor issue, and the "choice of default file format" is also much less crucial.

Just my opinion, of course....

# Mihai on 25 Jun 2007 9:21 PM:

About priority: we have workarounds for both of them.

It depends how you look at it.

Not supporting UTF-8 is just ugly. Saving as ANSI a file that contains text outside ANSI (without a warning) results in data loss. Notepad does better.

Personally I don't really care if a file is UTF-16 or UTF-8, as long as my data does not get corrupted and the final application works (I have a much worse bug filed against DLGINIT).

And providing workarounds for VS bugs keeps my web site interesting ;-)

# Michael S. Kaplan on 25 Jun 2007 9:30 PM:

Yes, but using VS as the localization TOOL is much less common. So yes it can corrupt, but no it is not corrupting in the most common scenarios....

# neenu.p on 22 Jan 2011 4:05 PM:

thank u............................


referenced by

2007/06/26 The MB_PRECOMPOSED flag is stupid, and the MB_COMPOSITE ain't no genius either

2007/06/25 Sometimes you drop the BOM, and sometimes the BOM drops you!
