Everyone seems averse to the BOM these days; Should we blame TSA? :-)

by Michael S. Kaplan, published on 2008/05/19 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/05/19/8518545.aspx


Sometimes things work by accident.

You know -- no one ever planned for it to work, no one tested it to make sure it kept working. Of course it was never documented; the people behind the scenes may not have even known the feature ever worked the way some people were using it.

It can go on for a long time, that kind of unintentional enablement of a scenario.

Of course, people relying on the undocumented, or more importantly the unintended, have one thing they have to worry about.

They are living on borrowed time.

Because one day a security problem might be detected that this feature is based on. Or one day an actual intended feature may break compatibility with this feature that never made it on the official, documented, supported, and intended radar.

This blog is about such an "unintended feature".

The other day, in response to my blog from years ago, "The new compiler error C4819" (which discussed a new compiler error intended to flush out bugs with inappropriately encoded source files, a longtime problem when developers moved projects between different system locales), Vladislav Vaintroub commented:

Too much i18n does not seem good for the compiler.

Michael, with all respect I cannot share your view on this "incredibly cool" feature. I think it is incredibly uncool.

The bad thing about this warning, which can result in an error like the one here:

http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=341454

is that C strings, the null-terminated arrays of bytes, do not have any encoding information per se, i.e. they are supposed to be treated as opaque arrays of bytes. Now, I have a perfectly valid C file, containing ASCII only, except for UTF-8 bytes in strings (UTF-8 for a good reason: I intend to edit this file in a UTF-8 editor). And such a file will now break with an incomprehensible message on Whidbey on Japanese Windows.

The connect bug is now resolved with Won't Fix, so I can not even hope that this will be fixed with the next version of the compiler.

Alternatives for me?
1) Documentation and support say: add a BOM to the file.
          No way, then it will break on older compilers and on non-Microsoft compilers.
2) #pragma setlocale?
          Does not work.
3) Convert strings to their hex-byte-array form, something like char foo[]={0xba,0xad,0xf0,0x0d,0x00}?
          Will work, will look ugly, and I'll have to forget about editing this file in my wonderful UTF-8-capable editor, the VS2005 IDE.

Or forget about getting this file compiled on Japanese Windows. It is not important *for me* anyway.  This compiler works quite well on latin1 territories:)
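For what it's worth, option 3 does not have to be a brace-initialized array: hex escapes inside an ordinary string literal keep the source file pure ASCII while still producing the UTF-8 bytes. Here is a minimal sketch, assuming the string in question is "café"; the names kCafe and cafe_byte_length are made up for illustration and are not from the comment.

```c
#include <string.h>

/* "café" spelled with hex escapes: the source stays 7-bit ASCII, so the
   compiler never has to guess an encoding for this file. U+00E9 (é) is
   the two bytes 0xC3 0xA9 in UTF-8. The name kCafe is hypothetical. */
static const char kCafe[] = "caf" "\xC3\xA9";

/* Number of bytes (not characters) in the encoded string. */
size_t cafe_byte_length(void)
{
    return strlen(kCafe);
}
```

One thing to watch for: a hex escape swallows every hex digit that follows it, so text after the escape is safest in its own adjacent literal (for example "\xC3\xA9" "bad" rather than "\xC3\xA9bad"). It is still ugly, which was rather the commenter's point.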

First of all, note that the old situation didn't always work, and there was a bug to fix here to make sure that a known issue would no longer be an issue.

And remember that UTF-8 without BOM support was never promised or intended -- it was only working when the default system locale assumption didn't step on the sequences used for UTF-8.
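To make "step on" concrete, here is a rough sketch of what happens when UTF-8 bytes are read under a double-byte code page. It assumes the usual Shift-JIS style layout of code page 932 (lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC). The function names are hypothetical, and this is not a claim about how the compiler itself scans a file.

```c
#include <stddef.h>

/* Lead and trail byte ranges assumed for code page 932 (Shift-JIS style). */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

static int is_trail_byte(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

/* Walk the buffer as double-byte text. Returns the offset of the first
   lead byte that is not followed by a valid trail byte, or -1 if every
   sequence is well formed. Bytes that are not lead bytes are leniently
   treated as single-byte characters. */
long first_bad_dbcs_offset(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        if (is_lead_byte(s[i])) {
            if (i + 1 >= n || !is_trail_byte(s[i + 1]))
                return (long)i;
            i += 2;   /* consume lead + trail */
        } else {
            i += 1;   /* single-byte character */
        }
    }
    return -1;
}
```

Under those assumptions, the UTF-8 bytes for "é" (C3 A9) happen to fall in the single-byte range and pass through untouched, while the UTF-8 bytes for "あ" (E3 81 82) leave 0x82 as a dangling lead byte that tries to pair with the closing quote of the string literal. Same file format, different luck.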

But the "move a project to a new machine and watch the project have problems in compiling, linking, or in the final application" problem? THAT problem was on the radar.

And then they fixed it -- with a level 1 compiler warning C4819.

I have talked about this one in several blogs, e.g. here, here, here, and here.

Now I happen to agree with the comments that Jonathan Caves from the VC++ Compiler Team made in that "Compile error with source file containing UTF8 strings (in CJK system locale)" VS Feedback submission:

our suggestion for fixing this issue would be to use a BOM - this unambiguously lets the compiler know the encoding of the file - without this the compiler needs to revert to guess work.

you are correct the BOM is not part of the C++ Standard - but if you want non-ASCII characters then the "official" and portable way to get them is to use the \u (or \U) hex encoding (which is, I agree, just plain ugly and error prone).

When the compiler is faced with a source file that does not have a BOM, it reads ahead a certain distance into the file to see if it can detect any Unicode characters - it specifically looks for UTF-16 and UTF-16BE - if it doesn't find either then it assumes that it has MBCS. I suspect that in this case it falls back to MBCS and this is what is causing the problem.

Being explicit is really best and so while I know it is not a perfect solution I would suggest using the BOM.
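The "being explicit" part is cheap to check for. As a minimal sketch (the enum and function names are hypothetical, and this is not the compiler's actual logic), detecting a BOM only means comparing the first two or three bytes of the file against the encoded forms of U+FEFF:

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first bytes of a file for an encoded U+FEFF. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;        /* EF BB BF */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;    /* FF FE */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;    /* FE FF */
    return BOM_NONE;            /* no signature: the reader has to guess */
}
```

The BOM_NONE branch is where the guesswork described in the quote begins; the sketch deliberately stops there. It also ignores UTF-32, whose little-endian signature starts with the same FF FE bytes.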

It is funny, it was just days ago that I blogged about another issue with a missing BOM (in My Spidey senses blame the rogue text editor, about .KLC source files in MSKLC).

Maybe it is something in the water.

Or a generic fear of BOM instilled in people by the Transportation Security Administration (TSA) in the United States, which leads to this kind of misunderstanding....

 

This post brought to you by U+feff (aka ZERO WIDTH NO-BREAK SPACE)


Andrew Cook on 19 May 2008 12:11 PM:

#pragma warning (disable : 4819)

as line 1 in your code files? Does that do the trick?

http://msdn.microsoft.com/en-us/library/2c8f766e(VS.71).aspx

Tyler on 19 May 2008 2:30 PM:

So, I usually run my system with Japanese locale.

Oddly, this tends to make me unable to build some projects, because someone somewhere used an editor that converted "..." to a triple '.' character (in a comment).

It's a pain in the posterior to walk away from a compile, come back a while later, and see you stopped about a quarter of the way through because some yahoo forgot to turn off autocorrect, and never set a BOM.

The experience is not that uncommon, either. Anyhow, that's my 'How I Learned to Stop Worrying and Love the BOM' story. Either people need to add a BOM to all source files, or someone needs to specify the standard encoding for this stuff. (Which I feel ought to be either UTF-8 or UCS-32).
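A stray autocorrected character like that is easy to hunt down before the build starts. A minimal sketch, assuming the file has already been read into memory (the function name is hypothetical): report the offset of the first byte outside 7-bit ASCII, which is where a BOM-less file stops being safe across system locales.

```c
#include <stddef.h>

/* Returns the offset of the first byte >= 0x80, or -1 if the whole
   buffer is plain 7-bit ASCII. */
long first_non_ascii_offset(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80)
            return (long)i;
    }
    return -1;
}
```

Run over a comment such as "// " followed by a UTF-8 ellipsis (E2 80 A6), this reports offset 3, which would point straight at the character the editor slipped in.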

Michael S. Kaplan on 19 May 2008 3:07 PM:

UCS-32? No such beast, Tyler -- there is UTF-32 and UCS-4.

But not allowing UTF-16 here would be a huge mistake given the many platforms that have this as their default encoding, including Windows. It is doubtful that any standard that did not account for that would be very widely adopted....

Though I am firmly in favor of disabling Word Autocorrect with extreme prejudice! :-)

Tyler on 19 May 2008 3:53 PM:

Bah.  One of the 4 byte encodings, anyway. :)

vonsrdmn on 19 May 2008 5:20 PM:

It is quite a bit more complicated, but it seems that the IDE itself could manage this. Since the standard itself calls for using the \u escape sequence to store unicode strings (AFAIK and per the original post) in source files, it seems that the VS IDE could parse the source file when opening/closing and render on the screen the correct characters, but actually store the string in the \u notation.

This way you get an easy-to-use UI, and you get a file that conforms to the standard. Obviously this is a lot more work to implement, though...

Michael S. Kaplan on 19 May 2008 5:38 PM:

It would also break the expectations of the person asking the original question -- someone doing cross-platform compiles with their source in an encoding (UTF-8) that does *not* \u encode so that they can see their strings as strings in their editor, which chokes on a BOM.

Parsing the entire file to do UTF-8 detection does add a potentially huge performance burden on large source files, so I am glad that approach is not what currently happens. :-)
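For a sense of what "parsing the entire file" would involve, here is a sketch of a strict UTF-8 well-formedness check. The function name is hypothetical and nothing here is taken from the compiler; it is simply one pass over every byte, which is the cost in question on a large source file.

```c
#include <stddef.h>

/* Returns 1 if the buffer is well-formed UTF-8, 0 otherwise.
   Rejects stray continuation bytes, truncated sequences, overlong
   forms, surrogates, and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest value allowed at this length */
        size_t k;

        if (b < 0x80) { i++; continue; }   /* ASCII */
        if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                     /* not a valid lead byte */

        if (n - i <= need) return 0;       /* sequence runs off the end */
        for (k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

Note that a file can pass this check and still have been written in some other encoding, so even a full scan is a heuristic rather than proof, which is another argument for the explicit signature.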

Ambarish Sridharanarayanan on 20 May 2008 2:50 AM:

"Parsing the entire file to do UTF-8 detection does add a potentially huge performance burden on large source files, so I am glad that approach is not what currently happens. :-)"

This should not be necessary. The real issue is that the default (if the compiler doesn't detect UTF-16) is MBCS, when it ought to be UTF-8.

Michael S. Kaplan on 20 May 2008 2:56 AM:

Since there is more than a decade of MBCS as the default, a UTF-8 default is much more likely to break people than the default that is there now. I'm a fan of Unicode, but I am an even bigger fan of not breaking something that has worked a specific way for more years than Unicode (and definitely more than UTF-8) even really existed....

Mike Dimmick on 20 May 2008 5:51 AM:

This is one of the areas that the C++ standard leaves 'implementation-defined'. They do not define the character set of the source files. As far as the compiler is concerned, it's a 'translation unit' and how the source text appears at the compiler front-end is up to the vendor.

Traditionally when people were still working with 7-bit ISO-646 variants, C users in non-US locales had to write their code using whatever characters matched the code point in ISO-646-US (ASCII). So in Denmark you'd have to write Æ where you needed a [, Ø for a \, Å for ], æ for a {, ø for a | and å for }. So your Hello World program starts looking like:

int main()
æ
   printf( "Hello World!Øn" );
   return 0;
å

Not exactly intuitive. So the C standard (in its 1995 amendment) defined iso646.h, which defines macros for a number of problematic operators (&&, &=, &, |, ~, !, !=, ||, |=, ^, ^=). Separately, trigraphs - sequences of three characters beginning ?? - were added as a workaround. Finally, digraphs were also added as alternate ways of spelling #, ##, [, ], { and }.
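As a small illustration of those workarounds, here is a sketch that avoids brackets, braces, and the problematic operator characters by using the iso646.h macros and digraphs. The function names are hypothetical. Trigraphs are left out on purpose, since some compilers only honor them behind a switch.

```c
#include <iso646.h>   /* and, or, not, not_eq, ... as macros */

/* Digraphs: <% and %> stand for { and }, <: and :> stand for [ and ]. */
int count_nonzero(const int values<::>, int n)
<%
    int count = 0;
    int i;
    for (i = 0; i < n; i++)
        if (values<:i:> not_eq 0)
            count++;
    return count;
%>

/* "and" expands to && via iso646.h. */
int in_range(int x, int lo, int hi)
<%
    return x >= lo and x <= hi;
%>
```

It compiles as ordinary C, and the spelling is about as pleasant to read as the Danish Hello World, which is probably why these features never caught on.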

Paul Dempsey on 23 May 2008 7:39 PM:

It would be nice if the compiler added a command line switch to specify the encoding of the source file. This would provide a tool to manage the situation without possibly breaking cross-plat/sys/compiler compatibility by changing the actual file encoding.

Then you could manage those legacy source files where different strings contained text pasted from a number of _different_ encodings. This worked when built on a non-MBCS system because the compiler would leave the bytes alone, and each string would be used only in the matching locale. (I've seen this long ago - would be done differently today).

It is possible that

 #pragma setlocale

would be of use when faced with this situation. Look it up.

As I recall, this gets very tricky, and it is hard to understand what the #pragma really means. When I last looked into this many years ago the docs had more details, but were still confusing.

Gratifying to see the nice comment about VS as a nice UTF-8 editor. You're welcome :-).

--- P

Michael S. Kaplan on 23 May 2008 10:44 PM:

That would just be the beginning of course; all of the following need to be considered:

And all of that would have to be done *per* file if they are to be command line flags....


referenced by

2009/01/07 Someone please detect if there's a BOM before the plane takes off!
