That non-character can be the BOM at the right kind of party

by Michael S. Kaplan, published on 2008/05/02 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/05/02/8449105.aspx


The question from Vikash was:

Hi,

I am using CreateFile API to generate a Unicode file

I have added the Unicode flags in my sources file and the file that is generated is also in Unicode format except for the first two bytes which do not have 0xFEFF

However if I save the file as Unicode then these two bytes get added to it.

I am creating the file with the following attributes:

          hFile = CreateFile(L"D:\\AddinConfig.Addin",
                             GENERIC_WRITE, 
                             FILE_SHARE_READ|FILE_SHARE_WRITE, 
                             NULL, 
                             CREATE_ALWAYS,
                             FILE_ATTRIBUTE_NORMAL,
                             NULL);

How can I generate a Unicode file with the first two bytes as 0xFEFF?

Thanks in advance,
Vikash

There it is, the good old U+feff, aka ZERO WIDTH NO-BREAK SPACE, aka the BYTE ORDER MARK.

The semantics of file creation in Win32 never add the BOM automatically -- the person writing data out to the file has to add the character on their own.

It is probably a good idea to write it as a WCHAR directly, and not as individual bytes -- that way it is automatically written out in whatever endian-ness the rest of your text is using.

If you have 0xfeff in a USHORT or a WCHAR on Windows (a little endian shop), what actually gets written out is the byte-reversed 0xff 0xfe -- the little end first. Blindly writing the bytes 0xfe 0xff yourself would put a big endian BOM in front of little endian text....
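The difference can be sketched in portable C, so it runs somewhere other than Windows too. The helper names here (bom_as_code_unit, bom_as_big_endian_bytes, host_is_little_endian) are made up for illustration, and uint16_t is only a stand-in for WCHAR, which is 16 bits wide on Windows. The first helper copies the in-memory bytes of the value 0xFEFF, which is what a raw write call such as WriteFile would emit if handed a pointer to a WCHAR; the second hard-codes the bytes in the order the hex digits read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the in-memory bytes of the code unit U+FEFF into out[0..1].
   This mirrors handing the address of a 16-bit value to a raw write call. */
void bom_as_code_unit(unsigned char out[2])
{
    uint16_t bom = 0xFEFF;        /* assumption: stand-in for a Windows WCHAR */
    assert(sizeof bom == 2);      /* the sketch relies on a 16-bit code unit */
    memcpy(out, &bom, sizeof bom);
}

/* Hard-code the bytes in "reading order": this is always a big endian BOM,
   no matter what machine runs the code. */
void bom_as_big_endian_bytes(unsigned char out[2])
{
    out[0] = 0xFE;
    out[1] = 0xFF;
}

/* Returns 1 on a little endian host, 0 on a big endian one. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little endian machine the first helper fills the buffer with 0xff 0xfe and the second with 0xfe 0xff, so only the first matches UTF-16 text written out from WCHAR buffers on that same machine.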

But in the end, the BOM is not always required for a Unicode file. It is fine to add one if you want, and if you have a detection requirement or cross-platform needs then it might make sense. But it is optional.
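If detection is the reason for writing a BOM, the reading side might look something like the sketch below. The function detect_utf16_bom and the enum names are hypothetical, not part of any Windows API. It only looks at the two UTF-16 BOM byte patterns, so a file with no BOM comes back as BOM_NONE and the caller has to decide what to do by other means. It also makes no attempt to tell a UTF-16 little endian BOM apart from the UTF-32 little endian BOM, which starts with the same two bytes.

```c
#include <assert.h>
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first two bytes of a buffer and report which UTF-16 BOM,
   if any, it starts with. Hypothetical helper for illustration only. */
enum bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len < 2)
        return BOM_NONE;                 /* too short to hold a BOM */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;             /* U+FEFF written little end first */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;             /* U+FEFF written big end first */
    return BOM_NONE;                     /* no BOM; encoding is unknown */
}
```

A reader would call this on the first bytes read from the file and skip two bytes whenever the result is not BOM_NONE.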

Now in .NET this is much more a part of the platform: the option to write a BOM is available directly as part of Unicode file creation. Interestingly, the big-endian vs. little-endian issues are much less theoretical there, given the conceptual and potential cross-platformishness of the .NET Framework. The lack of this support in Win32 may yet be a cause for grief for people.

I suspect if I knew less about this stuff I'd get invited to parties that are more fun. :-)

 

This blog brought to you by U+fffe, a non-character.

