by Michael S. Kaplan, published on 2009/01/07 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/01/07/9287052.aspx
One can really never get enough of puns about the BOM (Byte Order Mark) and TSA.
And when I say one, I mean I. :-)
Just think back to blogs like "Don't sneak a BOM in on someone who promises to ignore free space" or "Everyone seems averse to the BOM these days; Should we blame TSA? :-)" or "How to get yourself imprisoned [by/for talking about Unicode]".
See what I mean?
I was reminded of this when Pritam asked:
Is there any tool or code available to verify Byte Order Mark signature in XML files?
Of course sniffing out a few bytes is easy enough. Abhinaba provided the full chart of valid BOM values:
Bytes | Encoding Form
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Easy, right?
Okay, anyone want to make a try at writing the minimal code BOM detector?
Think of it as a way to play your part in airport security!
Points awarded for clearest, or for most concise, or for briefest, or for most clever, or for the sake of maintainability, most smart.
If you can write something able to handle other, non-standard byte orderings of data, then you probably went to Cal Tech! :-)
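As a baseline entry for the standard orderings in the chart, here is a minimal table-driven sketch. The names `BomEntry`, `kBoms`, and `detect_bom` are made up for this illustration. The one thing it has to get right is ordering: FF FE is a prefix of FF FE 00 00, so the four-byte UTF-32 marks must be tested before the two-byte UTF-16 ones.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical names for this sketch; not taken from any library.
struct BomEntry {
    const char* name;        // encoding form from the chart
    unsigned char bytes[4];  // BOM bytes, zero-padded to four
    std::size_t length;      // how many of those bytes are significant
};

// Longest marks first, because FF FE is a prefix of FF FE 00 00.
static const BomEntry kBoms[] = {
    {"UTF-32BE", {0x00, 0x00, 0xFE, 0xFF}, 4},
    {"UTF-32LE", {0xFF, 0xFE, 0x00, 0x00}, 4},
    {"UTF-8",    {0xEF, 0xBB, 0xBF, 0x00}, 3},
    {"UTF-16BE", {0xFE, 0xFF, 0x00, 0x00}, 2},
    {"UTF-16LE", {0xFF, 0xFE, 0x00, 0x00}, 2},
};

// Returns the encoding name for a leading BOM, or nullptr if none matches.
const char* detect_bom(const unsigned char* data, std::size_t size) {
    for (const BomEntry& entry : kBoms) {
        if (size >= entry.length &&
            std::memcmp(data, entry.bytes, entry.length) == 0) {
            return entry.name;
        }
    }
    return nullptr;
}
```

One caveat with any detector of this kind: UTF-16LE text whose first character after the BOM is U+0000 is byte-for-byte identical to a UTF-32LE BOM, so the sketch reports UTF-32LE for it.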
This post brought to you by U+feff (aka ZERO WIDTH NO-BREAK SPACE)
# Josh on 7 Jan 2009 1:00 PM:
wait...which plane? BMP? SMP? SIP? TIP? SSP?
Sorry, I have nothing else useful to contribute here, though I'm a little surprised this problem isn't solved already...?!?
# John Cowan on 7 Jan 2009 4:17 PM:
See http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html for a formal English description of what you have to do to play in the Appendix F leagues.
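For a rough idea of what the Appendix F leagues involve when there is no BOM at all, the sketch below matches the first four bytes against the encodings of "<?xm" that XML 1.0 Appendix F lists. The function name and the returned labels are invented for this illustration, and the unusual UCS-4 byte orders are left out. A real sniffer would go on to read the encoding declaration, which this does not do.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: guess the encoding family of BOM-less XML from the
// first four bytes, which should be some encoding of "<?xm" (or of "<").
// Returns nullptr when nothing matches; a caller would then assume UTF-8.
const char* sniff_xml_without_bom(const unsigned char* data, std::size_t size) {
    struct Pattern {
        unsigned char bytes[4];
        const char* family;
    };
    static const Pattern kPatterns[] = {
        {{0x00, 0x00, 0x00, 0x3C}, "UCS-4BE"},
        {{0x3C, 0x00, 0x00, 0x00}, "UCS-4LE"},
        {{0x00, 0x3C, 0x00, 0x3F}, "UTF-16BE"},
        {{0x3C, 0x00, 0x3F, 0x00}, "UTF-16LE"},
        {{0x3C, 0x3F, 0x78, 0x6D}, "ASCII-compatible"},  // UTF-8, ISO 8859-x, ...
        {{0x4C, 0x6F, 0xA7, 0x94}, "EBCDIC"},
    };
    if (size < 4) {
        return nullptr;
    }
    for (const Pattern& pattern : kPatterns) {
        if (std::memcmp(data, pattern.bytes, 4) == 0) {
            return pattern.family;
        }
    }
    return nullptr;
}
```

The "ASCII-compatible" and "EBCDIC" results only narrow things down to a family; the exact encoding still has to come from the declaration itself.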
# ReallyEvilCanine on 8 Jan 2009 12:00 AM:
Mike, you're the only person I know of who pronounces "bee-oh-em" as a word. Cakemakers and codebreakers have every right to say "bomb/bombe" but not we I18Ners.
# Michael S. Kaplan on 8 Jan 2009 12:58 AM:
Dude, you lead a sheltered life. In Unicode and related standards circles, in i18n conversations with developers at Adobe, Apple, IBM, Google, and Microsoft -- it is pronounced as a single word all the time....
# Maurits [MSFT] on 8 Jan 2009 8:58 PM:
Here's my approach:
enum BOM {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16LE,
    BOM_UTF16BE,
    BOM_UTF32BE,
    BOM_UTF32LE,
};

HRESULT BOMFromStream(Byte pbBytes[], UINT cbLength, BOM *pBOM) {
    if (NULL == pbBytes || NULL == pBOM) {
        return E_POINTER;
    }

    // check the four-byte UTF32 BOMs first: the UTF32LE BOM (FF FE 00 00)
    // begins with FF FE and would otherwise be reported as UTF16LE
    if (cbLength >= 4) {
        if (
            0 == pbBytes[0] &&
            0 == pbBytes[1] &&
            0xFE == pbBytes[2] &&
            0xFF == pbBytes[3]
        ) {
            *pBOM = BOM_UTF32BE;
            return S_OK;
        }
        if (
            0xFF == pbBytes[0] &&
            0xFE == pbBytes[1] &&
            0 == pbBytes[2] &&
            0 == pbBytes[3]
        ) {
            *pBOM = BOM_UTF32LE;
            return S_OK;
        }
    }

    // need at least three bytes for UTF8 BOM
    if (
        cbLength >= 3 &&
        0xEF == pbBytes[0] &&
        0xBB == pbBytes[1] &&
        0xBF == pbBytes[2]
    ) {
        *pBOM = BOM_UTF8;
        return S_OK;
    }

    // need at least two bytes for UTF16 BOMs
    if (cbLength >= 2) {
        if (0xFE == pbBytes[0] && 0xFF == pbBytes[1]) {
            *pBOM = BOM_UTF16BE;
            return S_OK;
        }
        if (0xFF == pbBytes[0] && 0xFE == pbBytes[1]) {
            *pBOM = BOM_UTF16LE;
            return S_OK;
        }
    }

    // if we made it this far there's no recognizable BOM
    *pBOM = BOM_NONE;
    return S_OK;
}
Possible future additional features: sanity check UTF16 byte stream length is even, UTF32 is divisible by 4; advance byte stream by length of BOM.
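A minimal sketch of those follow-ups, written as a standalone fragment. `BomKind`, `bom_length`, `code_unit_size`, and `payload_length_is_sane` are hypothetical names for this illustration, not part of the code above. The idea is that the BOM length tells the caller how far to advance the stream, and the remaining byte count should be a whole number of code units.

```cpp
#include <cstddef>

// Hypothetical stand-in for the detection result, so the sketch is self-contained.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE, Utf32BE, Utf32LE };

// Number of bytes the mark occupies, i.e. how far to advance the stream.
std::size_t bom_length(BomKind kind) {
    switch (kind) {
        case BomKind::Utf8:    return 3;
        case BomKind::Utf16LE:
        case BomKind::Utf16BE: return 2;
        case BomKind::Utf32BE:
        case BomKind::Utf32LE: return 4;
        default:               return 0;
    }
}

// Size in bytes of one code unit in the detected encoding form.
std::size_t code_unit_size(BomKind kind) {
    switch (kind) {
        case BomKind::Utf16LE:
        case BomKind::Utf16BE: return 2;
        case BomKind::Utf32BE:
        case BomKind::Utf32LE: return 4;
        default:               return 1;  // UTF-8 or unknown: any length is fine
    }
}

// True if the bytes after the BOM form a whole number of code units.
bool payload_length_is_sane(BomKind kind, std::size_t total_bytes) {
    const std::size_t skip = bom_length(kind);
    if (total_bytes < skip) {
        return false;
    }
    return (total_bytes - skip) % code_unit_size(kind) == 0;
}
```

A caller would skip `bom_length(kind)` bytes before decoding and could treat a failed length check as a hint that the stream is truncated or was detected incorrectly.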
# Peter Ibbotson on 9 Jan 2009 12:56 PM:
I've had a quick stab in C#. I've put a longer version that does Appendix F (also a port to C for byte-counting purposes) on my blog here:
http://www.ibbotson.co.uk/peteri/index.php?/archives/120-Finding-the-BOM.html
public enum Encoding
{
    Unknown = 0, BomBigEndianUcs4, BomUcs4, BomUtf8,
    BomUtf16, BomBigEndianUtf16
}

// We use bit 3 as an end-of-data marker; set means end of a pattern.
// In every BOM byte, bit 5 happens to have the same value as bit 3,
// so the real bit 3 is restored from bit 5 before comparing.
private static byte[] matchData =
{
    0x00,0x00,0xF6,0xFF, // 0 - 00 00 FE FF Bom UCS4 Big endian
    0xF7,0xF6,0x00,0x08, // 4 - FF FE 00 00 Bom UCS4 Little endian
    0xE7,0xB3,0xBF,      // 8 - EF BB BF Bom UTF8
    0xF7,0xFE,           // 11 - FF FE Bom UTF16 Little endian
    0xF6,0xFF            // 13 - FE FF Bom UTF16 Big endian
};

public static Encoding DetectType(byte[] data)
{
    int i = 0;
    int offset = 0;
    Encoding currentEncoding = Encoding.BomBigEndianUcs4;
    while (i < matchData.Length)
    {
        // Rebuild the real byte: clear the marker bit, then copy bit 5 into bit 3.
        byte compare = (byte)((matchData[i] & 0xf7) | ((matchData[i] & 0x20) >> 2));
        if ((offset >= data.Length) || (data[offset] != compare))
        {
            // Mismatch: skip to the end of this pattern and try the next encoding.
            offset = 0;
            while ((matchData[i] & 0x08) == 0) i++;
            currentEncoding++;
        }
        else
        {
            if ((matchData[i] & 0x08) == 0x08) return currentEncoding;
            offset++;
        }
        i++;
    }
    // No pattern matched.
    return Encoding.Unknown;
}
# Khedron on 15 Jan 2009 7:23 PM:
Ok, everyone seems to be just testing for the bytes in order. I realise that I'm posting late, so may have to give up the points race, but here's my version. Auto-calculates the BOM based on endianness and wchar size and compares the char* against that. Except for UTF-8. I gave up on that (it's midnight here & I'm going to bed now).
#include <string>
#include <cstring>

// Same-endian: feff
// Different-endian: fffe
enum endianness {
    be = -1,
    le = 1
};

// Assumes data holds at least sizeof_wchar bytes.
bool compare_bom_string(int sizeof_wchar, endianness end, const char* data)
{
    std::string bom(sizeof_wchar, '\0');
    // FF is the last byte for big-endian and the first for little-endian
    int pos = (end == be ? sizeof_wchar - 1 : 0);
    bom[pos] = '\xFF';
    bom[pos + end] = '\xFE';
    return !memcmp(data, bom.data(), sizeof_wchar);
}

struct bom {
    bom(int i, endianness e) : sizeof_wchar(i), end(e) { }
    int sizeof_wchar;
    endianness end;
};

// 4 before 2, so that FF FE 00 00 is not taken for a UTF-16 BOM
int wchar_sizes[] = { 4, 2 };
endianness ends[] = { be, le };
const int num_sizes = sizeof wchar_sizes / sizeof wchar_sizes[0];
const int num_ends = sizeof ends / sizeof ends[0];

struct ex {
    ex(const char* m) : msg(m) { }
    const char* what() { return msg; }
    const char* msg;
};

bom sniff(const char* data)
{
    for (int i = 0; i < num_sizes; ++i)
        for (int j = 0; j < num_ends; ++j)
            if (compare_bom_string(wchar_sizes[i], ends[j], data))
                return bom(wchar_sizes[i], ends[j]);
    // Just got lazy
    const char* utf_8 = "\xEF\xBB\xBF";
    if (!memcmp(data, utf_8, 3)) return bom(1, le);
    // Just got lazier
    throw ex("Whoops");
}
# Anonymous on 16 Feb 2009 5:36 AM:
import Data.Maybe (fromJust)
import Data.List (find, isPrefixOf)

-- Longer marks are listed before the shorter marks they start with;
-- the empty prefix at the end always matches, so fromJust cannot fail.
detectBOM s = snd . fromJust $ find ((flip isPrefixOf) s . fst) byteOrderMarks
  where byteOrderMarks = [("\xef\xbb\xbf","UTF-8"),
                          ("\x00\x00\xfe\xff","UTF-32BE"),
                          ("\xff\xfe\x00\x00","UTF-32LE"),
                          ("\xfe\xff","UTF-16BE"),
                          ("\xff\xfe","UTF-16LE"),
                          ("\x2b\x2f\x76\x38","UTF-7"),
                          ("\x2b\x2f\x76\x39","UTF-7"),
                          ("\x2b\x2f\x76\x2b","UTF-7"),
                          ("\x2b\x2f\x76\x2f","UTF-7"),
                          ("\xf7\x64\x4c","UTF-1"),
                          ("\xdd\x73\x66\x73","UTF-EBCDIC"),
                          ("\x0e\xfe\xff","SCSU"),
                          ("\xfb\xee\x28","BOCU-1"),
                          ("\x84\x31\x95\x33","GB18030"),
                          ("","NO BOM")]