What is the difference between Big Endian and Little Endian Unicode?

by Michael S. Kaplan, published on 2005/02/09 13:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/09/369958.aspx


A very common question that comes up has much to do with the meaning of the suffixes in UTF-16LE and UTF-16BE.

It all comes back to the way processors work. When you look at a byte (like 0x41) it is easy to say you know what it is. But when looking at two bytes in a row (like 0x41 0x00) as if it were a single 16-bit WORD you have to decide if you are looking at the number 0x4100 or the number 0x0041.
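
To make that concrete, here is a minimal Python sketch (Python and the variable names are illustrative choices, not part of the original post) that reads the same two bytes as a 16-bit number in each byte order:

```python
# The two bytes from the example above, in the order they sit in memory.
pair = bytes([0x41, 0x00])

# Little endian: the first byte is the low-order byte.
as_little = int.from_bytes(pair, byteorder="little")

# Big endian: the first byte is the high-order byte.
as_big = int.from_bytes(pair, byteorder="big")

print(f"little endian: 0x{as_little:04X}")  # little endian: 0x0041
print(f"big endian:    0x{as_big:04X}")     # big endian:    0x4100
```

The bytes never change; only the decision about which one is the "big end" does.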

I always found the clearest description came from Bruce McKinney's Hardcore Visual Basic:

Endian refers to the order in which bytes are stored. The term is taken from a story in Gulliver’s Travels by Jonathan Swift about wars fought between those who thought eggs should be cracked on the Big End and those who insisted on the Little End. With chips, as with eggs, it doesn’t really matter as long as you know which end is up.

And indeed, it is pretty crucial to know which end is up. This is especially interesting for UTF-16, which in the end is a bunch of arrays of WORDs that happen to correspond to characters in Unicode. The difference between U+0041 ("A", a.k.a. LATIN CAPITAL LETTER A) and U+4100 ("䄀", a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune) is quite striking!
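
The same effect shows up when the bytes are decoded as text. The sketch below assumes Python's built-in "utf-16-le" and "utf-16-be" codecs; the variable names are made up for illustration:

```python
# One 16-bit code unit's worth of bytes: 0x41 followed by 0x00.
raw = b"\x41\x00"

# Decode the identical bytes under each byte order.
as_le = raw.decode("utf-16-le")
as_be = raw.decode("utf-16-be")

print(f"UTF-16LE: U+{ord(as_le):04X}")  # UTF-16LE: U+0041
print(f"UTF-16BE: U+{ord(as_be):04X}")  # UTF-16BE: U+4100
```

Guess the wrong byte order and every "A" in the file turns into U+4100.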

On Windows platforms, which are mostly little endian, UTF-16LE is just called "Unicode" and UTF-16BE is just called "Unicode (Big Endian)", which is much less confusing for the majority of people who do not work cross-platform.

(Speaking frankly, this does not bother me much -- anyone smart enough to be annoyed by the terminology is smart enough to know that not everyone is as smart as they are in these matters)

For more information, a simple web search with the following search string:

"big endian" "little endian"

will return enough results to keep one busy for some time...

 

This post brought to you by "䄀" (U+4100, a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune)


# Stewie on 9 Feb 2005 12:23 PM:

Put me down, you Brobdingnagian blunderbuss!

(before you delete this, this is at least tangentially on topic!)

# Michael Kaplan on 9 Feb 2005 2:31 PM:

I posted it, Stewie. :-)

# Ryan Myers on 9 Feb 2005 4:31 PM:

Byte Order Mark time! Unfortunately, there's absolutely no standardization on when to use it, just the convention that if you don't encounter it you should assume the host endianness. And it's even funnier when you encounter BOMs in mid-text, such as when you've used cat to combine two files produced on machines of different endianness... oh, and if you're transcoding, under what conditions do you prefix, or remove, the BOM?

# Michael Kaplan on 9 Feb 2005 4:35 PM:

Luckily, reports of problems are fairly overblown. :-)

If you concatenate two files then sure you *ought* to remove it, but if you proceed without removing it then all that happens is that an invisible character with zero width is there -- which does not matter.

If they are of different endianness and a tool combines them then that is a bug for the tool -- as you should never combine two such files.
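
As a rough illustration of the BOM points in this exchange, the Python sketch below checks the first two bytes of a buffer for a UTF-16 byte order mark, then shows what a BOM left in mid-text decodes to. The function name sniff_utf16_byte_order is hypothetical, and the check is deliberately simplistic: it does not try to rule out UTF-32, whose little endian BOM starts with the same two bytes.

```python
import codecs
from typing import Optional


def sniff_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None  # no BOM: the caller has to decide what to assume


# Two little endian files concatenated without stripping the second BOM.
first = "A".encode("utf-16-le")
second = codecs.BOM_UTF16_LE + "B".encode("utf-16-le")
joined = (first + second).decode("utf-16-le")

# The leftover BOM survives as the character U+FEFF between A and B.
print([f"U+{ord(ch):04X}" for ch in joined])
```

If the two files had different byte orders, the second half would decode to the wrong characters, which is the case described above as a bug in the combining tool.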

aaaa on 17 Jan 2010 7:20 AM:

"Little Endian" means that the lower-order byte of the number is stored in memory at the lowest address, and the high-order byte at the highest address. For example, a 4 byte Integer

Byte3 Byte2 Byte1 Byte0

will be arranged in memory as follows:

Base Address+0 Byte0
Base Address+1 Byte1
Base Address+2 Byte2
Base Address+3 Byte3

Intel processors (those used in PCs) use "Little Endian" byte order.

"Big Endian" means that the high-order byte of the number is stored in memory at the lowest address, and the low-order byte at the highest address. The same 4 byte integer would be stored as:

Base Address+0 Byte3
Base Address+1 Byte2
Base Address+2 Byte1
Base Address+3 Byte0

Motorola processors (those used in Macs) use "Big Endian" byte order.
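
Both layouts can be reproduced with Python's struct module, as a small sketch. The value 0x03020100 is an arbitrary choice made so that each byte's value matches its ByteN label above:

```python
import struct

# Byte3 = 0x03, Byte2 = 0x02, Byte1 = 0x01, Byte0 = 0x00
value = 0x03020100

# "<I" packs an unsigned 32-bit integer little endian, ">I" big endian.
little = struct.pack("<I", value)
big = struct.pack(">I", value)

for offset in range(4):
    print(f"Base Address+{offset}  little: Byte{little[offset]}  big: Byte{big[offset]}")
```

Reading offsets 0 through 3 gives Byte0..Byte3 for the little endian layout and Byte3..Byte0 for the big endian one.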

Michael S. Kaplan on 17 Jan 2010 10:29 AM:

A little late, no? :)

Also note that Macs are moving to Intel, not Motorola, these days...

m bilal javed on 27 Apr 2011 9:36 PM:

Which is greater in number, little or big endian?

Michael S. Kaplan on 28 Apr 2011 1:52 AM:

Neither is "bigger" -- both represent the same number; they differ only in how it is encoded.

Diane on 10 Feb 2012 9:56 AM:

How can I open a file once I have saved it as unicode big endian?

Michael S. Kaplan on 13 Feb 2012 7:36 AM:

Diane, see the answer here.


referenced by

2012/02/13 "Now that its been saved, how do I open it?"

2005/09/11 unicodeFFFE... is Microsoft off its rocker?

2005/05/24 Encoding scheme, encoding form, or other
