by Michael S. Kaplan, published on 2005/02/09 13:03 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/09/369958.aspx
A very common question has to do with the meaning of the suffixes in UTF-16LE and UTF-16BE.
It all comes back to the way processors work. When you look at a single byte (like 0x41) it is easy to say you know what it is. But when you look at two bytes in a row (like 0x41 0x00) as if they were a single 16-bit WORD, you have to decide whether you are looking at the number 0x4100 or the number 0x0041.
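To make that concrete, here is a small sketch in Python (my choice of language for illustration; the variable names are made up) that reads the same two bytes both ways:

```python
# The same two bytes, in the order they sit in memory or in a file.
raw = bytes([0x41, 0x00])

# Little endian: the first byte is the low-order byte of the WORD.
as_little = int.from_bytes(raw, byteorder="little")

# Big endian: the first byte is the high-order byte of the WORD.
as_big = int.from_bytes(raw, byteorder="big")

print(f"little endian: 0x{as_little:04X}")  # 0x0041
print(f"big endian:    0x{as_big:04X}")     # 0x4100
```

Nothing about the bytes themselves changes; only the decision about which one comes "first" in the number does.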
I always found the clearest description came from Bruce McKinney's Hardcore Visual Basic:
Endian refers to the order in which bytes are stored. The term is taken from a story in Gulliver’s Travels by Jonathan Swift about wars fought between those who thought eggs should be cracked on the Big End and those who insisted on the Little End. With chips, as with eggs, it doesn’t really matter as long as you know which end is up.
And indeed, it is pretty crucial to know which end is up. This is especially interesting for UTF-16, which in the end is just an array of WORDs that happen to correspond to characters in Unicode. The difference between U+0041 ("A", a.k.a. LATIN CAPITAL LETTER A) and U+4100 ("䄀", a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune) is quite striking!
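The same mix-up can be reproduced with text. The sketch below assumes Python and its built-in "utf-16-le" and "utf-16-be" codecs; the variable names are only for illustration:

```python
# "A" (U+0041) encoded as UTF-16 in each byte order.
le_bytes = "A".encode("utf-16-le")  # bytes 41 00
be_bytes = "A".encode("utf-16-be")  # bytes 00 41

# Decoding with the matching byte order gives back "A".
right = le_bytes.decode("utf-16-le")

# Decoding little endian bytes as big endian gives U+4100 instead.
wrong = le_bytes.decode("utf-16-be")

print(le_bytes.hex(), be_bytes.hex())
print(right, f"U+{ord(right):04X}")
print(wrong, f"U+{ord(wrong):04X}")
```

Both decodes succeed without any error, which is exactly why it matters to know which end is up.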
On Windows platforms, which are mostly little endian, UTF-16LE is just called "Unicode" and UTF-16BE is just called "Unicode (Big Endian)", which is much less confusing for the majority of people who do not work cross-platform.
(Speaking frankly, this does not bother me much -- anyone smart enough to be annoyed by the terminology is smart enough to know that not everyone is as smart as they are in these matters.)
For more information, a simple web search with the following search string:
"big endian" "little endian"
will return enough results to keep one busy for some time...
This post brought to you by "䄀" (U+4100, a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune)
Stewie on 9 Feb 2005 12:23 PM:
Michael Kaplan on 9 Feb 2005 2:31 PM:
Ryan Myers on 9 Feb 2005 4:31 PM:
Michael Kaplan on 9 Feb 2005 4:35 PM:
aaaa on 17 Jan 2010 7:20 AM:
"Little Endian" means that the low-order byte of the number is stored in memory at the lowest address, and the high-order byte at the highest address. For example, a 4-byte integer
Byte3 Byte2 Byte1 Byte0
will be arranged in memory as follows:
Base Address+0 Byte0
Base Address+1 Byte1
Base Address+2 Byte2
Base Address+3 Byte3
Intel processors (those used in PCs) use "Little Endian" byte order.
"Big Endian" means that the high-order byte of the number is stored in memory at the lowest address, and the low-order byte at the highest address. The same 4-byte integer would be stored as:
Base Address+0 Byte3
Base Address+1 Byte2
Base Address+2 Byte1
Base Address+3 Byte0
Motorola processors (those used in Macs) use "Big Endian" byte order.
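The two layouts above can be checked with a short sketch. It assumes Python's standard struct module, and the sample value 0x03020100 is picked only so that Byte0 through Byte3 are easy to spot:

```python
import struct
import sys

# A 4-byte integer whose bytes are Byte3=0x03, Byte2=0x02, Byte1=0x01, Byte0=0x00.
value = 0x03020100

# "<I" packs an unsigned 32-bit integer little endian, ">I" packs it big endian.
little = struct.pack("<I", value)
big = struct.pack(">I", value)

for offset in range(4):
    print(f"Base Address+{offset}  little: {little[offset]:02X}  big: {big[offset]:02X}")

# The byte order of the machine running this script ("little" or "big").
print("native byte order:", sys.byteorder)
```

The value is the same in both cases; only the order of the bytes in memory differs.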
Michael S. Kaplan on 17 Jan 2010 10:29 AM:
A little late, no? :)
Also note that Macs are moving to Intel rather than Motorola these days...
m bilal javed on 27 Apr 2011 9:36 PM:
Which is greater in number, little or big endian?
Michael S. Kaplan on 28 Apr 2011 1:52 AM:
Neither is "bigger" -- they are the same number, just stored with the bytes in a different order.
Diane on 10 Feb 2012 9:56 AM:
How can I open a file once I have saved it as unicode big endian?
Michael S. Kaplan on 13 Feb 2012 7:36 AM:
Diane, see the answer here.
2012/02/13 "Now that its been saved, how do I open it?"
2005/09/11 unicodeFFFE... is Microsoft off its rocker?
2005/05/24 Encoding scheme, encoding form, or other