Encoding scheme, encoding form, or other

by Michael S. Kaplan, published on 2005/05/24 03:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/24/421343.aspx


No one ever accused the Universal Character Set of being simple.

Just short of 100,000 characters, many different scripts and languages, all sorts of complex scripts.

Unicode is downright hard, sometimes.

If you asked me, that is the biggest reason for Microsoft to just call it Unicode, rather than Unicode Transformation Format, 16-bit, Little Endian: it keeps things a bit simpler for people who do not have to care about that level of detail.

And now I am going to ruin all that for a bit.

Who am I? The party pooper! :-)

Now first of all, there are three different Unicode forms: UTF-8, UTF-16, and UTF-32. Those are the only recognized legal Unicode forms. No matter how many times UTF-9 is published as an RFC, its April 1st publish date will always give it away. I have been comparing various issues between the Unicode forms like the size and the speed over the past few days. The Unicode forms are actually good descriptions of the way Unicode is represented.
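To make the three forms a little more concrete, here is a small Python sketch (not part of the original post) that counts how many code units a string occupies in each form. The helper name `code_units` is made up for illustration, and the explicit little-endian codecs are used only so that Python does not prepend a byte order mark.

```python
def code_units(text: str) -> dict[str, int]:
    """Count the code units `text` needs in each Unicode encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# An ASCII letter, a BMP character, and a supplementary character.
for sample in ("A", "\u06e9", "\U00010400"):
    print(f"U+{ord(sample):04X}", code_units(sample))
```

The supplementary character is the interesting one: it is a single code unit in UTF-32, a surrogate pair in UTF-16, and four code units in UTF-8.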

The next level takes us into the way that those forms are actually stored on disk or in memory if you are literally looking at byte entries. These are known as the Unicode schemes and there are five of them:

Now I have talked about the whole Endian thing in the past, and there will be one of those Jeff Foxworthy-esque "You may want to rethink your choice of UTF" posts soon that talks about Endian issues, RSN (real soon now).

For now, you can just realize that Unicode is legislating that a USHORT and a UINT have different byte orders on different platforms, similar to the way that a government could legislate gravity if they wanted to -- they are recognizing what platforms do and just trying to formally describe them. :-)
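As a rough illustration of what the byte order difference looks like, here is a Python sketch that serializes one character with several codecs. It assumes the schemes of interest are UTF-8 plus the big-endian and little-endian flavors of UTF-16 and UTF-32; the codec names are Python's, and `SCHEMES` and `serialize` are names invented for this example.

```python
import struct

# Python codec names, assumed here to stand in for the encoding schemes.
SCHEMES = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def serialize(ch: str) -> dict[str, str]:
    """Return the bytes of `ch` under each scheme as space-separated hex."""
    return {name: ch.encode(name).hex(" ") for name in SCHEMES}


for name, octets in serialize("\u06e9").items():
    print(f"{name:10} {octets}")

# A 16-bit unsigned integer packed in each byte order gives the same bytes
# as the corresponding UTF-16 scheme for a single BMP character.
assert struct.pack("<H", 0x06E9) == "\u06e9".encode("utf-16-le")
assert struct.pack(">H", 0x06E9) == "\u06e9".encode("utf-16-be")
```

The code unit value is the same in every case (0x06E9); only the order in which its bytes land on disk or in memory changes.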

Now there is also CESU-8, which I briefly discussed when I was talking about size and speed a few days ago. It is not an encoding form and it is not, strictly speaking, an encoding scheme in the formal recognized sense like the five entries above. Although to make life more confusing for everyone, the full name of CESU-8 is Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). Its status is defined in the summary of the Technical Report:

This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange. The Unicode Consortium, does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical report to clearly define the format and to distinguish it from UTF-8. This encoding does not replace or amend the definition of UTF-8.

So far it has not done well in contrast to the official Unicode forms, and that will likely not improve in future posts. Sorry, but those are the breaks....
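For anyone who wants to see what "similar to UTF-8 but preserves UTF-16 binary collation" means in practice, here is a minimal Python sketch. Python has no built-in CESU-8 codec, so `cesu8_encode` is a hypothetical helper written for this example: it takes each UTF-16 code unit and encodes it with the UTF-8 bit pattern, surrogates included.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode every UTF-16 code unit of `text` with the UTF-8 bit pattern."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form for a surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U00010400"
print(supplementary.encode("utf-8").hex(" "))  # f0 90 90 80
print(cesu8_encode(supplementary).hex(" "))    # ed a0 81 ed b0 80

# Binary sort order: CESU-8 agrees with UTF-16, UTF-8 does not.
pair = ["\ufffd", "\U00010400"]
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))
by_cesu8 = sorted(pair, key=cesu8_encode)
print(by_utf16 == by_cesu8, by_utf8 == by_utf16)  # True False
```

A supplementary character costs six bytes in this encoding instead of the four that UTF-8 needs, which is the price paid for matching the UTF-16 sort order.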

Now there is also UTF-EBCDIC, which, to once again help confuse all, is described in the summary as an "EBCDIC Friendly Unicode (or UCS) Transformation Format" (confusing because it is yet another UTF that in this case is called neither a form nor a scheme!). Luckily the scope section defines where it ought to be used: "Neither UTF-EBCDIC nor its intermediate form called UTF-8-Mod in this technical report, are intended to be used in open interchange environments. It is useful in homogeneous EBCDIC systems and networks". Which kind of says it all.

Now both CESU-8 and UTF-EBCDIC should probably have been Unicode Technical Notes, and some people have even pointed that out. You can tell people "my dear boy..." when you explain to them that UTNs did not formally exist when these two proposals were either approved or on track to be approved. Maybe next time?

In any case, Microsoft does not support either of these formats, as they are not intended for interchange, and none of its internal processes use them. Nothing personal, they are just really not MS formats.

 

This post is brought to you by "۩" (U+06e9, a.k.a. ARABIC PLACE OF SAJDAH)


# Ruben on 24 May 2005 5:10 PM:

Don't forget about UTF-7! No, wait, I take that back ... DO forget about UTF-7 ;-)

# Michael S. Kaplan on 25 May 2005 8:19 AM:

Ah yes, UTF-7 can go in that same category of transformation formats without portfolio as officially sanctioned form or scheme.

Probably is best to forget about, at this point. Though it is supported by Windows.... :-)
