You may want to rethink your choice of UTF, #1 (if the size matters)

by Michael S. Kaplan, published on 2005/05/20 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/20/420317.aspx


The decision of which UTF (Unicode Transformation Format) to use might be driven by many different factors, so I will be doing a few different posts on this topic, looking at the factors individually.

This first post will talk about the size considerations.

(I am assuming you will always want to use Unicode of some form, but I will include the code page answer as an initial strawman each time.)

Now obviously if everything you want to represent will fit into a legacy code page, then that will always be the smallest answer (one or at most two bytes for all ACPs). There are other good reasons to not like this idea, which will be covered further in a future post in this series.

The first Unicode Transformation Format I will talk about is UTF-32. It has the advantage of always being a single fixed size for every code point in Unicode, from U+0000 to U+10FFFF. The cost is that for a standard that never uses more than 21 bits of information, every single character uses 32 bits. There will be other good scenarios for considering UTF-32 that I will cover on other days, but today, when the conversation is purely on the merits of size, UTF-32 will often not be the choice.
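As a quick illustration (my sketch, not part of the original post), Python's built-in codecs make the fixed cost of UTF-32 easy to see -- every code point takes 4 bytes, whether it is ASCII or a supplementary character:

```python
# Every code point costs exactly 4 bytes in UTF-32, regardless of range.
# Using "utf-32-le" (explicit endianness) so no BOM is prepended.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-32-le")
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes")
```

All four lines report 4 bytes, from U+0041 all the way up to U+1F600.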

The second one I will talk about is UTF-16. It has the advantage of usually being a single fixed size while also managing to take up less space per character (16 bits) than the full 21 bits needed for all codepoints. If supplementary characters are used, then they will need 32 bits (for the surrogate pair, each code unit of which takes up 16 bits), so while UTF-16 is no worse than UTF-32 there, it is also no better for text containing an extensive number of these supplementary characters.
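The same kind of check (again my sketch, assuming Python's codecs) shows the UTF-16 split: 2 bytes for anything in the BMP, 4 bytes (a surrogate pair) for anything supplementary:

```python
# BMP code points cost 2 bytes in UTF-16; supplementary code points
# cost 4 bytes, encoded as a high/low surrogate pair.
for ch in ["A", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-16-le")  # explicit endianness, no BOM
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes")
```

U+0041 and U+4E2D come out at 2 bytes each; U+1F600 costs 4.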

The third one I will talk about is UTF-8. It is a true multibyte form, which takes up space as follows:

    U+0000  to U+007F     1 byte
    U+0080  to U+07FF     2 bytes
    U+0800  to U+FFFF     3 bytes
    U+10000 to U+10FFFF   4 bytes

If you look at the Unicode Blocks, you can see what each of these ranges will cover, and how expensive different languages will be from a size perspective.

Clearly, most East Asian, Southeast Asian, South Asian, and Historic scripts will be more costly from a size perspective, being at best 3 bytes and at worst 4 bytes in size. Again there are times that size may be an issue, but there is never a time that the varying size across so many different ranges is an advantage. There are times that UTF-8 will be best, but beyond ASCII that will never be based purely on size.
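To make those range costs concrete, here is a small check (my sketch, using Python's codecs) with one sample code point from each UTF-8 size class:

```python
# One sample code point from each UTF-8 byte-length range.
samples = [
    ("U+0041 LATIN CAPITAL LETTER A", "A"),          # 1 byte
    ("U+00E9 LATIN SMALL LETTER E WITH ACUTE", "\u00e9"),  # 2 bytes
    ("U+0920 DEVANAGARI LETTER TTHA", "\u0920"),     # 3 bytes
    ("U+1F600 GRINNING FACE", "\U0001F600"),         # 4 bytes
]
for name, ch in samples:
    print(f"{name}: {len(ch.encode('utf-8'))} bytes")
```

Note how the Devanagari letter already lands in the 3-byte class, which is exactly the size penalty described above for most South Asian and East Asian scripts.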

The last form I will talk about is CESU-8, a form defined for people who wanted binary compatibility with UTF-16 while keeping all the other characteristics of UTF-8. Its space is taken up as follows, due to the fact that supplementary characters are encoded as surrogate pairs:

    U+0000  to U+007F     1 byte
    U+0080  to U+07FF     2 bytes
    U+0800  to U+FFFF     3 bytes
    U+10000 to U+10FFFF   6 bytes

This form will obviously never win any battle based on size. There may be times that it is a good choice, but none that this post will recognize -- check us out another day. :-)
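To show where that 6-byte worst case comes from, here is a minimal encoder sketch (the function name `cesu8_encode` is mine, and this is an illustration, not a production codec): CESU-8 first splits a supplementary code point into its UTF-16 surrogate pair, then encodes each 16-bit surrogate as a 3-byte UTF-8-style sequence:

```python
def cesu8_encode(s: str) -> bytes:
    """Sketch of CESU-8 encoding. BMP code points are encoded as in
    UTF-8; supplementary code points become a surrogate pair, with
    each surrogate encoded in the 3-byte pattern, for 6 bytes total."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")  # BMP: same as UTF-8 (1-3 bytes)
        else:
            # Split into UTF-16 surrogates, as UTF-16 would store it.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surr in (high, low):
                # 3-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
                out.append(0xE0 | (surr >> 12))
                out.append(0x80 | ((surr >> 6) & 0x3F))
                out.append(0x80 | (surr & 0x3F))
    return bytes(out)

smiley = "\U0001F600"
print(len(smiley.encode("utf-8")))   # 4 bytes in UTF-8
print(len(cesu8_encode(smiley)))     # 6 bytes in CESU-8
```

So a supplementary character costs 6 bytes instead of UTF-8's 4, which is why CESU-8 can never win on size.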

In future posts, other considerations such as platform, endian issues, string processing, compression capabilities, and more will be covered. If you could limit comments to the size issue and hold off on the other issues until the relevant posts come up, it would save me having to delete comments until they are relevant, and I would appreciate it. :-)

 

This post brought to you by "ठ" (U+0920, a.k.a. DEVANAGARI LETTER TTHA)
(A character that is very sensitive about its weight and prefers the sleek 2-byte size provided by UTF-16 over the 3- and 4-byte sizes provided by UTF-8 and UTF-32, respectively)


# Rosyna on 20 May 2005 12:29 AM:

Kind of confused by your size ratings on UTF-8. Some odd Hangul characters require around 16 or so bytes to be decomposed into UTF-8.

# Michael S. Kaplan on 20 May 2005 12:48 AM:

I am referring to code points only -- not characters or text elements or grapheme clusters, etc.

# Rosyna on 20 May 2005 10:49 AM:

Ah, wasn't sure because you kept using the word "character".

"The cost is that for a standard that never uses more than 21 bits of information, every single character uses 32 bits."

And sometimes you seem to use codepoint and character almost interchangeably:

"It has the advantage of usually being a single fixed size while also managing to take up less space per character (16 bits) than the full 21 bits needed for all codepoints."

# Michael S. Kaplan on 20 May 2005 11:32 AM:

Yes, this post suffers from treating the two as being the same always, kind of like an API level point of view where a cch (count of characters) is really a ccp (count of code points).

Sloppy, but it kind of makes sense.... :-)

referenced by

2010/11/24 UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16

2007/07/26 No UTF-8 in a VARCHAR column

2005/05/25 You may want to rethink your choice of UTF, #3 (Platform?)

2005/05/24 Encoding scheme, encoding form, or other

2005/05/22 You may want to rethink your choice of UTF, #2 (Speed of operations)
