You may want to rethink your choice of UTF, #1 (if the size matters)

by Michael S. Kaplan, published on 2005/05/20 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/20/420317.aspx


The decision of which UTF (Unicode Transformation Format) to use might be driven by many different factors, so I will be doing a few different posts on this topic, looking at the factors individually.

This first post will talk about the size considerations.

(I am assuming you will always want to use Unicode of some form, but I will include the code page answer as an initial strawman each time.)

Now obviously if everything you want to represent will fit into a legacy code page, then that will always be the smallest answer (one or at most two bytes for all ACPs). There are other good reasons to not like this idea, which will be covered further in a future post in this series.

The first Unicode Transformation Format I will talk about is UTF-32. It has the advantage of always being a single fixed size for every code point in Unicode, from U+0000 to U+10FFFF. The cost is that for a standard that never uses more than 21 bits of information, every single character uses 32 bits. There will be other good scenarios for considering UTF-32 that I will cover on other days, but today, when the conversation is purely on the merits of size, UTF-32 will often not be the choice.
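As a quick illustration (my sketch, not part of the original post), Python's built-in codecs make the fixed cost of UTF-32 easy to see -- every code point takes 4 bytes, whether it is ASCII or a supplementary character:

```python
# Every code point costs exactly 4 bytes in UTF-32, regardless of range.
# Using "utf-32-le" (explicit endianness) so no BOM is prepended.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-32-le")
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes")
```

All four lines report 4 bytes, from U+0041 all the way up to U+1F600.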

The second one I will talk about is UTF-16. It has the advantage of usually being a single fixed size while also managing to take up less space per character (16 bits) than the full 21 bits needed for all codepoints. If supplementary characters are used, then they will need 32 bits (for the surrogate pair, each code unit of which takes up 16 bits), so while UTF-16 is no worse than UTF-32 there, it is also no better for text containing an extensive number of these supplementary characters.
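The same kind of check (again my sketch, assuming Python's codecs) shows the UTF-16 split: 2 bytes for anything in the BMP, 4 bytes (a surrogate pair) for anything supplementary:

```python
# BMP code points cost 2 bytes in UTF-16; supplementary code points
# cost 4 bytes, encoded as a high/low surrogate pair.
for ch in ["A", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-16-le")  # explicit endianness, no BOM
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes")
```

U+0041 and U+4E2D come out at 2 bytes each; U+1F600 costs 4.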

The third one I will talk about is UTF-8. It is a true multibyte form, which takes up space as follows:

    U+0000  to U+007F     1 byte
    U+0080  to U+07FF     2 bytes
    U+0800  to U+FFFF     3 bytes
    U+10000 to U+10FFFF   4 bytes

If you look at the Unicode Blocks, you can see what each of these ranges will cover, and how expensive different languages will be from a size perspective.

Clearly, most East Asian, Southeast Asian, South Asian, and Historic scripts will be more costly from a size perspective, being at best 3 bytes and at worst 4 bytes in size. Again there are times that size may be an issue, but there is never a time that the varying size across so many different ranges is an advantage. There are times that UTF-8 will be best, but beyond ASCII that will never be based purely on size.
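To make those range costs concrete, here is a small check (my sketch, using Python's codecs) with one sample code point from each UTF-8 size class:

```python
# One sample code point from each UTF-8 byte-length range.
samples = [
    ("U+0041 LATIN CAPITAL LETTER A", "A"),          # 1 byte
    ("U+00E9 LATIN SMALL LETTER E WITH ACUTE", "\u00e9"),  # 2 bytes
    ("U+0920 DEVANAGARI LETTER TTHA", "\u0920"),     # 3 bytes
    ("U+1F600 GRINNING FACE", "\U0001F600"),         # 4 bytes
]
for name, ch in samples:
    print(f"{name}: {len(ch.encode('utf-8'))} bytes")
```

Note how the Devanagari letter already lands in the 3-byte class, which is exactly the size penalty described above for most South Asian and East Asian scripts.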

The last form I will talk about is CESU-8, a form defined for people who wanted binary compatibility with UTF-16 while keeping all the other characteristics of UTF-8. Its space is taken up as follows, due to the fact that supplementary characters are encoded as surrogate pairs:

    U+0000  to U+007F     1 byte
    U+0080  to U+07FF     2 bytes
    U+0800  to U+FFFF     3 bytes
    U+10000 to U+10FFFF   6 bytes

This form will obviously never win any battle based on size. There may be times that it is a good choice, but none that this post will recognize -- check us out another day. :-)
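To show where that 6-byte worst case comes from, here is a minimal encoder sketch (the function name `cesu8_encode` is mine, and this is an illustration, not a production codec): CESU-8 first splits a supplementary code point into its UTF-16 surrogate pair, then encodes each 16-bit surrogate as a 3-byte UTF-8-style sequence:

```python
def cesu8_encode(s: str) -> bytes:
    """Sketch of CESU-8 encoding. BMP code points are encoded as in
    UTF-8; supplementary code points become a surrogate pair, with
    each surrogate encoded in the 3-byte pattern, for 6 bytes total."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")  # BMP: same as UTF-8 (1-3 bytes)
        else:
            # Split into UTF-16 surrogates, as UTF-16 would store it.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surr in (high, low):
                # 3-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
                out.append(0xE0 | (surr >> 12))
                out.append(0x80 | ((surr >> 6) & 0x3F))
                out.append(0x80 | (surr & 0x3F))
    return bytes(out)

smiley = "\U0001F600"
print(len(smiley.encode("utf-8")))   # 4 bytes in UTF-8
print(len(cesu8_encode(smiley)))     # 6 bytes in CESU-8
```

So a supplementary character costs 6 bytes instead of UTF-8's 4, which is why CESU-8 can never win on size.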

In future posts, other considerations such as platform, endian issues, string processing, compression capabilities, and more will be covered. If you could limit comments to the size issue and hold off on the other issues until the relevant posts come up, it would save me having to delete comments until they are relevant, and I would appreciate it. :-)

 

This post brought to you by "ठ" (U+0920, a.k.a. DEVANAGARI LETTER TTHA)
(A character that is very sensitive about its weight and prefers the sleek 2-byte size provided by UTF-16 over the 3- and 4-byte sizes provided by UTF-8 and UTF-32, respectively)


# Rosyna on 20 May 2005 12:29 AM:

Kind of confused by your size ratings on UTF-8. Some odd Hangul characters require around 16 or so bytes to be decomposed into UTF-8.

# Michael S. Kaplan on 20 May 2005 12:48 AM:

I am referring to code points only -- not characters or text elements or grapheme clusters, etc.

# Rosyna on 20 May 2005 10:49 AM:

Ah, wasn't sure because you kept using the word "character".

"The cost is that for a standard that never uses more than 21 bits of information, every single character uses 32 bits."

And sometimes you seem to use codepoint and character almost interchangeably:

"It has the advantage of usually being a single fixed size while also managing to take up less space per character (16 bits) than the full 21 bits needed for all codepoints."

# Michael S. Kaplan on 20 May 2005 11:32 AM:

Yes, this post suffers from treating the two as being the same always, kind of like an API level point of view where a cch (count of characters) is really a ccp (count of code points).

Sloppy, but it kind of makes sense.... :-)

referenced by

2010/11/24 UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16

2007/07/26 No UTF-8 in a VARCHAR column

2005/05/25 You may want to rethink your choice of UTF, #3 (Platform?)

2005/05/24 Encoding scheme, encoding form, or other

2005/05/22 You may want to rethink your choice of UTF, #2 (Speed of operations)
