You may want to rethink your choice of UTF, #2 (Speed of operations)

by Michael S. Kaplan, published on 2005/05/22 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/22/420822.aspx


In the first post in this series, I talked about size considerations, but even then I hinted that there is more to the decision than just size. I mean, I am not saying that size does not matter, but there are other factors that can matter more....

In this second post, I am going to talk about the speed of operations.

(As before, I am assuming you will always want to use Unicode of some form, but I will include the code page answer as an initial strawman each time.)

Starting with our strawman of the legacy code page, if everything you need to do will fit in one single-byte code page, then often it will be the winner here. Unluckily for the code pages, there is no language for which that solution is good enough, except perhaps English. This is a point covered many times in this blog, in posts such as Code pages are really not enough....

Of course if you need to work with a multibyte code page then this solution is clearly a big loser -- the world of lead bytes, trail bytes, IsDBCSLeadByte API calls, and more is a truly unhappy one, in many senses (I lived in that world for a time, and while I would not think of it as hell per se, in many senses it will live on in my memory as utter pandemonium). And this is even before considering the fact that every single MBCS code page is officially insufficient to cover all of the languages.

I'll split out GB-18030 from this mix, since clearly it is sufficient. However, it has the same problem with lead bytes and trail bytes, to which you can add the logic for the four-byte sequences. I might wish such a processing effort on a developer I did not like very much, but only if I were in a really bad mood.
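To give a feel for that extra processing effort, here is a minimal sketch (the helper name is mine, not anything from the standard or the SDK) of just the first step a GB-18030 processor has to take: classifying how long a sequence is from its first two bytes, before it can even begin converting anything.

```c
#include <assert.h>

/* A sketch of GB-18030 sequence classification: returns 1, 2, or 4 for
   a valid sequence start, or 0 for an invalid lead/trail combination.
   (Gb18030SeqLen is a hypothetical helper name.) */
static int Gb18030SeqLen(unsigned char b1, unsigned char b2)
{
    if (b1 <= 0x7f)
        return 1;                               /* single byte (ASCII range) */
    if (b1 >= 0x81 && b1 <= 0xfe) {
        if (b2 >= 0x30 && b2 <= 0x39)
            return 4;                           /* four-byte sequence */
        if ((b2 >= 0x40 && b2 <= 0x7e) || (b2 >= 0x80 && b2 <= 0xfe))
            return 2;                           /* two-byte sequence */
    }
    return 0;                                   /* invalid */
}
```

Note that you cannot even tell two-byte sequences from four-byte ones without peeking at the second byte -- which is exactly the sort of branching that slows down every operation on the string.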

Moving on to UTF-32, obviously it is the uber-choice for ease of operations with code points since every code point has the exact same size. However, speed is also a factor of working set so this ease (leading to development "speed") must be balanced with the larger memory footprint that could affect the speed of any operations that are done, depending on the platform (this will be covered more another day).
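The fixed width is why random access is so trivial in UTF-32 -- the Nth code point is plain array indexing, with no scanning from the start of the string. A trivial sketch (the helper name is mine):

```c
#include <stddef.h>

/* With UTF-32, every code point is exactly 4 bytes, so getting the Nth
   code point is constant-time indexing rather than a scan. */
static unsigned int NthCodePoint(const unsigned int *s, size_t n)
{
    return s[n];
}
```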

UTF-16 is just as good a choice as UTF-32 if you do not need supplementary characters (Unicode code points from U+10000 to U+10FFFF), with the bonus of a smaller memory footprint. If you need to make extensive use of supplementary characters, however, you do lose some of that speed to range checking and the like.

If you like, you can borrow the following definitions, which will be available in the winnls.h header file in the upcoming Platform SDK release for Longhorn Beta 1 (just remember to remove yours if you get new SDK header so you can avoid the "duplicate definition" compile errors!):

#define HIGH_SURROGATE_START  0xd800
#define HIGH_SURROGATE_END    0xdbff
#define LOW_SURROGATE_START   0xdc00
#define LOW_SURROGATE_END     0xdfff

#define IS_HIGH_SURROGATE(wch) (((wch) >= HIGH_SURROGATE_START) && ((wch) <= HIGH_SURROGATE_END))
#define IS_LOW_SURROGATE(wch)  (((wch) >= LOW_SURROGATE_START) && ((wch) <= LOW_SURROGATE_END))
#define IS_SURROGATE_PAIR(hs, ls) (IS_HIGH_SURROGATE(hs) && IS_LOW_SURROGATE(ls))

These additions are actually directly due to customer feedback (both internal and external to Microsoft) in this area -- these macros answer basic questions about supplementary character processing in UTF-16, and it really does not make sense to force every developer to define their own.
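For example, here is a minimal sketch of the kind of range checking the macros take care of: counting the code points in a UTF-16 buffer, treating each surrogate pair as one. (The macros are repeated so the snippet compiles on its own; CountCodePoints is a hypothetical helper name, not anything in the SDK.)

```c
#include <stddef.h>

/* Repeated from above so this compiles standalone. */
#define HIGH_SURROGATE_START  0xd800
#define HIGH_SURROGATE_END    0xdbff
#define LOW_SURROGATE_START   0xdc00
#define LOW_SURROGATE_END     0xdfff

#define IS_HIGH_SURROGATE(wch) (((wch) >= HIGH_SURROGATE_START) && ((wch) <= HIGH_SURROGATE_END))
#define IS_LOW_SURROGATE(wch)  (((wch) >= LOW_SURROGATE_START) && ((wch) <= LOW_SURROGATE_END))

/* Count code points in a UTF-16 buffer of len units, counting each
   high+low surrogate pair as a single supplementary code point. */
static size_t CountCodePoints(const unsigned short *s, size_t len)
{
    size_t count = 0, i = 0;
    while (i < len) {
        if (i + 1 < len && IS_HIGH_SURROGATE(s[i]) && IS_LOW_SURROGATE(s[i + 1]))
            i += 2;    /* surrogate pair */
        else
            i += 1;    /* BMP unit (or unpaired surrogate) */
        count++;
    }
    return count;
}
```

This is also the speed cost mentioned above in a nutshell -- every operation that cares about code points has to do those two range checks per unit.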

The header file also includes a cool comment block with sample conversions between UTF-32 and UTF-16, which for single code points or surrogate pairs could perhaps also be put in a macro, but there has not really been as much demand for that yet. Obviously there will be future releases for such things to be considered, if the demand picks up. :-)
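The arithmetic behind those conversions is simple enough to sketch here (the helper names are mine, not the SDK sample's): a surrogate pair stores the code point minus 0x10000, with the high ten bits in the high surrogate and the low ten bits in the low surrogate.

```c
/* Combine a high/low surrogate pair into a UTF-32 code point. */
static unsigned int SurrogatePairToUtf32(unsigned short hs, unsigned short ls)
{
    return 0x10000u + (((unsigned int)(hs - 0xd800u) << 10) |
                        (unsigned int)(ls - 0xdc00u));
}

/* Split a supplementary code point (U+10000..U+10FFFF) into a pair. */
static void Utf32ToSurrogatePair(unsigned int cp, unsigned short *hs, unsigned short *ls)
{
    cp -= 0x10000u;
    *hs = (unsigned short)(0xd800u + (cp >> 10));      /* high ten bits */
    *ls = (unsigned short)(0xdc00u + (cp & 0x3ffu));   /* low ten bits */
}
```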

Now I already spake volumes about UTF-8 speed of operations yesterday with the simple pseudo-interview question post Getting exactly ONE Unicode code point out of UTF-8. There may be good reasons to go to UTF-8 sometimes (I will talk about them more later in the series), but speed of operations on code points is not an area where it will really shine.
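To make the cost concrete, here is a minimal sketch (my own, not the code from that post) of pulling exactly one code point out of UTF-8: the lead byte determines the length, and each trail byte contributes six bits. Real code would also need to reject overlong forms, surrogates, and malformed trail bytes, which only adds to the per-character branching.

```c
/* Decode one code point from a UTF-8 buffer; returns the number of
   bytes consumed (1-4), or 0 on an invalid lead byte. No validation
   of overlong forms or trail bytes, which production code needs.
   (DecodeOneUtf8 is a hypothetical helper name.) */
static int DecodeOneUtf8(const unsigned char *s, unsigned int *cp)
{
    if (s[0] < 0x80) {                             /* 1 byte: ASCII */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0) {                   /* 2 bytes */
        *cp = ((s[0] & 0x1fu) << 6) | (s[1] & 0x3fu);
        return 2;
    }
    if ((s[0] & 0xf0) == 0xe0) {                   /* 3 bytes */
        *cp = ((s[0] & 0x0fu) << 12) | ((s[1] & 0x3fu) << 6) | (s[2] & 0x3fu);
        return 3;
    }
    if ((s[0] & 0xf8) == 0xf0) {                   /* 4 bytes */
        *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3fu) << 12) |
              ((s[2] & 0x3fu) << 6)  |  (s[3] & 0x3fu);
        return 4;
    }
    return 0;                                      /* invalid lead byte */
}
```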

There is not much to say about CESU-8 in this context -- it essentially takes everything that stinks about UTF-16 processing and everything that stinks about UTF-8 processing and combines them. You might need to do some actual performance tests to see which would be worse for processing: MBCS code pages or CESU-8.

In future posts, other considerations such as platform, endian issues, compression capabilities, and more will be covered. If you could limit comments to the speed of operations issue and hold off on the other issues until the relevant posts come up, it would save me having to delete comments until they are relevant, and I would appreciate it. :-)

 

This post brought to you by "₡" (U+20a1, a.k.a. COLON SIGN)


# MGrier on 22 May 2005 2:21 PM:

I very strongly suspect that UTF-8 and/or using custom code pages is almost always faster because of the decreased memory footprint.

Cache lines are typically 8 bytes lately I believe, so my guess is that the only time utf-32 always wins is when you only have strings with 1-2 characters. Utf-16: 2-4. Utf8 is actually harder because there's greater variance based on the input size; if all your characters are in the supplementary range (and thus took 4-5 byte encodings) you'll lose but almost always otherwise, given a good enough utf-8 decoding implementation, you'll win with utf-8.

A good utf-8 decoding implementation that takes advantage of cache lines may or may not be easy depending on the compiler and CPU.

This sounds like a fun experiment to try some time.

Note that this is a statistically based argument - you can certainly find data where any one of the encodings is clearly "the best". If bounded/guaranteed worst case time is more important than better typical time, you should definitely use UTF-32.

# Michael S. Kaplan on 22 May 2005 4:24 PM:

Hmmm.... I am not sure I would agree with that assessment, especially given the fact that most of UTF-8 is bigger than UTF-16 (and almost all of what is not is the same size), and most of the MBCS code pages are the same size but require more operations to be done on them.

So they lose on both size and speed in operations, on all but the ASCII scenarios. So if it isn't English, both of them will cost....

# Michael Dunn_ on 24 May 2005 12:46 PM:

UTF-16 has an advantage over every other encoding in that you can pass UTF-16 strings directly to Win32 APIs. If you use another encoding, you'll have to spend time converting your strings to/from UTF-16 before/after calling APIs.

# Michael S. Kaplan on 24 May 2005 2:06 PM:

Hey Mike!

Yes, I cover this point in a future article about the platform issues. :-)

# Dallas on 20 Aug 2008 5:56 AM:

If UTF-8 is commonly called Multibyte, and UTF-16 is commonly called wide character, what common name do we give UTF-32? So do we have mbstowcs() and mbsto??s()

If in a UTF-8 VC++ source file, UTF-16 strings use L"" notation, how to represent UTF-32 strings?

In a Unicode VC++ source file (UTF-16), how do you stop Unicode character in the ASCII strings? Is there an A"" notation?

Even though UTF-32 exists, and when we meet up with aliens, UTF-64 may exist too, but why should we use them commonly in our programming environment? The font files will be enormous. Why not just have a different font for each language and rather than an BOM, have a language marker?


referenced by

2010/11/24 UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16

2005/05/25 You may want to rethink your choice of UTF, #3 (Platform?)

2005/05/24 Encoding scheme, encoding form, or other

go to newer or older post, or back to index or month or day