by Michael S. Kaplan, published on 2005/05/22 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/22/420822.aspx
In the first post in this series, I talked about size considerations, but even then I hinted that there is more to the decision than just size. I mean, I am not saying that size does not matter, but there are other facts that can matter more....
In this second post, I am going to talk about the speed of operations.
(As before, I am assuming you will always want to use Unicode of some form, but I will include the code page answer as an initial strawman each time.)
Starting with our strawman of the legacy code page: if everything you need to do fits in one single-byte code page, then it will often be the winner here. Unluckily for the code pages, there is no language for which that solution is good enough, except perhaps English. This is a point covered many times in this blog, in posts such as Code pages are really not enough....
Of course, if you need to work with a multibyte code page then this solution is clearly a big loser -- the world of lead bytes, trail bytes, IsDBCSLeadByte API calls, and more is a truly unhappy one, in many senses (I lived in that world for a time, and while I would not think of it as hell per se, in many senses it will live on in my memory as utter pandemonium). And this is even before considering the fact that every single MBCS code page is officially insufficient to cover all of the languages.
I'll split out GB-18030 from this mix, since clearly it is sufficient. However, it has the same problem with lead bytes and trail bytes, to which you can add the logic for the four-byte sequences. I might wish such a processing effort on a developer I did not like very much, but only if I were in a really bad mood.
Moving on to UTF-32, obviously it is the uber-choice for ease of operations with code points since every code point has the exact same size. However, speed is also a factor of working set so this ease (leading to development "speed") must be balanced with the larger memory footprint that could affect the speed of any operations that are done, depending on the platform (this will be covered more another day).
UTF-16 is just as good a choice as UTF-32 if you do not need supplementary characters (Unicode code points from U+10000 to U+10FFFF), with the bonus of a smaller memory footprint. If you need to make extensive use of supplementary characters, however, you do lose some speed to the range checking and such.
If you like, you can borrow the following definitions, which will be available in the winnls.h header file in the upcoming Platform SDK release for Longhorn Beta 1 (just remember to remove yours if you get new SDK header so you can avoid the "duplicate definition" compile errors!):
#define HIGH_SURROGATE_START 0xd800
#define HIGH_SURROGATE_END 0xdbff
#define LOW_SURROGATE_START 0xdc00
#define LOW_SURROGATE_END 0xdfff
#define IS_HIGH_SURROGATE(wch) (((wch) >= HIGH_SURROGATE_START) && ((wch) <= HIGH_SURROGATE_END))
#define IS_LOW_SURROGATE(wch) (((wch) >= LOW_SURROGATE_START) && ((wch) <= LOW_SURROGATE_END))
#define IS_SURROGATE_PAIR(hs, ls) (IS_HIGH_SURROGATE(hs) && IS_LOW_SURROGATE(ls))
These additions are actually a direct result of customer feedback (both internal and external to Microsoft) in this area -- these macros answer basic questions about supplementary character processing in UTF-16, and it really does not make sense to force every developer to define their own.
The header file also includes a cool comment block with sample conversions between UTF-32 and UTF-16, which for single code points or surrogate pairs could perhaps also be put into macros, but there has not really been as much demand for that yet. Obviously there will be future releases in which such things can be considered, if the demand picks up. :-)
Now I already spake volumes about the UTF-8 speed of operations yesterday in the simple pseudo-interview question post Getting exactly ONE Unicode code point out of UTF-8. There may be good reasons to go to UTF-8 sometimes (I will talk about them more later in the series), but speed of operations on code points is not an area where it really shines.
There is not much to say about CESU-8 in this context -- it essentially takes everything that stinks about UTF-16 processing and everything that stinks about UTF-8 processing and combines them together. You might need to do some actual performance tests to see which would be worse for processing: MBCS code pages or CESU-8.
In future posts, other considerations such as platform, endian issues, compression capabilities, and more will be covered. If you could limit comments to the speed of operations issue and hold off on the other issues until the relevant posts come up, it would save me from having to delete comments until they are relevant, and I would appreciate it. :-)
This post brought to you by "₡" (U+20a1, a.k.a. COLON SIGN)
Dallas on 20 Aug 2008 5:56 AM:
If UTF-8 is commonly called Multibyte, and UTF-16 is commonly called wide character, what common name do we give UTF-32? So do we have mbstowcs() and mbsto??s()
If in a UTF-8 VC++ source file, UTF-16 strings use L"" notation, how to represent UTF-32 strings?
In a Unicode VC++ source file (UTF-16), how do you keep Unicode characters out of the ASCII strings? Is there an A"" notation?
Even though UTF-32 exists, and when we meet up with aliens, UTF-64 may exist too, but why should we use them commonly in our programming environment? The font files will be enormous. Why not just have a different font for each language and rather than an BOM, have a language marker?
2005/05/24 Encoding scheme, encoding form, or other