You may want to rethink your choice of UTF, #3 (Platform?)

by Michael S. Kaplan, published on 2005/05/25 17:20 -04:00, original URI:

Ok, by now you know the drill -- I am comparing the various ways of expressing text in Unicode.

In prior posts I have talked about the issues related to size and to speed. However, both of those posts were working in a theoretical vacuum that was independent of the world in which the code would have to live.

This may be suitable for the internal engine of your component, but once your component has to talk to the world outside, the need to take into account the environment on which the code must rest (in other words, the platform) becomes important.

What is crucial here is that the fastest and best encoding to use for these communications is the "native" type of the platform.

The key is to match that encoding form, whatever it may be.

If you do, then all of the native APIs of that platform are available to you directly. And by minimizing the number of conversions, you maximize performance while minimizing the chance of a logic error corrupting data.

If you are running against a Windows platform, then that means you are using UTF-16. Period.

If you are a web service that has to deal with Internet protocols like SOAP and such (or if your platform's Unicode support story happens through UTF-8) then your best bet may be UTF-8.

And if you are running on a UNIX box whose wide character type is four bytes (one code unit per code point), then UTF-32 is really your only good option.
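To make the three cases concrete, here is a minimal Python sketch that is not part of the original post. It encodes one short string in each encoding form and compares the byte counts. The sample string and the variable names are illustrative assumptions; the little-endian codec names are used only so that no byte order mark is added.

```python
# Illustrative sample: 'a', U+2202 PARTIAL DIFFERENTIAL, and one
# supplementary-plane character (U+1F600).
text = "a\u2202\U0001F600"

utf8 = text.encode("utf-8")       # 1 + 3 + 4 bytes
utf16 = text.encode("utf-16-le")  # 2 + 2 + 4 bytes (surrogate pair)
utf32 = text.encode("utf-32-le")  # 4 + 4 + 4 bytes

sizes = {"utf-8": len(utf8), "utf-16": len(utf16), "utf-32": len(utf32)}

# The three byte sequences all differ, so a buffer in one form cannot be
# handed to an API that expects another without a conversion step.
all_differ = len({utf8, utf16, utf32}) == 3
```

All three byte sequences carry the same text, but none of them can be passed as-is to an API that expects one of the others. That conversion step is the cost the platform-native choice avoids.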

Now remember that this only refers to those external communications.

If you do extensive string processing in an existing application that already uses some other form, then it will often make more sense to leave it in that form and just make sure to use the platform type when that communication is required. You may also find certain operations to be much more cumbersome in the platform's form, particularly string processing, formatting, and parsing. If that is the case, then your internal engine might be using UTF-16 or UTF-32, even if the underlying platform's default type is not.
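A rough sketch of that split, in Python, might look like the following. It is only an illustration under stated assumptions: `PLATFORM_ENCODING`, `from_platform`, `to_platform`, and `process` are hypothetical names, and the platform is assumed to want UTF-16 (little-endian) at its boundary. Python's own `str` type stands in for the internal form.

```python
# Assumption: the platform's native APIs take UTF-16LE buffers.
PLATFORM_ENCODING = "utf-16-le"


def from_platform(raw: bytes) -> str:
    """Convert once, on the way in, from the platform form to the internal form."""
    return raw.decode(PLATFORM_ENCODING)


def to_platform(text: str) -> bytes:
    """Convert once, on the way out, from the internal form to the platform form."""
    return text.encode(PLATFORM_ENCODING)


def process(raw: bytes) -> bytes:
    """Hypothetical engine entry point: all real work happens in the internal form."""
    text = from_platform(raw)
    text = text.strip().upper()  # stand-in for the extensive string processing
    return to_platform(text)
```

The point of the shape is that there are exactly two conversions per call, however much processing happens in between.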

This may also be the case if you create cross-platform libraries -- it may prove to be an unmaintainable mess to try to make the underlying identity of strings change on different platform compiles, but it would make perfect sense for the library to use one form for all of its internal code....
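For a cross-platform library, one way to picture this is a single internal string type plus a small table of boundary encodings. The Python sketch below is hypothetical: `boundary_encoding` and `export_text` are invented names, and the mapping (UTF-16 for Windows-like platform names, UTF-8 for everything else) is a simplifying assumption rather than a rule from the post.

```python
def boundary_encoding(platform_name: str) -> str:
    """Pick the encoding used only at the platform boundary.

    Assumption for illustration: names starting with "win" get UTF-16LE,
    everything else gets UTF-8.
    """
    return "utf-16-le" if platform_name.startswith("win") else "utf-8"


def export_text(text: str, platform_name: str) -> bytes:
    """The library's internal form (str) never changes; only the export differs."""
    return text.encode(boundary_encoding(platform_name))
```

In real use the platform name would likely come from something like `sys.platform`; it is a parameter here so the behavior is the same wherever the sketch is run.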


This post brought to you by "∂" (U+2202, a.k.a. PARTIAL DIFFERENTIAL)


