by Michael S. Kaplan, published on 2005/09/10 10:15 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/10/463371.aspx
REQUIRED DISCLAIMER: This post is almost certainly not what you think it is about after you saw its title. The blatant attempt to increase the size of this disclaimer to hide the actual content from the blogs.msdn.com title page filter should be the best possible clue to this fact. Consider yourself warned.... :-)
It is amazing how often the size of Unicode is brought up as one of the points against it.
There are those people who are happy with their single byte code page despite the fact that code pages are really not enough, and they feel that moving to Unicode would thus double their storage size.
There are those people who feel that Indic scripts are penalized because what they think of as single letters require multiple code points, and who consider this neo-imperialistic chauvinism at the hands of a bunch of US-based companies to be an offensive and expensive example of ignorance.
There are those people who look at the way UTF-8 tries to penalize some languages by causing them to need three or even four bytes per character while others fit in two (or even one if it is English).
There are those people who think that UTF-16 tries to penalize ancient scripts (which most find okay, since usage is not as common) and virtually all of the newly encoded ideographs (which many find not okay, since it is not all ancient usage and this is taken as a form of punishment for encoding data from HKSCS and JIS, to give two examples).
There are those who see the simple solution as moving to UTF-32 because then every character is equal in size, even though this ignores the simple fact that only those who are either unintentionally ignorant of language issues or willfully obtuse about language usage would assume that any of these verifiably false statements is true:
Over on The Language Log, Mark Liberman had some fascinating stats that he posted last month in his post "One world, how many bytes?"
By just comparing English and Chinese and without even getting into the Unicode side of these arguments that I started this inappropriately titled post with, Mark manages to refute most of the bulleted points. And he notes that while the ratios vary, Chinese will take up less space than the English translations thereof (and the converse of this may be even better proof of the point!).
Interestingly and somewhat perversely, if you combine the information in his post with the original points at the top of mine, it makes a good case for the notion that causing Han ideographs to need more bytes per code point may actually serve to equalize a bias against non-ideographic scripts, which clearly need more space to represent the same information!
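To make the trade-off concrete, here is a minimal Python sketch that counts code points and bytes for one short phrase in each language. The sample pair is hand-picked by way of illustration (a rough translation, not data from Mark's post), and the `encoded_sizes` helper is a hypothetical name, so treat the numbers as a toy comparison and not as a measurement of anything.

```python
# Illustrative only: a hand-picked phrase pair, NOT data from the Language Log post.
samples = {
    "English": "Size matters",
    "Chinese": "大小很重要",  # rough translation, assumed for illustration
}

def encoded_sizes(text):
    """Return the code point count and the byte count in each encoding form.

    The "-le" codecs are used so that no byte order mark is added to the count.
    """
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

In this one toy pair the Chinese phrase needs three bytes per character in UTF-8 versus one for English, yet its total is only modestly larger because it needs far fewer characters. In UTF-16 and UTF-32 the Chinese phrase comes out smaller than the English one.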
All of which ignores a point that Unicode President Mark Davis reminded us of some time ago on the Unicode List: in the world of the web, HTML, and XML, large parts of the markup data in these formats are all single bytes in UTF-8. And further, the common costs of transmission of data are more in the binary pieces like pictures, which dwarf the size of the text anyway.
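The markup point is easy to check on a small scale. The sketch below takes a made-up HTML fragment (the fragment and the `utf8_breakdown` helper are assumptions for illustration, not anything from the Unicode List thread) and splits its UTF-8 size into single-byte ASCII and everything else, with the UTF-16 size alongside for comparison.

```python
# Hypothetical HTML fragment: ASCII markup wrapped around a short Chinese phrase.
doc = '<p class="note"><a href="/wiki">大小很重要</a></p>'

def utf8_breakdown(text):
    """Split the UTF-8 size of text into ASCII bytes and non-ASCII bytes.

    In UTF-8 every byte below 0x80 is a complete ASCII character, so counting
    those bytes gives the share taken up by markup and other ASCII text.
    """
    data = text.encode("utf-8")
    ascii_bytes = sum(1 for b in data if b < 0x80)
    return {
        "total": len(data),
        "ascii": ascii_bytes,
        "non_ascii": len(data) - ascii_bytes,
        "utf16_total": len(text.encode("utf-16-le")),  # no byte order mark
    }

print(utf8_breakdown(doc))
```

For this fragment most of the UTF-8 bytes are markup, and the UTF-8 form is smaller than the UTF-16 form even though the visible text is entirely Chinese. Real pages will vary, which is rather the point about how hard it is to measure size.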
But in the end, I am willing to allow for the fact that size may matter, if it will cause these cranks to stop arguing the point. However, they will need to prove to me that they have navigated the rocky shoals of trying to determine how on earth they can measure what the size is, enough to prove any point at all. Since to date every argument I have ever heard about problems with size in Unicode has been dead wrong and while occasionally it is brilliantly so, usually it is quite simply not.
Until then (when it comes to text) we can likely stop freaking out about size. Because clearly none of the people making the argument have a good enough understanding of how to use it, anyway.
Hmmm. If that is the point then maybe I was wrong with my original disclaimer? :-)