Does size matter? And if so, how do you measure it?

by Michael S. Kaplan, published on 2005/09/10 10:15 -04:00, original URI:

REQUIRED DISCLAIMER: This post is almost certainly not what you think it is about after you saw its title. The blatant attempt to increase the size of this disclaimer to hide the actual content from the title page filter should be the best possible clue to this fact. Consider yourself warned.... :-)

It is amazing how often the size of Unicode is brought up as one of the points against it.

There are those people who are happy with their single-byte code page despite the fact that code pages are really not enough, and who feel that moving to Unicode would double their storage size.

There are those people who feel that Indic scripts are penalized because what they think of as letters require multiple code points, and who consider this neo-imperialistic chauvinism at the hands of a bunch of US-based companies to be an offensive and expensive example of ignorance.
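To make that complaint concrete, here is a minimal Python sketch that breaks one Devanagari syllable into its code points. The sample syllable (KA + VIRAMA + SSA + VOWEL SIGN I) is my own choice for illustration and does not come from the original discussion; a reader of the script might well think of it as a single unit.

```python
import unicodedata

# Illustrative sample (an assumption, not from the post): one Devanagari
# syllable that is stored as four separate code points.
cluster = "\u0915\u094d\u0937\u093f"

def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

code_points = len(cluster)                      # number of code points
utf8_bytes = len(cluster.encode("utf-8"))       # bytes in UTF-8
utf16_bytes = len(cluster.encode("utf-16-le"))  # bytes in UTF-16, no BOM

for line in describe(cluster):
    print(line)
print(code_points, "code points,", utf8_bytes, "UTF-8 bytes,",
      utf16_bytes, "UTF-16 bytes")
```

Whether four code points for one perceived letter is a penalty depends entirely on what you decide to count, which is rather the point.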

There are those people who look at the way UTF-8 tries to penalize some languages by causing them to need three or even four bytes per character while others fit in two (or even one if it is English).
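The byte counts behind this complaint are easy to check. The sketch below uses a handful of sample characters that I picked for illustration (they are not from the original discussion) and reports how many bytes each one needs in UTF-8.

```python
# Sample characters are assumptions chosen for illustration only.
samples = {
    "\u0041": "LATIN CAPITAL LETTER A",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE",
    "\u0915": "DEVANAGARI LETTER KA",
    "\u4e2d": "CJK ideograph U+4E2D",
    "\U00020000": "CJK Extension B ideograph U+20000",
}

def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))

for ch, name in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s)")
```

ASCII fits in one byte, the rest of Latin (along with Greek, Cyrillic, Hebrew and Arabic) in two, most other scripts in the Basic Multilingual Plane in three, and supplementary characters in four.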

There are those people who think that UTF-16 tries to penalize ancient scripts (which most consider okay, since the usage is not as common) and virtually all of the newly encoded ideographs (which many consider not okay, as not all of that usage is ancient and this is taken as a form of punishment for encoding data from HKSCS and JIS, to give two examples).
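The UTF-16 version of the complaint comes down to surrogate pairs: anything outside the Basic Multilingual Plane takes two 16-bit code units instead of one. A small sketch, again with sample characters that are my own assumptions rather than anything from the original argument:

```python
def utf16_bytes(text: str) -> int:
    """Bytes needed in UTF-16. The 'utf-16-le' codec is used so that no
    byte order mark is prepended to the result."""
    return len(text.encode("utf-16-le"))

def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed in UTF-16."""
    return utf16_bytes(text) // 2

bmp_ideograph = "\u4e2d"        # U+4E2D, inside the BMP
ext_b_ideograph = "\U00020000"  # U+20000, CJK Extension B (supplementary)

for ch in (bmp_ideograph, ext_b_ideograph):
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s), "
          f"{utf16_bytes(ch)} bytes")
```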

There are those who see the simple solution as moving to UTF-32 because then every character is equal in size. Even though this ignores the simple fact that only those who are either unintentionally ignorant of language issues or willfully obtuse about language usage would assume that any of these verifiably false statements is true:

Over on The Language Log, Mark Liberman posted some fascinating stats last month in his post One world, how many bytes?

By just comparing English and Chinese, and without even getting into the Unicode side of the arguments that I started this inappropriately titled post with, Mark manages to refute most of the bulleted points. And he notes that while the ratios vary, Chinese will take up less space than the English translations thereof (and the converse of this may be even better proof of the point!).

Interestingly and somewhat perversely, if you combine the information in his post with the original points at the top of mine, it makes a good case for the notion that causing Han ideographs to need more bytes per code point may actually serve to equalize a bias against non-ideographic scripts, which clearly need more space to represent the same information!

All of which ignores a point that Unicode President Mark Davis reminded us of some time ago on the Unicode List: in the world of the web, HTML, and XML, large parts of the markup in these formats are single bytes in UTF-8. And further, the common costs of transmitting data are more in the binary pieces like pictures, which dwarf the size of the text anyway.
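The markup point can be sketched in a few lines of Python. The HTML fragment below is made up for illustration (it is not from Mark Davis's message); it wraps four Han ideographs in ordinary ASCII markup and compares the encoded sizes.

```python
# Hypothetical fragment: ASCII markup around four Han ideographs.
fragment = ('<p class="greeting"><a href="/index.html">'
            '\u4f60\u597d\u4e16\u754c</a></p>')

ascii_chars = sum(1 for ch in fragment if ord(ch) < 0x80)  # markup
other_chars = len(fragment) - ascii_chars                  # the actual text

utf8_size = len(fragment.encode("utf-8"))
utf16_size = len(fragment.encode("utf-16-le"))  # no byte order mark

print(f"{ascii_chars} ASCII characters, {other_chars} non-ASCII characters")
print(f"UTF-8: {utf8_size} bytes, UTF-16: {utf16_size} bytes")
```

Even though each ideograph costs three bytes in UTF-8 against two in UTF-16, the one-byte markup around it makes the UTF-8 version of this fragment the smaller one. How far that holds for a real page depends on its ratio of markup to text.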

But in the end, I am willing to allow for the fact that size may matter, if it will cause these cranks to stop arguing the point. However, they will need to prove to me that they have navigated the rocky shoals of trying to determine how on earth they can measure what the size is, enough to prove any point at all. Since to date every argument I have ever heard about problems with size in Unicode has been dead wrong and while occasionally it is brilliantly so, usually it is quite simply not.

Until then (when it comes to text) we can likely stop freaking out about size. Because clearly none of the people making the argument have a good enough understanding of how to use it, anyway.

Hmmm. If that is the point then maybe I was wrong with my original disclaimer? :-)

# Mihai on 11 Sep 2005 5:18 PM:

This is indeed one of the arguments I often hear against making a full Unicode application.

I agree it might be relevant for some TB databases, but not for the strings of the software UI or for the files saved by an application.
First, the strings are stored as Unicode in the resources even if the application is not Unicode (and you might get a minor performance penalty for converting to ANSI).

But for size, I have collected all the strings in an application, converted them to Unicode, only to discover that all this was smaller than the true-color, fancy splash-screen!

Duh! Saving space!

# Michael S. Kaplan on 12 Sep 2005 7:12 AM:

Indeed Mihai, it is amazing how often people go on about how important it is to avoid Unicode because of the size issue....


