It is easy (and obnoxious) to claim "size doesn't matter" if one has the size everyone wants

by Michael S. Kaplan, published on 2010/07/21 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/21/10036699.aspx

In the realm of international politics (one of my favorite people in this or any other world is cringing as she reads me use that phrase but I promise I'll be good!), there are several different philosophies that can guide the way one does the work.

also known as basic fairness, where everybody is pretty much treated equally and contributes as much no matter who you are, when you showed up, how much you need, etc.

Now this idea has some significant limitations as it means people who have less are forced to contribute just as much even if they don't have enough or if the problems were not of their own doing.

which addresses that some by making sure that who pays in and how has a lot more to do with who can contribute and who did the most to require people to contribute.

Now both of these try to be fair, but in different ways. There are specific times that each might be better than the other.

and although popular, it has some drawbacks, especially in the situations where there is a specific cost to being late to the game.

You could, by and large, think of Unicode and ISO/IEC 10646 in terms of international politics. It certainly involves international and can often be quite political! :-)

Now every script gets to play, which kind of fits in the share and share alike philosophy, and some scripts need more effort than others but those resources are often allocated which fits well into differentiated responsibility a little bit.

But for the most part, if you are one of the scripts whose code points are allocated after 0x7ff fo UTF-8 or after U+FFFF for UTF-8/UTF-16, it will cost you.

The former puts you into three-bytes per Unicode code point land, and the latter puts you in four-bytes per Unicode code point land.

If you are in a land of broadband it may not be as important, but if you are in dial-up or heavily metered territory, those extra bytes will cost. Especially if some or all of the communication will be in that three-byte or four-byte range.

I would argue that the USA is a country where connectivity among computer users including broadband is pretty prevalent. So its easy to look at the one language that pretty much stays in the one-byte per Unicode code point land and feel a little bitter, thinking about how one of the countries most able to afford the extra cost is the one that gets off cheapest.

And it is easy to wonder why e.g. Syriac is below 0x7ff when so many scripts in heavy modern use are above it (though I suspect if they had put Devanagari below and all the other Indics above -- for example -- we could have caused violence in lots of places!).

When you add to that the fact that for most Indics, which use the Virama/Abugida method to encode, native text in the script will be almost twice as big, taking two Unicode code points for all letters other than the ones that use the inherent "a" vowel.

Plus there are those alternate forms that require ZWJ and ZWNJ to be there too (I've talked about them before in blogs like Which form to use if the form keeps changing?). I'll remind everyone that the Unicode implementation suggestion from the Indic FAQ adds yet another character -- a three byte one -- the form most commonly used.

The upshot is that for the Indic scripts, the cost per linguistic character in the script is 3-6 bytes per (usually 6) with conjuncts being 9-15 or 15-21 or 21-27 bytes per.

Having spoken to several people in Chennai and Coimbatore and Bangalore and Hyderabad, being told that this cost is no big deal by people who aren't paying metered usage and who use just 1 byte per character sounds just a tad condescending to more than just a few of the billion plus people potentially impacted.

I can't argue with their logic, but I can say that one of the reasons so many of the original Indics were put together is that they were submitted at the same time by India, which really did think of them as a big Indic block. This is kind of what ewas asked for.

Now they did not get out of it completely unscathed. They too are paying that extra price too, for Hindi (including a non-trivial possible amount of those prices for conjuncts). So it is not like they get it any easier. But obviously if they had originally requested one or more of the Indics be done as syllabaries in the original proposal (Tamil and Bengali are the only two I have seen suggested in such a way by native speakers though to be fair I have not spent a ton of time looking for many others!), then just like Ethiopic they might have gotten it.

With the benefits of hindsight and post-mortem review and more knowledge, it is easy to criticize. And people do, in fact, criticize. Sometimes I think that is what 75% of this blog ends up being about.

For me the worst part is that some of the people who did the original work don't even see an issue here -- they feel quite good about all they have done and don't see the irony that happens when they are a bit piqued when those it was done for aren't "more appreciative." I have found myself apologizing for "those who don't know any better" quite a bit in recent times.

And talking about the importance of broadband (to help make sure everyone can cross the bridge that the water is running under)....

To be honest, it is almost frightening (though predictable in retrospect) how much better of a response one gets when one starts by trying to actually understand a concern rather than leading off by attacking it.

I questioned the placement of Syriac at the time, but it seems that there had always been a plan to reserve an area for RTL scripts, and that the assignment of U+590 to U+8FF as this area predates the existence of UTF-8. The Roadmaps <unicode.org/.../> highlight the three RTL areas in yellow, the other two being U+FB50 to U+FDFF and U+10800 to U+10FFF. Arguably Syriac belongs on Plane 1, but there really isn't any other current RTL script to place in the Plane 0 RTL area instead.

"This just isn't a problem if you use SCSU."

Or data compression.

Or you could make an encoding like GB18030 which puts this script in the number of bytes it has traditionally had, and everything else in more bytes.

What part of this is there more to than can be solved by compression? You haven't mentioned a single aspect that's not to do with data size [dial-up, metered usage]. Lots of protocols have support for content compression (e.g. HTTP's "Content-Encoding"), and that has the benefit of being [if I may mix water metaphors] a ship that has not already sailed - in that protocols can be extended [and support for such extensions more widely implemented] or replaced much more easily than scripts can be moved to a different Unicode block or done as syllabaries.

It's also a bit unfair to talk about 'bytes per linguistic character' like one of these linguistic characters only carries as much meaning as an english letter - if that were true there would only be 26 of them. A better comparison would be the encoded [and/or compressed] sizes of translations of some standard text. Is it really going to come out as 21-27 times as many bytes as the same text in english?

The way that Unicode encodes extra details that have nothing to do with the language cannot be completely compressed by a generic algorithm that has no awarenress of specific qualities of the language in question, particularly when it is mixed with another language like English that has different qualities. Expecting every browser and oS and device to have separate per-language compession is very unrealistic.

And I am talking about their language the way they talk about it and think about it -- which is the only way to really communicate with someone when you are telling them they have to make the sacrifices....

"The way that Unicode encodes extra details that have nothing to do with the language cannot be completely compressed by a generic algorithm that has no awarenress of specific qualities of the language in question"

Says who? Note that I'm not talking about watered-down "compression" like SCSU that never uses less than one byte per code point, i'm talking about real compression like gzip. That works by detecting patterns of bytes.* For example, the fact that the first two bytes of each code point is the same for all characters that are from the same 64-character block, and the fact that some three-byte (or longer) sequences appear particularly often in the data. It doesn't need to know that in advance to work, it's evident from the data. In theory you could maybe get an improvement for small texts [in any language] by having a language-specific starting dictionary, but that's a separate matter from what it would take to put it on equal footing with english.

I don't know what "And I am talking about their language the way they talk about it and think about it" was in response to, but the fact that a syllable [your 'linguistic characters'] is 'worth' more than a letter is not a matter of opinion - theirs, yours, or mine. So the "21-27 byte" number is manifestly unfair - if english has a size advantage due to how its script is encoded, it's 1:3 _if_ that. (The fair way to compare would be by taking translations of the same text in each language and measuring them)

* well, it works by using a sliding window dictionary, but that certainly benefits from the presence of patterns of bytes.