Putting the 'U' in Unicode (and the 'G' in Galacticode)

by Michael S. Kaplan, published on 2007/06/08 16:04 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/08/3168164.aspx

In a recent thread on The Unicode List, on the perennial topic where someone suggests that 17 planes are not enough room and that the rest of the code space beyond U+10FFFF is needed (in this case in the context of non-terrestrial/extra-terrestrial scripts!), Ken Whistler gave the stats on the Unicode odometer for the next update (as well as some data about character additions):

A propos the discussion about 17 planes, UTF-16, and extraterrestrial characters, I have gone ahead and done the preliminary calculations on what we can expect in terms of numbers of characters for Unicode 5.1, now due sometime next spring, based on the current contents of Amendments 3 and 4 to 10646:2003.

Comparing Unicode 5.0 and 5.1 for the main figures of concern:

                    5.0     5.1
BMP characters    52013   53439
SMP+characters    47007   47315
Total characters  99020  100754
Total designated 238667  240401
Total reserved   875445  873711

"Characters" here refers to the sum of regular graphic characters and Unicode format controls, the "traditional" Unicode count.

"Designated" also includes ISO control codes, noncharacters, private use characters, and the surrogate code points.

"Reserved" is everything else -- the totally unassigned code points still available for encoding characters.

As you can see, we have hardly made a dent in that figure.
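
The figures in the table can be cross-checked with a little arithmetic. What follows is only a minimal sketch in Python: the names (`table`, `check`, and so on) are made up for illustration, and the sizes of the fixed categories (controls, noncharacters, surrogates, private use) are taken from the standard's well-known ranges, not from the message itself.

```python
# Cross-check of the table: designated + reserved should fill 17 planes.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE  # U+0000..U+10FFFF

# Figures copied from the table (5.1 values are preliminary).
table = {
    "5.0": {"bmp": 52013, "supplementary": 47007,
            "designated": 238667, "reserved": 875445},
    "5.1": {"bmp": 53439, "supplementary": 47315,
            "designated": 240401, "reserved": 873711},
}

# Fixed categories that count as "designated" but not as "characters".
# These range sizes are assumptions from the standard, not from the post.
controls = 32 + 1 + 32                       # C0 controls, DEL, C1 controls
noncharacters = 32 + 2 * PLANES              # U+FDD0..U+FDEF, plus xxFFFE/xxFFFF per plane
surrogates = 0xDFFF - 0xD800 + 1             # U+D800..U+DFFF
private_use = (0xF8FF - 0xE000 + 1) + 2 * (CODE_POINTS_PER_PLANE - 2)  # BMP PUA + planes 15, 16
fixed_designated = controls + noncharacters + surrogates + private_use


def check(version):
    """Verify one column of the table; return (characters, reserved fraction)."""
    row = table[version]
    characters = row["bmp"] + row["supplementary"]
    assert characters + fixed_designated == row["designated"]
    assert row["designated"] + row["reserved"] == total_code_points
    return characters, row["reserved"] / total_code_points


for version in table:
    characters, reserved_fraction = check(version)
    print(version, characters, f"{reserved_fraction:.1%} reserved")
```

Under those assumptions both columns add up, and a little over 78% of the code space is still reserved in either version.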

Also, to give you a concrete idea of the current character encoding "velocity", if you take the number of characters added since the last big anomalous jump in content (Extension B in 2001), and average it over the time from 2001 to the anticipated release of Unicode 5.1 in 2008, the per annum character encoding rate for WG2 and the UTC is 944 characters/year (and trending down).

Now we know that some large collections are still to go, particularly for the various East Asian ideographic collections. In addition to CJK Extensions C and D, there is also Old Hanzi (seal script, etc.), Tangut, and Khitan.

And there are more Egyptian hieroglyphs and Sumerian cuneiform to go. Let's take some worst case scenarios and assume those all get done in 2008 and all come in on the large side:

CJK Extension C: 4213
CJK Extension D: 8000
Old Hanzi:       8000
Tangut:          5910
Khitan:          5000
Yi ideographs:   7000
Egyptian basic:  1063
Egyptian ext:    8000
Cuneiform:       1000

O.k., that's another 48,186 characters. Let's assign all these heavy hitters to allocations, and *then* assume that the WG2 and UTC committees will still find enough left over to keep plugging away at 1000 characters per year, indefinitely.

How long have we got?

(873,711 - 48,186) / 1000 = 825 years

Oh dear, it looks like I underestimated before when I said it would take 800 years to fill the 17 planes.

Quick, someone get busy on contacting the Orionids!
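
The back-of-the-envelope estimate above is easy to reproduce. This is just a sketch: the dictionary and the `years_remaining` helper are hypothetical names for illustration, while the numbers are the worst-case guesses and the 1000 characters/year rate from the message.

```python
# Worst-case sizes for the pending large collections, as quoted above.
pending = {
    "CJK Extension C": 4213,
    "CJK Extension D": 8000,
    "Old Hanzi": 8000,
    "Tangut": 5910,
    "Khitan": 5000,
    "Yi ideographs": 7000,
    "Egyptian basic": 1063,
    "Egyptian ext": 8000,
    "Cuneiform": 1000,
}

RESERVED_AFTER_5_1 = 873_711   # "Total reserved" for 5.1 from the table
ASSUMED_RATE_PER_YEAR = 1000   # rounded-up encoding rate used in the estimate


def years_remaining(reserved, pending_total, rate_per_year):
    """Whole years until the reserved code points run out at a constant rate."""
    return (reserved - pending_total) // rate_per_year


pending_total = sum(pending.values())
years = years_remaining(RESERVED_AFTER_5_1, pending_total, ASSUMED_RATE_PER_YEAR)
print(pending_total, years)  # prints: 48186 825
```

Passing the observed 944 characters/year instead of the rounded 1000 only stretches the runway further.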


Now I suppose I could go on about the rest of the messages in the thread: Peter Constable bemusedly wondering whether the Ancient Egyptians would have the same influence the Koreans have had and be able to get the precomposed hieroglyphics in, or the speculation about extra-terrestrials and their character encoding needs, or Dave Starner's curiosity about whether the first extra-terrestrial contact might be a formal request for information so that Unicode could be added to Galacticode, or Michael Maxwell bringing it full circle by reminding us all that, according to the movie Stargate, when we talk about Egyptians we really are talking about extra-terrestrials anyway. :-)

But I find it exhausting to even ponder capturing all of that, so I'll just let the summary and the text in the above paragraph be the extent of it....


This post brought to you by ḡ (U+1E21, a.k.a. LATIN SMALL LETTER G WITH MACRON)

