Unicode isn't advanced mathematics

by Michael S. Kaplan, published on 2006/04/13 12:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/13/575703.aspx


As Peter Vogel pointed out about Unicode a few years back, Unicode is hardly nuclear physics. I am pretty sure it's not advanced mathematics, either (a point that was just proven again last night!).

It has been a while since I have posted about things from the Unicode List, though I think that has mostly been to help keep up the respect for standards of my regular readers. :-)

But Mark Davis did post a pretty helpful little chart about Unicode 5.0 last night that gives the official count of characters in Unicode as of Unicode 5.0 (set to be released soon!):

I tend to use the following (ie, excluding private use & noncharacters).

Unicode          2.0.0   2.1.2   3.0.0   3.1.0   3.2.0   4.0.0   4.1.0   5.0.0
Letter          36,121  36,121  45,443  89,762  89,957  90,547  91,395  92,496
Mark               446     446     575     605     653     941   1,009   1,065
Number             374     374     431     486     536     612     695     836
Punctuation        240     240     288     288     351     360     420     437
Symbol           1,671   1,673   2,414   2,851   3,508   3,764   3,978   4,032
Separator           17      17      19      19      20      21      20      20
Control/Format      81      81      89     194     196     202     203     203
Graphic+C/F     38,950  38,952  49,259  94,205  95,221  96,447  97,720  99,089


Too bad we were just 11 short of 100,000 for 5.0!

Mark

Of course some people pointed out the small math mistake there, though Curtis Clark had the funniest way of saying it:

Maybe it's a floating point error in my calc.exe, but I come up with 911
short (which will be just about right to encode *all* the Phaistos
characters, when the rest of the corpus is discovered). :-)

I'll probably post about the Phaistos characters another day, after the dust settles on the proposal....

It is all goodness though; it was a small typo (this morning Mark admitted to his '...math mistake; sorry for the confusion....'), and there is plenty of time to find 911 new characters (and probably more than that) for Unicode 5.1! :-)
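
For anyone who wants to redo the sum without trusting calc.exe, here is a minimal Python sketch. It assumes only the 5.0.0 column of Mark's chart; the variable names are made up for illustration.

```python
# Per-category counts from the 5.0.0 column of the chart
# (private use and noncharacters are excluded, as in the original post).
unicode_5_0 = {
    "Letter": 92_496,
    "Mark": 1_065,
    "Number": 836,
    "Punctuation": 437,
    "Symbol": 4_032,
    "Separator": 20,
    "Control/Format": 203,
}

# The "Graphic+C/F" row is just the sum of the seven categories.
total = sum(unicode_5_0.values())

# How far that total is from the 100,000 mark.
shortfall = 100_000 - total

print(f"total = {total:,}, short of 100,000 by {shortfall:,}")
# prints: total = 99,089, short of 100,000 by 911
```

The sum matches the Graphic+C/F figure in the chart, so the slip was only in the subtraction.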

 

This post brought to you by "8" and "∞" (U+0038 and U+221E, a.k.a. DIGIT EIGHT and INFINITY)
Two characters who are great friends, merely a single best fit mapping away from each other, and quite confusable if you are lying down!


# Mihai on 13 Apr 2006 2:35 PM:

It is interesting to see that the biggest jump happened between 3.0 and 3.1 (where in the software world dot releases are regarded more as bug fixes).

# Maurits [MSFT] on 13 Apr 2006 3:47 PM:

What separator was lost between 4.0 and 4.1?

# Michael S. Kaplan on 13 Apr 2006 3:57 PM:

???? Huh?

# Maurits [MSFT] on 13 Apr 2006 4:10 PM:

Separators:
2.0.0: 17
2.1.2: 17
3.0.0: 19
3.1.0: 19
3.2.0: 20
4.0.0: 21
4.1.0: 20
5.0.0: 20

Did a separator get reclassified in 4.1?

# Michael S. Kaplan on 13 Apr 2006 4:21 PM:

Yes, this does happen sometimes -- a GC change will cause such categories as those in the list above to change counts....

I do not know offhand what changed, but it is an easy diff between two versions of UnicodeData.txt.

# Maurits [MSFT] on 13 Apr 2006 4:30 PM:

Found it...
U+200B ZERO WIDTH SPACE
http://www.fileformat.info/info/unicode/char/200b/index.htm

In 4.0.0 it was class Zs; in 4.1.0 it was class Cf.
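
The kind of diff described in this thread can be sketched in a few lines of Python. This is only an illustration, not what anyone in the thread actually ran. It assumes the semicolon-separated layout of UnicodeData.txt (code point in the first field, name in the second, General Category in the third). The function names are hypothetical, and the sample lines are hand-written stand-ins rather than lines copied from the real 4.0.0 and 4.1.0 files.

```python
def general_categories(lines):
    """Map code point -> (name, general category) from UnicodeData.txt-style lines."""
    cats = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        # Field 0 is the hex code point, field 1 the name, field 2 the category.
        cats[int(fields[0], 16)] = (fields[1], fields[2])
    return cats


def category_changes(old_lines, new_lines):
    """List (code point, name, old gc, new gc) for characters whose category changed."""
    old = general_categories(old_lines)
    new = general_categories(new_lines)
    changes = []
    # Only compare code points present in both versions; new characters are not "changes".
    for cp in sorted(old.keys() & new.keys()):
        if old[cp][1] != new[cp][1]:
            changes.append((cp, new[cp][0], old[cp][1], new[cp][1]))
    return changes


# Hand-written stand-ins for the two versions (assumption: not the real files).
sample_4_0 = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;",
]
sample_4_1 = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
]

for cp, name, old_gc, new_gc in category_changes(sample_4_0, sample_4_1):
    print(f"U+{cp:04X} {name}: {old_gc} -> {new_gc}")
# prints: U+200B ZERO WIDTH SPACE: Zs -> Cf
```

To run it against real data, pass the lines of two downloaded UnicodeData.txt files instead of the samples. Note that this sketch does not expand the range entries (the paired "First"/"Last" lines), which does not matter when only looking for reclassified characters.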
