Unicode isn't advanced mathematics

by Michael S. Kaplan, published on 2006/04/13 12:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/13/575703.aspx

As Peter Vogel pointed out a few years back, Unicode is hardly nuclear physics. I am pretty sure it's not advanced mathematics, either (a point that was just proven again last night!).

It has been a while since I have posted about things from the Unicode List, though I think that has mostly been to help preserve my regular readers' respect for standards. :-)

But Mark Davis did post a pretty helpful little chart about Unicode 5.0 last night that gives the official count of characters in Unicode as of Unicode 5.0 (set to be released soon!):

I tend to use the following (ie, excluding private use & noncharacters).

Unicode          2.0.0   2.1.2   3.0.0   3.1.0   3.2.0   4.0.0   4.1.0   5.0.0
Letter          36,121  36,121  45,443  89,762  89,957  90,547  91,395  92,496
Mark               446     446     575     605     653     941   1,009   1,065
Number             374     374     431     486     536     612     695     836
Punctuation        240     240     288     288     351     360     420     437
Symbol           1,671   1,673   2,414   2,851   3,508   3,764   3,978   4,032
Separator           17      17      19      19      20      21      20      20
Control/Format      81      81      89     194     196     202     203     203
Graphic+C/F     38,950  38,952  49,259  94,205  95,221  96,447  97,720  99,089

Too bad we were just 11 short of 100,000 for 5.0!

Mark

Of course some people pointed out the small math mistake there, though Curtis Clark had the funniest way of saying it:

Maybe it's a floating point error in my calc.exe, but I come up with 911
short (which will be just about right to encode *all* the Phaistos
characters, when the rest of the corpus is discovered). :-)

I'll probably post about the Phaistos characters another day, after the dust settles on the proposal....

It is all goodness though; it was a small typo (this morning Mark admitted to his '...math mistake; sorry for the confusion....'), and there is plenty of time to find 911 new characters (and probably more than that) for Unicode 5.1! :-)
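For anyone who does not trust calc.exe either, here is a minimal Python sketch of the arithmetic. The numbers are copied from the 5.0.0 column of Mark's chart; the variable names (`counts_5_0`, `total`, `shortfall`) are mine, purely for illustration.

```python
# Category counts for Unicode 5.0.0, copied from the chart in the post.
counts_5_0 = {
    "Letter": 92_496,
    "Mark": 1_065,
    "Number": 836,
    "Punctuation": 437,
    "Symbol": 4_032,
    "Separator": 20,
    "Control/Format": 203,
}

# Sum the categories, then see how far that is from 100,000.
total = sum(counts_5_0.values())
shortfall = 100_000 - total

print(total)      # 99089, matching the Graphic+C/F row
print(shortfall)  # 911, not 11
```

So the chart's own total is right; only the "11 short" remark dropped a digit.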

This post brought to you by "8" and "∞" (U+0038 and U+221e, a.k.a. DIGIT EIGHT and INFINITY)
Two characters who are great friends, merely a single best fit mapping away from each other, and quite confusable if you are lying down!

# Mihai on 13 Apr 2006 2:35 PM:

It is interesting to see that the biggest jump happened between 3.0 and 3.1 (where in the software world dot releases are regarded more as bug fixes).
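That observation is easy to check against the Graphic+C/F row of the chart. The sketch below is only an illustration; the names `versions`, `totals`, `increases` and `biggest` are made up for it.

```python
# Version labels and Graphic+C/F totals, copied from the chart in the post.
versions = ["2.0.0", "2.1.2", "3.0.0", "3.1.0", "3.2.0", "4.0.0", "4.1.0", "5.0.0"]
totals = [38_950, 38_952, 49_259, 94_205, 95_221, 96_447, 97_720, 99_089]

# Increase in the total from each version to the next one.
increases = {
    f"{old} -> {new}": after - before
    for old, new, before, after in zip(versions, versions[1:], totals, totals[1:])
}

biggest = max(increases, key=increases.get)
print(biggest, increases[biggest])  # 3.0.0 -> 3.1.0 44946
```

By the same chart, 44,319 of those 44,946 characters are in the Letter row.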

# Maurits [MSFT] on 13 Apr 2006 3:47 PM:

What separator was lost between 4.0 and 4.1?

# Michael S. Kaplan on 13 Apr 2006 3:57 PM:

???? Huh?

# Maurits [MSFT] on 13 Apr 2006 4:10 PM:

Separators:
2.0.0: 17
2.1.2: 17
3.0.0: 19
3.1.0: 19
3.2.0: 20
4.0.0: 21
4.1.0: 20
5.0.0: 20

Did a separator get reclassified in 4.1?

# Michael S. Kaplan on 13 Apr 2006 4:21 PM:

Yes, this does happen sometimes -- a GC change will cause such categories as those in the list above to change counts....

I do not know offhand what changed, but it is an easy diff between two versions of UnicodeData.txt.

# Maurits [MSFT] on 13 Apr 2006 4:30 PM:

Found it...
U+200B ZERO WIDTH SPACE
http://www.fileformat.info/info/unicode/char/200b/index.htm

In 4.0.0 it was class Zs; in 4.1.0 it was class Cf.
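A rough sketch of the kind of diff that finds this, assuming the usual UnicodeData.txt layout of semicolon-separated fields (code point, name, General Category, ...). The function names and the two sample strings are hypothetical: the samples are hand-written stand-ins that reflect the change reported above, not copies of the real 4.0.0 and 4.1.0 files, and the First/Last range entries in the real file are treated here as ordinary lines.

```python
def parse_general_categories(text):
    """Map code point -> (name, general category) from UnicodeData.txt-style text."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0 is the hex code point, field 1 the name, field 2 the category.
        result[int(fields[0], 16)] = (fields[1], fields[2])
    return result


def category_changes(old_text, new_text):
    """List characters present in both versions whose category differs."""
    old = parse_general_categories(old_text)
    new = parse_general_categories(new_text)
    changes = []
    for cp in sorted(old.keys() & new.keys()):
        if old[cp][1] != new[cp][1]:
            changes.append((f"U+{cp:04X}", new[cp][0], old[cp][1], new[cp][1]))
    return changes


# Hand-written stand-ins for the two file versions (illustrative only).
SAMPLE_4_0 = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Zs;0;BN;;;;;N;;;;;
"""
SAMPLE_4_1 = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
"""

print(category_changes(SAMPLE_4_0, SAMPLE_4_1))
# [('U+200B', 'ZERO WIDTH SPACE', 'Zs', 'Cf')]
```

With the real files, the same function would take the contents of each version's UnicodeData.txt read in as strings.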
