Stability of the Unicode Character Database

by Michael S. Kaplan, published on 2005/03/12 20:33 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/12/394716.aspx

The following was overheard on the Unicode List:

Erik van der Poel asked:

Has anyone done a UCD stability survey? The kind of info that I would like to have is, for example, the percentage of characters that have a change in their General Category Value from one version to the next, starting from the beginning (Unicode 1.1.5).

Andrew C. West stepped up with some figures:

According to my calculations, the number of characters which changed their General Category from one version of Unicode to the next is:

1.1.5 -> 2.0.14 = 474 (1.384%)
2.0.14 -> 2.1.2 = 1 (0.0025%)
2.1.2 -> 2.1.5 = 16 (0.0410%)
2.1.5 -> 2.1.8 = 18 (0.0462%)
2.1.8 -> 2.1.9 = 3 (0.0077%)
2.1.9 -> 3.0.0 = 85 (0.2182%)
3.0.0 -> 3.0.1 = 0 (0%)
3.0.1 -> 3.1.0 = 3 (0.0061%)
3.1.0 -> 3.2.0 = 7 (0.0074%)
3.2.0 -> 4.0.0 = 16 (0.0168%)
4.0.0 -> 4.0.1 = 1 (0.0010%)
4.0.1 -> 4.1.0 = 12 (0.0124%)

I don't know what this tells you about the stability of the UCD data though.

The table above packs in a lot of information, but it is hard to judge what these changes mean. Luckily, Ken Whistler noticed this problem and gave the list some context:

The raw number of characters changing is less reflective of stability than considering how many *decisions* to change a property (of one or more characters) were taken.

I intersperse some notes to Andrew West's calculated numbers below, to help put this in context.

> > 1.1.5 -> 2.0.14 = 474 (1.384%)

Many, many, changes, since 1.1.5 was developed in house, without general public review, and since 2.0.14 (the data version corresponding to Unicode 2.0) was the first public release of the data files.

> > 2.0.14 -> 2.1.2 = 1 (0.0025%)

1 decision

> > 2.1.2 -> 2.1.5 = 16 (0.0410%)

2 decisions: addition of Pi/Pf subcategories, and 1 fix for 8 Tibetan characters

> > 2.1.5 -> 2.1.8 = 18 (0.0462%)

1 decision: changes to converge identifier definitions

> > 2.1.8 -> 2.1.9 = 3 (0.0077%)

2 decisions: fix for Greek numeral signs; fix for halfwidth forms light vertical

> > 2.1.9 -> 3.0.0 = 85 (0.2182%)

I'd have to dig further for this, but these were likely mostly changes involved in nailing down normalization for Unicode 3.0.

> > 3.0.0 -> 3.0.1 = 0 (0%)
> > 3.0.1 -> 3.1.0 = 3 (0.0061%)

1 decision: 3 Runic golden numbers

> > 3.1.0 -> 3.2.0 = 7 (0.0074%)

5 decisions: 2 fixes for Khmer signs, 1 for Tamil aytham, 1 for Arabic end of ayah (architectural), 1 for the 3 Mongolian free variation selectors

> > 3.2.0 -> 4.0.0 = 16 (0.0168%)

2 decisions: 1 fix for 12 modifier letters, 1 fix for decimal digit alignment

> > 4.0.0 -> 4.0.1 = 1 (0.0010%)

1 decision: fix for ZWSP

> > 4.0.1 -> 4.1.0 = 12 (0.0124%)

3 decisions: 1 fix for Ethiopic digits, 1 for 2 Katakana middle dots, 1 for Yi syllable wu
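
To make the characters-versus-decisions comparison concrete, here is a small Python sketch that tallies the figures quoted above. The numbers are copied from the two emails; the names `STEPS` and `totals` are made up for this illustration. The two steps for which no decision count was given (1.1.5 -> 2.0.14 and 2.1.9 -> 3.0.0) are left out, as is 3.0.0 -> 3.0.1, which had no changes.

```python
# (characters changed, decisions taken) per release step, copied from the
# figures quoted above. Steps with no stated decision count are omitted.
STEPS = {
    "2.0.14 -> 2.1.2": (1, 1),
    "2.1.2 -> 2.1.5": (16, 2),
    "2.1.5 -> 2.1.8": (18, 1),
    "2.1.8 -> 2.1.9": (3, 2),
    "3.0.1 -> 3.1.0": (3, 1),
    "3.1.0 -> 3.2.0": (7, 5),
    "3.2.0 -> 4.0.0": (16, 2),
    "4.0.0 -> 4.0.1": (1, 1),
    "4.0.1 -> 4.1.0": (12, 3),
}


def totals(steps):
    """Sum the characters changed and the decisions taken over all steps."""
    chars = sum(c for c, _ in steps.values())
    decisions = sum(d for _, d in steps.values())
    return chars, decisions


if __name__ == "__main__":
    for step, (chars, decisions) in STEPS.items():
        print(f"{step:>16}: {chars:3d} characters, {decisions} decision(s)")
    chars, decisions = totals(STEPS)
    print(f"total: {chars} characters across {decisions} decisions")
```

Counted this way, a step such as 2.1.5 -> 2.1.8 looks large by character count (18) but reflects a single decision.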

Ken then went on to summarize the issues behind these numbers:

The significant point of instability in General Category assignments was in establishing Unicode 2.0 data files (now more than 8 years in the past).

There was a significant hiccup for Unicode 3.0, at the point when it became clear that normalization stability was going to be a major issue, and when the data was culled for consistency under canonical and compatibility equivalence.

Since that time, the UTC has been very conservative, indeed, in approving any General Category change for an existing character. The types of changes have been limited to:

Clarification regarding obscure characters for which insufficient information was available earlier.

Establishment of further data consistency constraints (this impacted some numeric categories, and also explains the change for the Katakana middle dot)

Implementation issues with a few format characters (ZWSP, Arabic end of ayah, Mongolian free variation selectors)

Since the publication of Unicode 3.0 in 2000, the only significantly common-use characters that had any General Category change were:

U+0B83 TAMIL SIGN VISARGA (=aytham, Tamil data)
U+200B ZERO WIDTH SPACE (mostly relevant to Thai data)
U+30FB KATAKANA MIDDLE DOT (Japanese)

Of those 3, only U+30FB would exist in any commonly interchanged character set other than Unicode, and *that* change was merely to change a punctuation subclass (gc=Pc --> gc=Po) -- and was additionally a *reversion* to the General Category assignment that U+30FB had in 2.1.5 and earlier.
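
For readers who want to see these changes for themselves, here is a hedged sketch in Python rather than .NET. Python's standard `unicodedata` module exposes both its current database and a frozen Unicode 3.2.0 copy as `unicodedata.ucd_3_2_0`. Going by the timeline above, the Tamil change had already landed in 3.2.0, so that character should show the same category on both sides, while the other two should differ. The helper name `category_pair` is invented for this example.

```python
import unicodedata

# The three characters named above, with their names as given in the post.
CHARS = {
    "\u0b83": "TAMIL SIGN VISARGA",
    "\u200b": "ZERO WIDTH SPACE",
    "\u30fb": "KATAKANA MIDDLE DOT",
}


def category_pair(ch):
    """Return (General Category in Unicode 3.2.0, General Category in the current UCD)."""
    return unicodedata.ucd_3_2_0.category(ch), unicodedata.category(ch)


if __name__ == "__main__":
    print("current UCD version:", unicodedata.unidata_version)
    for ch, label in CHARS.items():
        old, new = category_pair(ch)
        marker = "changed" if old != new else "same"
        print(f"U+{ord(ch):04X} {label}: {old} -> {new} ({marker})")
```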

Excellent -- the most amazing part is that he does so much of that from memory!

So what does that mean for us in the world of the .NET Framework and the new class in Whidbey that captures (among other items) the Unicode general category, as described in "A little bit about the new CharUnicodeInfo class"?

Well, it means two things, primarily:

These values will not change very often.
There are times that some will change. Not many, and there is always a carefully thought-out reason, but it can happen. And the class is not called "CharMicrosoftSpinOnUnicode", which means that by and large the class needs to follow the standard. Any code that you write using the CharUnicodeInfo class must take this into account....
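
One way to take this into account is to never persist a General Category without recording which version of the data produced it. The sketch below illustrates the idea in Python, not with the CharUnicodeInfo API; `classify`, `is_stale` and `categories_for` are hypothetical helpers, and the snapshot layout is an assumption made for this example. The only library facts relied on are `unicodedata.category` and `unicodedata.unidata_version`.

```python
import unicodedata


def classify(text):
    """Snapshot each character's General Category, tagged with the UCD version used."""
    return {
        "ucd_version": unicodedata.unidata_version,
        "categories": [unicodedata.category(ch) for ch in text],
    }


def is_stale(snapshot):
    """True if the snapshot came from a different UCD version than the one now loaded."""
    return snapshot["ucd_version"] != unicodedata.unidata_version


def categories_for(text, snapshot=None):
    """Reuse a stored snapshot of `text` only if its UCD version still matches."""
    if snapshot is None or is_stale(snapshot):
        snapshot = classify(text)
    return snapshot
```

A snapshot saved under one runtime and loaded under a newer one is then recomputed instead of being trusted blindly.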

As Microsoft gets better and better about standards, it will become more and more important for code to recognize that this sort of thing is possible.

Another day I will talk about the Windows side of this story, which is not nearly as cut and dried, or as tidy. But it is an interesting story.... :-)

This post brought to you by "þ" (U+00FE, a.k.a. LATIN SMALL LETTER THORN)

