Newer, stronger, more case pair stability! The world's first 5.1 million dollar character encoding standard!

by Michael S. Kaplan, published on 2008/02/29 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/02/29/7948186.aspx


Please read the disclaimer; content not approved by Microsoft!

(Apologies to Steve Austin!)

The announcement went out to quite a few interested parties, and it occurred to me that some of them might also be here!

It went like this:

The Unicode Consortium has recently strengthened the Unicode Character Encoding Stability Policy in accordance with the recommendations of the Unicode Technical Committee, adding the following new stability constraints:

In addition, the text of the Property Value stability constraints has been edited for clarity, adding the formal property names and property value names.

See http://www.unicode.org/policies/stability_policy.html

I added the emphasis above since it was an interesting point for Microsoft in particular.

The exact addition was:

Case Pair Stability

Applicable Version: Unicode 5.0+

Two assigned characters form a case pair when the full uppercase of the first character is the second character, and the full lowercase of the second character is the first character.

If two characters form a case pair in a version of Unicode, they will remain a case pair in each subsequent version of Unicode.

If two characters do not form a case pair in a version of Unicode, they will never become a case pair in any subsequent version of Unicode.

More formally, for given versions V and U of Unicode, and any two characters X and Y that are both assigned according to both V and U:

toLowercaseV(X) = Y AND toUppercaseV(Y) = X

if and only if

toLowercaseU(X) = Y AND toUppercaseU(Y) = X

Note that these conditions apply to two existing, assigned characters. A character that is not part of a case pair could become part of one if the new case pair is formed at the time of the addition of a new character to Unicode. For example, a new capital version of U+028D ( ʍ ) LATIN SMALL LETTER TURNED W could be added in the future to form a new case pair.
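To make the formal statement a bit more concrete, here is a minimal sketch of the check in Python. Everything in it is an assumption for illustration: the names `is_case_pair` and `stability_violations` are made up, the casing tables are tiny hand-written dictionaries rather than real Unicode data, and the private-use character U+E000 merely stands in for a capital turned W that does not exist.

```python
from typing import Dict, Set, Tuple

CaseTable = Dict[str, str]  # one character -> its full case mapping


def is_case_pair(x: str, y: str, to_lower: CaseTable, to_upper: CaseTable) -> bool:
    """True when toLowercase(X) = Y and toUppercase(Y) = X for distinct X and Y."""
    # Characters missing from a table are treated as mapping to themselves.
    return x != y and to_lower.get(x, x) == y and to_upper.get(y, y) == x


def stability_violations(
    assigned_v: Set[str], lower_v: CaseTable, upper_v: CaseTable,
    assigned_u: Set[str], lower_u: CaseTable, upper_u: CaseTable,
) -> Set[Tuple[str, str]]:
    """Pairs (X, Y), assigned in both versions, whose case-pair status differs."""
    common = assigned_v & assigned_u  # the policy only covers characters in both
    return {
        (x, y)
        for x in common
        for y in common
        if is_case_pair(x, y, lower_v, upper_v) != is_case_pair(x, y, lower_u, upper_u)
    }


# Hypothetical toy "version V": A/a is the only case pair.
ASSIGNED_V = {"A", "a", "\u028D", "\u0241", "\u0294"}
LOWER_V = {"A": "a"}
UPPER_V = {"a": "A"}

# Allowed change: a NEW character arrives and pairs with U+028D.
NEW_CAPITAL = "\uE000"  # private-use stand-in, not a real capital turned W
ASSIGNED_OK = ASSIGNED_V | {NEW_CAPITAL}
LOWER_OK = {**LOWER_V, NEW_CAPITAL: "\u028D"}
UPPER_OK = {**UPPER_V, "\u028D": NEW_CAPITAL}

# Disallowed change: two EXISTING characters are turned into a case pair.
LOWER_BAD = {**LOWER_V, "\u0241": "\u0294"}
UPPER_BAD = {**UPPER_V, "\u0294": "\u0241"}

print(stability_violations(ASSIGNED_V, LOWER_V, UPPER_V,
                           ASSIGNED_OK, LOWER_OK, UPPER_OK))
print(stability_violations(ASSIGNED_V, LOWER_V, UPPER_V,
                           ASSIGNED_V, LOWER_BAD, UPPER_BAD))
```

The first call reports no violations, since the new pair involves a character that was not assigned in the earlier version. The second call reports the (U+0241, U+0294) pair, which is the kind of change the policy now rules out.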

You see, this was done largely at the request of Microsoft.

It was really due to the fact that Unicode and Microsoft each had casing stability rules, and the two were not entirely compatible. That could easily lead to future problems with Microsoft moving to keep more up to date with the standard (as they did in Vista) if issues like the additions to Unicode I talked about in Every character has a story #13: U+0241 and U+0294 (upper and lower case glottal stops) were to happen again.

Because there are so many components of Windows that depend on its casing tables, changes like that would really not be possible. Therefore it matters to be able to make sure that two letters that were defined in a version of the standard, but were not considered to be cased variants of each other, could sit in the same directory in an NTFS partition without some future version claiming that they could not anymore....
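A rough way to picture the directory problem is the sketch below. It is a simplified model and not how NTFS actually stores or compares names: the helper names `fold` and `collides`, the two tiny uppercase tables, and the file names are all hypothetical, chosen only to show how a changed casing table could make two previously distinct names compare as equal.

```python
from typing import Dict


def fold(name: str, upcase: Dict[str, str]) -> str:
    """Uppercase a name one character at a time using the given table."""
    # Characters without an entry are left unchanged.
    return "".join(upcase.get(ch, ch) for ch in name)


def collides(a: str, b: str, upcase: Dict[str, str]) -> bool:
    """True if two different names are equal once folded with this table."""
    return a != b and fold(a, upcase) == fold(b, upcase)


# Hypothetical table in which the two glottal stops are unrelated.
OLD_UPCASE = {"a": "A", "t": "T", "x": "X"}

# Hypothetical later table that makes U+0294 uppercase to U+0241.
NEW_UPCASE = {**OLD_UPCASE, "\u0294": "\u0241"}

first = "\u0294.txt"
second = "\u0241.txt"

print(collides(first, second, OLD_UPCASE))  # the names coexist
print(collides(first, second, NEW_UPCASE))  # the names now clash
```

Under the first table the two files are simply different names. Under the second table they fold to the same string, so a case-insensitive lookup could no longer tell them apart.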

This is the fun kind of thing that I am really happy to see Microsoft do, after spending so many years sitting in Unicode without trying to drive the requirements of its own software into the standard as well.

I remember talking with Asmus, Mark, and Ken at separate times before the meeting, and all of them were very supportive -- primarily since a Microsoft that is closer to the published standard is not just a good thing for Microsoft; it is also a good thing for Unicode!

So it just makes sense if their stability policies can be aligned.

Now at the same time, it is important (in my opinion) for Microsoft to not abuse the implied power in people thinking along those lines, especially watching how another not-too-long-ago example played out in the end (the Devanagari Sindhi characters escalated into Unicode 5.0 despite the synchronization issues with 10646 when the originally hoped-for but never promised implementations never managed to appear).

But examples like this are pretty rare, and Microsoft has in the past been less proactive about things than they probably should have been, so as long as the behavior is responsible I think it really is a good thing. I wish Microsoft could be more involved in standards like Unicode than they are, sometimes!

 

This post brought to you by ॻ, ॼ, ॾ, and ॿ (097b, 097c, 097e, and 097f -- DEVANAGARI LETTERS GGA, JJA, DDDA, and BBA -- the Sindhi implosives)

