You can't get this particular bit of proverbial toothpaste back into the tube

by Michael S. Kaplan, published on 2010/04/20 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/04/20/9998063.aspx

Before I forget, I'll wish everyone a happy 420 on this day, 4/20. I'm just saying....

It is often quite ironic that the main point of an inquiry is buried at the end of the inquiry.

Sometimes this is done to take emphasis away from the main point as a means to avoid showing bias.

I'll give a likely example of this on the morning of the first day of the Text Summit (for you MS internal folks!), talking about something vaguely relevant to the Summit but completely irrelevant to pretty much everything being discussed there, including my own talk. :-)

So anyhow, a recent question to the Contact link left me a little nonplussed:

I have looked extensively on your blog but did not receive an answer to this question, so I will just ask directly.

If Microsoft claims to support Unicode, how can it not put the equivalence between Unicode Normalization Forms C and D into collation for Korean?

At first I was not sure how to respond.

I mean, I feel like I had answered this question before in blogs like Theory vs. practice for Korean text collation and Theory vs. practice for Korean text collation, redux.

The conformance to Unicode issue vis-a-vis the UTS 10: Unicode Collation Algorithm is really not an issue at all; as the Unicode Standard itself says in the UCA document which spells out the meaning of a UTS (Unicode Technical Standard):

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Instead of UTS 10, Microsoft has its own independent support of collation, a support that by and large supports much of the intent of UTS 10. It even predates UTS 10; when one considers the number of times that Microsoft weighed in with thoughts/opinions on UTS 10 starting from when it was DUTR 10, images of Microsoft's feature telling the young DUTR "I am your father" are only squelched to avoid "evil empire" jokes.

Now Microsoft supports Unicode, in many ways. Via usage in its products, via hosting their main offices in one of our own campuses in Mountain View CA, via its full membership, via board membership at present and many times in the past.

More to the point at hand, Microsoft supports it by supporting Normalization as defined in UAX 15: Unicode Normalization Forms, which as in the previous case spells out the meaning of a UAX (Unicode Standard Annex):

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Normalization is something that Microsoft has largely supported in a de facto manner in both fonts and collation before it was formally defined as a standard, with the bulk of the exceptions rightfully considered either a) bugs to fix, or b) explicit design decisions not to.

For the Normalization conversion itself, it is completely supported in Microsoft products and has been for many years.

So, to review:

Microsoft supports UAX 15;
Microsoft does not support UTS 10;
Microsoft's own independent implementation of collation is intended to support the same requirements as UTS 10, was largely created before either standard existed, and in a de facto manner happens to support most of both UAX 15 and UTS 10 in its operations.

Now obviously this talk of "largely supports" and "supports most of" carries an implicit statement that there are times the support isn't there.

And the biggest "exception" to the idea of generally supporting them both is, ironically in the Alanis or maybe Britney sense, the most likely central point of that original question I was asked, even though it was the word at the end:

Korean.

Now Korean has an interesting place in languages, and in Unicode.

A "perfect alphabet" developed in the 1400s by a king who wanted it to be easy for everyone to read and write, something opposed by the powers that could (as I discussed slightly in some of the introductory material to Traditional versus modern sorts), one could argue that its encoding in Unicode and ISO 10646 is anything but perfect.

It is not only technically encoded 4 times in Unicode (as I mentioned in One more thing about Korean....).

But because of the very natural and rational direct connection between the composed Hangul and decomposed Jamo, an operation that has a sound linguistic basis that no unbiased linguist would challenge, Unicode and ISO 10646 have borne the displeasure of the government for so long that the war of attrition finally succeeded in getting some Jamo added that were in theory already encoded (ref: Using a character proposal for a 'repertoire fence' extension).

Though there were no widely used implementations available in practice, due to the government itself actively seeking to discourage those implementations that existed.

I have referred to the chutzpah of killing one's parents and begging the court for leniency on the grounds of being an orphan in the past. :-)
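For anyone who has not seen how direct that connection between composed Hangul and decomposed Jamo is, here is a minimal sketch in Python of the arithmetic the Unicode Standard describes for modern Hangul syllables. The function name decompose_hangul is made up for illustration; the constants are the ones given in the Standard's section on conjoining Jamo behavior. It is a sketch of the idea, not anything Microsoft ships.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 vowel/trailing combinations per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed modern syllables


def decompose_hangul(syllable: str) -> str:
    """Return the conjoining Jamo sequence for one precomposed syllable.

    Hypothetical helper: anything outside the precomposed syllable
    block is returned unchanged.
    """
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable
    leading = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    # A trailing index of 0 means the syllable has no final consonant.
    trailing = chr(T_BASE + t_index) if t_index else ""
    return leading + vowel + trailing


# U+D55C decomposes to U+1112 U+1161 U+11AB
print([f"U+{ord(c):04X}" for c in decompose_hangul("\uD55C")])
```

No lookup table is involved at all, just division and remainder, which is a large part of why the equivalence is so hard to argue with on technical grounds.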

In that context, one can look at this exception to the degree to which Microsoft supports UAX 15 in collation as largely an effort in support of the government's desire to not so widely support an equivalence between the two forms of Korean characters, for modern Hangul.

A good way to placate, if nothing else.

From technical and "almost a linguist, minus the education" perspectives I may find the solution unsatisfying, though the workaround is easy enough: convert both strings to the same Normalization form and then compare them. This allows both the people who know they are the same to be happy to see their knowledge confirmed while still allowing those who need to differentiate them to feel they are treated differently.
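A minimal sketch of that workaround, with Python's standard unicodedata module standing in for whatever normalization function your platform provides (this post does not name a specific Microsoft API, so treat the function choice as an assumption). The helper name equal_after_normalization is made up for illustration.

```python
import unicodedata


def equal_after_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\uD55C"                # one precomposed Hangul syllable
decomposed = "\u1112\u1161\u11AB"  # the same syllable as three conjoining Jamo

# Compared as-is, the two spellings are different code point sequences.
print(composed == decomposed)                           # False
# Normalized to a common form first, they compare equal.
print(equal_after_normalization(composed, decomposed))  # True
```

Whether you pick Form C or Form D does not matter for the comparison, as long as both strings go through the same one. Those who need to tell the two spellings apart simply skip the normalization step.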

To be honest, Unicode does what it does via a long process that started as a sensible defining of canonical decompositions that finally became Normalization, all in a way that made conformance guarantees to Unicode and compliance requirements to other standards that use the Normalization definitions.

I suspect if they knew everything we know now, even those initial canonical decompositions would likely have been defined as something else (not compatibility decompositions but some other, third type), and would have saved over a decade of headaches, both political and the other kind.

But it is too late now, as you can't get this particular bit of proverbial toothpaste back into the tube.

In the current situation, there is nothing you can't get if you use the support methods and functions that Microsoft provides. Even if you have to work a little harder to make some of it happen....

And Microsoft is conformant to Unicode here, completely. It "fails" the test of being 100% conformant to the goals of UTS 10 in an area that Unicode itself would likely skip if it could. Which is a wonderful advantage Microsoft (and every other company) has over Unicode in this case, in my opinion!

The moral of the story -- be careful what you promise. When there are people recording what is said, at least!
