It would be like spelling it Anerica or something.

by Michael S. Kaplan, published on 2010/08/17 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/08/17/10050816.aspx

Now in the past I have talked about Microsoft's relationship with the Unicode Collation Algorithm, in blogs such as:

And I talked about some of the technical differences between the two, as well as some of the reasons behind the differences.

In my view, the UCA and the tailorings in the CLDR still change a bit too often, and I am mostly happy with what Microsoft does. But neither those facts nor the exceptions to the latter that make me say "mostly" are the subject of today's blog.

Today I'm going to talk about one of the biggest philosophical differences that I see as a blocking issue to the idea of using either the Unicode Collation Algorithm or several other Unicode standards in Microsoft products, except in the case of data being transmitted elsewhere.

It is not that I am calling them bad -- they are not. It is just that the issue can really block the notion of considering certain operations to be desirable as built-in operations performed on all data.

Note that in some cases, such operations may already be built into some applications or APIs -- my opinion on those is either already known and mentioned, or you can likely guess what it would be.

It is a principle you can see in Microsoft platform pieces: in parts of Windows such as the NLS encodings and the code in and atop the NTFS file system, in Jet Blue products (e.g. Exchange, Active Directory), and in SQL Server.

It was not explicitly mentioned, but it sort of applies in "Normalization and Microsoft -- whats the story?", too.

You may be able to guess what that principle is.

Your guess?

I'll just mention it; you can pretend you knew what I was getting at all along. :-)

It is leaving the data alone and not screwing with it.

Thus not uppercasing or lowercasing text just because you wanted to ignore case (and losing the information), not normalizing to Unicode Normalization Form C or Form D (and losing something unique about the original form it was in), and not stripping certain "ignorable" characters (which turn out to change meanings when the characters are gone).

Just in case you were doing something special with that text that showed different results or looked different.
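To make the "losing the information" part concrete, here is a minimal sketch using Python's standard library. The sample strings are my own picks for illustration, not anything taken from a Microsoft product; the point is only that each operation is a one-way street.

```python
import unicodedata

# Case conversion: uppercasing the German sharp s expands it to "SS",
# and lowercasing afterwards cannot bring the original letter back.
original_word = "Straße"
uppercased = original_word.upper()
round_tripped = uppercased.lower()

# Normalization: ANGSTROM SIGN (U+212B) is canonically equivalent to
# LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5), so NFC replaces it.
# Once that happens, nothing records which one the user actually typed.
angstrom_sign = "\u212b"
normalized = unicodedata.normalize("NFC", angstrom_sign)

print(uppercased, round_tripped, round_tripped == original_word)
print(f"U+{ord(angstrom_sign):04X} -> U+{ord(normalized):04X}")
```

If the transformed text is what gets stored, the original is gone for good, whether or not anyone cared about the difference.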

Now each of these operations can, from Microsoft's and/or a language's point of view, be really destructive to data, whether one considers the stuff I mentioned in this blog or this other blog or 2.3b of Unicode's UAX #31 (Unicode Identifier and Pattern Syntax).

For that last case, the original assumptions about characters, such as the ignorability of ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER, led to several problems: languages such as Sinhala were seeing strings broken by normal processes that Unicode was originally recommending, because the full consequences of algorithms and layout/shaping rules (and the interaction of all of them) were not fully understood. Take the following two strings:

Text (in case your browser knows how to render this properly)	Image (in case the text doesn't render right for you)	Unicode code points
ශ්‍රී ලංකා		0dc1 0dca 200d 0dbb 0dd3 0020 0dbd 0d82 0d9a 0dcf
ශ්රී ලංකා		0dc1 0dca 0dbb 0dd3 0020 0dbd 0d82 0d9a 0dcf

The first of them is meaningful -- it is the name of the country Sri Lanka in Sinhala -- while the other is not. And the UAX lists other similar examples, though perhaps none as bad to get wrong as the name of a country.
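For anyone who wants to poke at the two strings directly, here is a small Python sketch that rebuilds them from the code points in the table and shows that a naive "strip the ignorable character" pass turns the first string into the second. The helper name from_code_points is just something made up for this example.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

def from_code_points(hex_list: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in hex_list.split())

# Code points copied from the table above.
sri_lanka = from_code_points("0dc1 0dca 200d 0dbb 0dd3 0020 0dbd 0d82 0d9a 0dcf")
broken = from_code_points("0dc1 0dca 0dbb 0dd3 0020 0dbd 0d82 0d9a 0dcf")

# A process that treats ZWJ as ignorable and removes it...
stripped = sri_lanka.replace(ZWJ, "")

# ...produces exactly the second, meaningless string.
print(len(sri_lanka), len(broken), stripped == broken)
```

One invisible code point is the whole difference, which is why it is so easy for a well-meaning cleanup step to destroy it.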

Like spelling America as Anerica or something, because of some truncation operation that clipped a letter. Would you want such an operation running on your machine? :-)

Anyway, the fact that database platforms like Jet or SQL Server (and file system platforms like NTFS) do not normalize means that no version of these products screwed these strings up. And comparison operations still treated equal things as being the same by giving the different forms the same weights, never transforming the strings as part of the storage or the comparison logic.
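The general shape of that idea can be sketched in Python. To be clear, this is not how Jet, SQL Server, or NTFS actually implement it: comparison_key is a hypothetical helper, and normalizing plus case folding is only a rough stand-in for real collation weights. What matters is that the key is derived at comparison time and thrown away, while the stored strings are never rewritten.

```python
import unicodedata

def comparison_key(text: str) -> str:
    """Hypothetical stand-in for collation weights (an assumption, not a real API).

    The key exists only for the duration of a comparison; it is never
    written back over the stored text.
    """
    return unicodedata.normalize("NFD", text).casefold()

def equivalent(a: str, b: str) -> bool:
    """Compare two strings by their derived keys, leaving both untouched."""
    return comparison_key(a) == comparison_key(b)

# Three different stored forms of the same word:
# precomposed e-acute, e + COMBINING ACUTE ACCENT, and an uppercase variant.
stored = ["caf\u00e9", "cafe\u0301", "CAFE\u0301"]

for value in stored:
    print([f"{ord(ch):04x}" for ch in value], equivalent(value, stored[0]))
```

All three compare as equal, yet each one is still sitting in storage in exactly the form it arrived in.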

As a point of comparison (by which I mean contrast), I am told that some platforms and databases transform the data and store just one form, since Unicode "rules" allow it and it makes some operations easier.

Now I could claim that this was because we were smarter, but I would be lying.

It was just that these platforms and products were formed in the primordial stew before Unicode had ideas like canonical equivalence and ignorable characters and such, then later no one wanted to change anything.

Partially this may have been laziness, and partially it was a reluctance to change code that worked. But even then, the idea of not "screwing up" data was present in the minds of some people. I mean, if people took the time to make something different, then they may have had their reasons.

Thus Microsoft has had a long history of not wanting to go the Unicode way, since Unicode's eagerness for process and algorithm and operations, which has messed things up in an earlier version and then fixed them in a later version, feels a bit young at times.

Of course with a new generation of people in charge of things and those who were there before either gone or just elsewhere, I am clearly speaking of the past; I have no idea if these philosophical principles still guide the product.

Though I am pretty sure NTFS will still keep working the same way no matter what happens. :-)
