by Michael S. Kaplan, published on 2010/11/09 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/09/10087473.aspx
The year was 2004.
The Blog you are reading now had just a few blogs in it.
And I wrote a blog titled Microsoft does not use the Unicode Collation Algorithm.
The year was 2008.
Thousands of blogs had been added to this Blog since that earlier blog.
And I wrote a blog titled Microsoft still does not use the UCA; the converse is also true.
Nothing has changed, it is all still true.
Though over the years, as these two different implementations worked to cover this single large space, their functionality has overlapped. And each implementation, in its efforts to do the right thing, has often not paid enough heed when the other had already realized that a particular solution was a bad idea for one reason or another.
Now by itself this does not mean that it would necessarily be a mistake to solve the problem in that particular way -- at times there are underlying architectural reasons why the differences exist and there is not much reason to try and change those differences.
With all that said, part of a recent release to the Unicode Announcements alias struck me as interesting. The text of the announcement read in part:
Mountain View, CA, USA – October 29, 2010 – The new version of Unicode Technical Standard #10, Unicode Collation Algorithm (UCA), has been updated for Unicode Version 6.0, adding support for 2,088 characters in sorting, searching, and matching. Also in this release new data files for support of the Unicode Common Locale Data Repository (CLDR), which provides customization for different languages.
Reorderable Categories. The data files for CLDR order characters strictly by certain major categories. This allows programmers to parametrically reorder these groups of characters to put them in the desired order for different languages. For example, numbers can be ordered after letters, or Cyrillic before Latin. The reorderable categories are:
whitespace, punctuation, general symbols, currency symbols, and numbers, then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally, CJK.
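To make the idea of "parametric reordering" concrete, here is a minimal sketch of what reordering whole groups means. It is not the UCA, not the CLDR data, and not anyone's real implementation: the group names, the helpers `group_of`, `move_before`, and `make_sort_key`, and the crude classification by Unicode character name are all assumptions made for illustration, and only a small subset of the announced categories is covered.

```python
import unicodedata

# Assumed default order of groups: a small subset of the announced list.
DEFAULT_ORDER = ["whitespace", "punctuation", "symbol", "currency", "digit",
                 "Latin", "Greek", "Cyrillic", "other"]

def group_of(ch):
    """Crudely classify one character into a reorderable group (illustrative only)."""
    cat = unicodedata.category(ch)
    if cat.startswith("Z") or ch.isspace():
        return "whitespace"
    if cat.startswith("P"):
        return "punctuation"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    if cat == "Nd":
        return "digit"
    # Guess the script from the start of the character's Unicode name.
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "other"

def move_before(order, group, target):
    """Return a copy of `order` with `group` moved to just before `target`."""
    rest = [g for g in order if g != group]
    i = rest.index(target)
    return rest[:i] + [group] + rest[i:]

def make_sort_key(order=DEFAULT_ORDER):
    """Build a sort key that compares the group rank first, then the character."""
    rank = {g: i for i, g in enumerate(order)}
    def key(s):
        return [(rank[group_of(ch)], ch.casefold(), ch) for ch in s]
    return key

words = ["яблоко", "apple", "42", "άλφα"]

default_sorted = sorted(words, key=make_sort_key())
cyrillic_first = sorted(
    words, key=make_sort_key(move_before(DEFAULT_ORDER, "Cyrillic", "Latin")))
digits_last = sorted(
    words, key=make_sort_key(move_before(DEFAULT_ORDER, "digit", "other")))

print(default_sorted)   # digits, then Latin, Greek, Cyrillic
print(cyrillic_first)   # digits, then Cyrillic, Latin, Greek
print(digits_last)      # Latin, Greek, Cyrillic, then digits
```

The point of the sketch is only that the reordering is a change to one small table of group ranks rather than a per-character tailoring; within a group the comparison here is plain code point order, which a real collation implementation would not use.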
Microsoft did something like that years ago.
Not a configurable system to do it, but an explicit change for one sort.
You may remember reading about it, in one or more of the following blogs:
As that last blog pointed out, we removed the customization because we ultimately deemed it to be not such a great idea.
Now perhaps what is being done differently here will make it not such a big deal that the latest version of Unicode added a flexible architectural feature that Microsoft started to realize was a bad idea at least eight years ago and finally removed from its implementation four years ago.
I don't know, since I haven't looked beyond the announcement itself.
Of course I have no way of knowing whether the issue was mentioned by any of the Microsoft representatives present (I wasn't there, and no one who was there mentioned it to me until after everything was done and I was pointed at the announcement mail in a generic sense, when everyone was).
Not to mention that, to be honest, I don't think very many of the support issues and problems that came up with this Korean "feature" (here or in Korea) ever made it much into the public, either. We barely even documented it, except in one oblique doc comment that no one understood.
And those three blogs of mine,
Since Microsoft currently uses neither the Unicode Collation Algorithm nor the CLDR tailorings of it, I don't have too much of a specific business reason to do much more here.
But I figure I can mention it here, at least.
If people using this new flexible algorithm start running into strange complications, compatibility issues, or other problems....just remember I told everyone so, right here. Even if I did so just a bit too late.....
John Cowan on 9 Nov 2010 10:00 AM:
I have to admit I don't really understand from your posts just what the feature was that MS introduced and then removed, but it seems to have been specific to Korean. The new ICU feature has to do with tailoring the sort order at the script level rather than at the character level. It doesn't provide essentially new functionality, it just makes script-level reordering easier to do rather than having to go through and tailor each and every character of the script. So if you want Latin > Coptic > Greek in the index to your book on Coptic, for example, you can easily get it.
Michael S. Kaplan on 9 Nov 2010 2:35 PM:
The Korean feature was to put all the Hangul and Hanja in front of the other scripts, including Latin. So basically the kind of thing you could do with this new feature in the UCA in your app, though the app in this case is the size of Windows. :-)