SharePoint and CJK Extensions A, B, C, D, and even E?

by Michael S. Kaplan, published on 2011/12/12 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/12/12/10246749.aspx


So, the question I got the other day was:

We are setting up SharePoint and want to know what collation to use. What support does SQL Server have for CJK Extensions A/B/C/D?

Now that's an interesting question.

If you think of SQL Server 2000 as the first version to support the current architecture of collation in SQL Server, it is fair to say that SQL Server 2000 did not support any of those four CJK extension ranges.

Similarly, Windows 2000 didn't support any of them either.

Then, starting in XP and continuing in Windows Server 2003, something interesting happened.

Basically, support was added for CJK Extension A that placed all of Extension A at the end of the list in the default table.

Support was also added for all of the high and low surrogates in planes 1, 2, 15, and 16.

This was done using the same info I added to The basics of supplementary for those four planes, by assigning weights to:

Two interesting side effects here -- first, the noncharacter sentinels in each plane were given weight, and second, every code point in Planes 1 and 2, whether it had a character assigned yet or not, was given some weight.

Now note that Extension B, Extension C, and Extension D are all located in Plane 2 -- which means that every single ideograph in CJK Extension B that was assigned at the time, the ideographs in CJK Extension C and CJK Extension D that were assigned later, and all of the not-yet-assigned space, including the part roadmapped as CJK Extension E, were given weight.

Code point order, of course. But some order is better than giving them no weight, right? :-)
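To make the plane arithmetic concrete, here is a minimal Python sketch. The block ranges are taken from the Unicode code charts rather than from this post, and the helper names (`plane`, `surrogate_pair`) are made up for illustration. It only shows where the extensions live and what "code point order" means for them; it says nothing about the actual weights Windows or SQL Server assign.

```python
# Block ranges as published in the Unicode code charts (an outside fact,
# not something stated in the post itself).
CJK_EXTENSIONS = {
    "A": (0x3400, 0x4DBF),
    "B": (0x20000, 0x2A6DF),
    "C": (0x2A700, 0x2B73F),
    "D": (0x2B740, 0x2B81F),
    "E": (0x2B820, 0x2CEAF),
}


def plane(cp: int) -> int:
    """Return the Unicode plane number (0-16) for a code point."""
    return cp >> 16


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Return the UTF-16 (high, low) surrogates for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


for name, (first, last) in CJK_EXTENSIONS.items():
    print(f"Extension {name}: U+{first:04X}..U+{last:04X}, plane {plane(first)}")

# U+20000, the first Extension B ideograph, is D840 DC00 in UTF-16.
high, low = surrogate_pair(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")

# Code point order within Plane 2: B before C before D.
by_code_point = sorted(["\U0002A700", "\U00020000", "\U0002B740"], key=ord)
print([f"U+{ord(ch):04X}" for ch in by_code_point])
```

Extension A comes out in Plane 0, and B through E all come out in Plane 2, which is why a blanket "give every Plane 2 surrogate pair a weight" approach covers them whether or not the characters were assigned at the time.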

SQL Server 2005 basically picked up these additions, but only for a few of the newly added collations.

They thus introduced the notion of having code points that have weight in some collations but not others.

But again just code point order within the ranges (and Extension A after Plane 2).

Now enter Vista, Windows Server 2008, Windows 7, Windows Server 2008 R2, SQL Server 2008, and SQL Server 2008 R2, which all added sorts with linguistic relevance to (depending on the collation) some or all of the ideographs in CJK Extensions A and B.

And every ideograph not included there keeps those same default weights that stick them at the end (though at least we put Extension A before its later counterparts!).

Note that no linguistically relevant info is used for CJK Extensions C and D....
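The resulting behavior can be pictured as a tiered sort key. What follows is a toy model only, assuming a hypothetical `LINGUISTIC_WEIGHTS` table with made-up numbers; it is not the real sort table of any Windows or SQL Server collation, and it ignores everything that is not an ideograph. Ideographs the collation knows about sort by their linguistic weight, leftover Extension A ideographs come next, and everything else trails in code point order.

```python
# CJK Extension A block, U+3400..U+4DBF (from the Unicode code charts).
EXT_A = range(0x3400, 0x4DC0)

# Hypothetical table: invented weights standing in for whatever linguistic
# data (stroke count, pronunciation, ...) a given collation actually carries.
LINGUISTIC_WEIGHTS = {
    "\u4e00": 1,        # U+4E00, in the original unified ideograph range
    "\U00020000": 2,    # U+20000, pretend this collation covers it
}


def toy_sort_key(ch: str) -> tuple[int, int]:
    """Sort key for a single ideograph under the toy three-tier model."""
    cp = ord(ch)
    if ch in LINGUISTIC_WEIGHTS:
        return (0, LINGUISTIC_WEIGHTS[ch])  # linguistically relevant weight
    if cp in EXT_A:
        return (1, cp)  # Extension A leftovers, ahead of the later extensions
    return (2, cp)      # everything else, in code point order


sample = ["\U0002A700", "\u3401", "\U00020000", "\u4e00", "\U0002B740"]
ordered = sorted(sample, key=toy_sort_key)
print([f"U+{ord(ch):04X}" for ch in ordered])
```

In this model the Extension C and D characters always land in the last tier, since no table entry exists for them, which matches the note above about those two extensions.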

Anyway, that answers the question about SharePoint, I think. :-)


John Cowan on 12 Dec 2011 8:37 AM:

(Crap, your comment system threw away my comment again.  Gotta remember to *always* save it to the clipboard before clicking "Post".)

I personally think that the powers that be (the IRG, the UTC, Microsoft) fell down on the job here.  As new extensions were added to CJK, the URO should have been properly extended so that the default collation table correctly interfiled the new characters into a single overall radical-stroke order.  Or if not the default table (as that would make it very large), at least a defined alternative table.

The job would have been (would be) very messy and error-prone due to its size, as well as somewhat arbitrary, given that the rarer a hanzi is, the harder it is to answer the persistent question "What is the radical?"  Still, I think it ought to have been (should be) attempted.

Michael S. Kaplan on 12 Dec 2011 9:49 AM:

Well, the situation is made more complex for other reasons -- some technical, some political. Probably it would make for a nice future blog!

