by Michael S. Kaplan, published on 2005/12/07 15:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/07/500869.aspx
(I am not really picking on Japanese here, as there are many similar issues in other languages; this just happens to be one I have numbers for in front of me at this particular moment in time)
The Japanese sort that has been in Windows for a long time is based on the ordering in one of the older JIS standards. You know, something after JIS X 0208. Not entirely in JIS order, but kind of close.
It includes 7,070 Unicode code points.
When you contrast that with JIS X 0213, you get some interesting information.
JIS X 0213 (the latest one) has 13,148 ideographs in it -- 303 from CJK Unified Ideographs Extension B, and 164 from CJK Unified Ideographs Extension A.
Plus you have 5,472 ideographs that were either in CJK Unified Ideographs or CJK Compatibility Ideographs.
Now I previously talked about using code pages as repertoire sources for languages, and obviously this huge set of additional ideographs that nearly doubles the number of ideographs in the sort might make a good example of a time to do this sort of thing....
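To put "nearly doubles" into numbers, here is a quick tally that takes the figures quoted in this post at face value. The variable names are made up for the illustration, and nothing here is measured against any actual table.

```python
# Counts quoted in the post; none of them are verified independently here.
IN_JAPANESE_TABLE = 7_070     # code points already in the Japanese sort
FROM_UNIFIED_OR_COMPAT = 5_472
FROM_EXTENSION_A = 164
FROM_EXTENSION_B = 303

# Ideographs that are in the newer JIS repertoire but not in the Japanese table.
additional = FROM_UNIFIED_OR_COMPAT + FROM_EXTENSION_A + FROM_EXTENSION_B

# How much bigger the Japanese table would get if all of them were added.
growth = (IN_JAPANESE_TABLE + additional) / IN_JAPANESE_TABLE

print(additional)        # 5939
print(round(growth, 2))  # 1.84
```

So the table would grow to roughly 1.84 times its current size, which is close enough to "nearly doubles" for a blog post.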
Now here comes the problem, though -- those 5,472 ideographs have been in the default table of Windows for many years now (Win2000 and earlier), and those other 467 ideographs have had weight in the default table since Windows XP (I will talk more about Extensions A and B another day).
So if we add these code points to the Japanese table in just about any order at all based on JIS, we'd break anyone who was expecting there to be no changes in sort order ever, just on the basis of code points that are accepted enough in Japanese to be in JIS, but were not yet in our Japanese-specific table previously.
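A toy sketch of the breakage, under loud assumptions: the "default table" here is just code point order, the "Japanese table" is a short list of characters that sort ahead of everything else, and which characters are in that list is invented for the example. None of this reflects the real Windows tables; it only shows how moving one character from the default table into the language-specific one can flip the relative order of strings that were already sorted.

```python
def make_sort_key(language_table):
    """Build a key function for a hypothetical two-tier sort.

    Characters listed in language_table sort first, in the listed order.
    Everything else falls back to code point order (a stand-in for the
    default table, NOT how the real default table is ordered).
    """
    rank = {ch: i for i, ch in enumerate(language_table)}

    def sort_key(text):
        return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in text]

    return sort_key


# Table membership below is invented for illustration only.
old_key = make_sort_key("七")    # before: only one character is tailored
new_key = make_sort_key("七丁")  # after: one more character is added

words = ["丁", "一", "七"]
stored_index = sorted(words, key=old_key)  # what a caller persisted earlier
resorted = sorted(words, key=new_key)      # what the new table produces

print(stored_index)              # ['七', '一', '丁']
print(resorted)                  # ['七', '丁', '一']
print(stored_index == resorted)  # False
```

Any caller that persisted `stored_index` and keeps searching it with the new comparison rules is now working against data that is no longer in sorted order.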
Of course we do have a versioning mechanism, so there is a way for us to tell any smart callers to expect there to be a change.
(I talked about that mechanism in posts like Collation data -- must be stable, but it must not stand still.)
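What a "smart caller" does with such a mechanism might look something like the sketch below. It is only a sketch: `StoredIndex`, `build_index`, and `load_index` are hypothetical names, the version is a plain integer, and the key functions are stand-ins for two versions of a sort. On Windows the version would come from the platform (for example via an API such as `GetNLSVersion`), which this example does not call.

```python
from dataclasses import dataclass


@dataclass
class StoredIndex:
    collation_version: int  # version of the sort the entries were built under
    entries: list


def build_index(words, sort_key, collation_version):
    """Sort the words and stamp the result with the collation version."""
    return StoredIndex(collation_version, sorted(words, key=sort_key))


def load_index(stored, words, sort_key, current_version):
    """Reuse the stored index only if it was built under the current version."""
    if stored.collation_version == current_version:
        return stored
    return build_index(words, sort_key, current_version)


words = ["b", "a", "c"]
v1_key = lambda s: [ord(c) for c in s]   # stand-in for the old sort
v2_key = lambda s: [-ord(c) for c in s]  # stand-in for a changed sort

index = build_index(words, v1_key, collation_version=1)
same = load_index(index, words, v1_key, current_version=1)
rebuilt = load_index(index, words, v2_key, current_version=2)

print(same is index)    # True
print(rebuilt.entries)  # ['c', 'b', 'a']
```

The unsmart caller is the one who never stored the version in the first place, so there is nothing to compare against when the data changes.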
On the other hand, we cannot assume that every single caller will be smart. In fact, we can assume that there will be a not insignificant percentage of callers that will be somewhat unsmart, and there could even be a few plusunsmart and doubleplusunsmart callers out there, too.
Add to the conundrum the question of how useful a sort really is when, even today, it is kind of in JIS order but not exactly.... and then there is the matter of trying to extend it.
Quite a fine pickle, huh? We can either be compatible for the sake of those unsmart callers, or we can be meaningful for people who would like appropriate language support.
Compatible vs. meaningful? Yuck. No matter what we do, we'd be broken in somebody's eyes....
(hardly a new position for Microsoft, obviously)
It does make for an interesting problem, in any case. One that despite its resistance to a solution must indeed be solved!
More on this in the future, as things unfold further. You can consider this post to be a teaser for the description of a solution.... :-)
This post brought to you by "𠂉" (U+20089, an Extension B ideograph, one of the 303 referred to earlier)
referenced by
2006/01/03 'Acceptable' Japanese sort order?