Some sorts resist the future

by Michael S. Kaplan, published on 2005/12/07 15:01 -05:00, original URI:

(I am not really picking on Japanese here, as there are many similar issues in other languages; this just happens to be the one I have numbers for in front of me right now)

The sort that has been in Windows for a long time for Japanese is based on the ordering in one of the older JIS standards. You know, something after JIS X 0208. Not entirely in JIS order, but kind of close.

It includes 7,070 Unicode code points.

When you contrast that with JIS X 0213, you get some interesting information.

JIS X 0213 (the latest one) has 13,148 ideographs in it -- 303 from CJK Unified Ideographs Extension B (13MB), and 164 from CJK Unified Ideographs Extension A (1.5MB).

Plus you have 5,472 ideographs that were either in CJK Unified Ideographs (5MB) or CJK Compatibility Ideographs (0.5MB).
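For anyone who wants to see how a breakdown like this could be produced, here is a minimal sketch in Python that buckets characters by the four Unicode blocks named above. The block ranges come from the Unicode code charts; the helper names (`ideograph_block`, `tally_by_block`) and the four-character sample string are made up for illustration and are not the JIS X 0213 repertoire.

```python
# Block boundaries as published in the Unicode code charts.
# These are block ranges, not counts of assigned characters.
BLOCKS = [
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Compatibility Ideographs", 0xF900, 0xFAFF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
]


def ideograph_block(ch: str) -> str:
    """Return the name of the block containing ch, or "other"."""
    cp = ord(ch)
    for name, first, last in BLOCKS:
        if first <= cp <= last:
            return name
    return "other"


def tally_by_block(repertoire) -> dict:
    """Count how many characters of a repertoire fall in each block."""
    counts = {}
    for ch in repertoire:
        block = ideograph_block(ch)
        counts[block] = counts.get(block, 0) + 1
    return counts


# Illustrative sample only: U+65E5, U+3400, U+20089, U+F900.
sample = "\u65e5\u3400\U00020089\uf900"
sample_counts = tally_by_block(sample)
```

Feeding in a real repertoire (say, every character a code page can round-trip) instead of `sample` would give per-block totals of the kind quoted above.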

Now I previously talked about using code pages as repertoire sources for languages, and obviously this huge set of additional ideographs, which nearly doubles the number of ideographs in the sort, might make a good example of a time to do this sort of thing....

Now here comes the problem, though -- those 5,472 ideographs have been in the default table of Windows for many years now (since Windows 2000 or earlier), and those other 467 ideographs have had weights in the default table since Windows XP (I will talk more about Extensions A and B another day).

So if we add these code points to the Japanese table in just about any JIS-based order at all, we'd break anyone who was expecting the sort order never to change. And all because of code points that are accepted enough in Japanese to be in JIS, but that were not yet in our Japanese-specific table.
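To make the breakage concrete, here is a minimal sketch in Python. Everything in it is hypothetical: single letters stand in for ideographs, `OLD_TABLE` and `NEW_TABLE` are toy weight tables, and the rule "characters in the language table sort first, everything else falls back to code point order" is only a stand-in for how a language-specific table might sit on top of a default one. The point is what happens to a sorted index that was persisted under the old weights.

```python
import bisect


def make_key(language_table):
    """Build a sort key: table characters first (by weight), others by code point."""
    def key(text):
        return [
            (0, language_table[ch]) if ch in language_table else (1, ord(ch))
            for ch in text
        ]
    return key


# Hypothetical weights. "b" is absent from the old table, so it sorts
# after everything else; the new table slots it between "p" and "q".
OLD_TABLE = {"p": 1, "q": 2, "r": 3}
NEW_TABLE = {"p": 1, "b": 2, "q": 3, "r": 4}

old_key = make_key(OLD_TABLE)
new_key = make_key(NEW_TABLE)

words = ["q", "b", "r", "p"]
index = sorted(words, key=old_key)  # persisted under the old weights


def contains(sorted_index, word, key):
    """Binary search; only correct if sorted_index is ordered by this key."""
    keys = [key(w) for w in sorted_index]
    pos = bisect.bisect_left(keys, key(word))
    return pos < len(keys) and keys[pos] == key(word)


found_with_old = contains(index, "b", old_key)
found_with_new = contains(index, "b", new_key)
```

Here `found_with_old` is true, but `found_with_new` is false even though "b" is sitting right there in the index: the stored order no longer matches the comparison being used to search it. That is the kind of quiet failure a caller who assumes the order never changes would hit.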

Of course we do have a versioning mechanism, so there is a way for us to tell any smart callers to expect there to be a change.

(I talked about that mechanism in posts like Collation data -- must be stable, but it must not stand still.)
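What a "smart caller" might look like, as a rough sketch in Python: store the collation version next to anything that depends on sort order, and rebuild when the version moves. The `VersionedIndex` class, the helper names, and the integer version numbers are all assumptions made for illustration; on Windows the version would come from something like GetNLSVersion rather than being passed in by hand, and the two key functions are just stand-ins for "old data" and "new data".

```python
from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    """A sorted list that remembers which collation version produced it."""
    collation_version: int
    entries: list = field(default_factory=list)


def build_index(words, key, collation_version):
    return VersionedIndex(collation_version, sorted(words, key=key))


def open_index(stored, words, key, current_version):
    """Reuse the stored index only if the collation version still matches."""
    if stored.collation_version != current_version:
        # Sort data changed since the index was written: rebuild it.
        return build_index(words, key, current_version)
    return stored


# Stand-in orderings: version 1 sorts by code point, version 2 ignores case.
key_v1 = lambda s: s
key_v2 = str.casefold

words = ["b", "C", "a"]
stored = build_index(words, key_v1, collation_version=1)

same = open_index(stored, words, key_v1, current_version=1)
rebuilt = open_index(stored, words, key_v2, current_version=2)
```

The check is cheap, but the rebuild may not be, which is part of why not every caller bothers.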

On the other hand, we cannot assume that every single caller will be smart. In fact, we can assume that there will be a not insignificant percentage of callers that will be somewhat unsmart, and there could even be a few plusunsmart and doubleplusunsmart callers out there, too.

Add to the conundrum the question of how useful a sort that is kind of in JIS order, but not exactly, really is even today.... and then there is the matter of trying to extend it.

Quite a fine pickle, huh? We can either be compatible for the sake of those callers, or we can be meaningful for people who would like appropriate language support.

Compatible vs. meaningful? Yuck. No matter what we do, we'd be broken in somebody's eyes....

(hardly a new position for Microsoft, obviously)

It does make for an interesting problem, in any case. One that despite its resistance to a solution must indeed be solved!

More on this in the future, as things unfold further. You can consider this post to be a teaser for the description of a solution.... :-)


This post brought to you by "𠂉" (U+20089, an Extension B ideograph, one of the 303 referred to earlier)

# Mihai on 7 Dec 2005 4:24 PM:

"plusunsmart and doubleplusunsmart"

This kind of construct tells us that Orwell did not know C++ :-)

I think this should be "minusunsmart and doubleminusunsmart", or --unsmart (or maybe even --(!smart) :-)

Anyway, I would say: "let the --unsmart take a ++hit". But then, I don't have to deal with the complaints MS will get :-)

# Michael S. Kaplan on 7 Dec 2005 5:13 PM:

Hey I am being consistent with words he did like doubleplusungood. Besides, minusunsmart violates the principles of newspeak by introducing a word when the antonym (plus) exists....

# Michael S. Kaplan on 7 Dec 2005 5:36 PM:

For the rest, there are very good reasons to not go ahead with the plan to let them take the hit without careful consideration....

# Mihai on 8 Dec 2005 12:03 PM:

I have tried to sprinkle all this with smilies, but it does not help.
This is why online sucks! :-)

# Michael S. Kaplan on 8 Dec 2005 1:09 PM:

I know you were kidding, I did see the smiles.

But playing it straight is the essence of comedy! :-)

referenced by

2006/01/03 'Acceptable' Japanese sort order?
