The trouble with HKSCS, for me

by Michael S. Kaplan, published on 2007/06/03 18:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/03/3069028.aspx


There is no real, official standard listing the characters used in Hong Kong.

Unfortunately.

I say unfortunately because of the whole repertoire fences thing I mentioned previously in relation to the Hong Kong Supplemental Character Set, a.k.a. HKSCS.

But here is where we run into problems.

Because HKSCS is not the full list of characters, not by a long shot.

In its latest incarnation (HKSCS 2004) it is 4941 characters.

But that is not the full list -- because it is basically a standard built atop Big5, an industry-created version of CNS 11643, a Taiwan standard.

(Conventional wisdom is that it was named after "the five big companies" in Taiwan in this part of the industry agreeing on a de facto standard, yet no one can name any of the companies, which in my mind casts doubt on the veracity of the claim!)

HKSCS replaces some of the reference glyphs with some forms preferred in Hong Kong, and no one ever claimed that all of the rest of Big5 was used in Hong Kong.

Which is actually just as well, since which form of Big5 to use as the base is itself a big question -- because there are MANY of them.
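To make the "many Big5s" point concrete, here is a small sketch using the three Big5-family codecs that ship with Python's standard library (`big5`, `big5hkscs`, and `cp950`). The helper name `encode_all` is made up for illustration, and this says nothing about how Windows itself handles these code pages; it only shows how to check whether the variants agree on a given character.

```python
# Three Big5-family codecs available in Python's standard library.
BIG5_FAMILY = ("big5", "big5hkscs", "cp950")

def encode_all(ch):
    """Encode one character with each codec; None means 'not encodable'."""
    results = {}
    for name in BIG5_FAMILY:
        try:
            results[name] = ch.encode(name)
        except UnicodeEncodeError:
            results[name] = None
    return results

# A common ideograph plus the two code points from the bug report below.
for ch in ("中", "\u36c8", "\U000216c1"):
    print(f"U+{ord(ch):04X}", encode_all(ch))
```

Any character where one entry is `None` and another is not is a character on which the variants disagree about the repertoire.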

Now people who work on fonts have a way to go here -- if there is a different glyph they know what to do. Though you could have filled a mailbox with the communication deciding how best to support the alternate forms in fonts!

But for collation, and even for code pages, things are much less clear.

I knew someone who was once joking about someone being the "least distinguished of the Distinguished Engineers" (and no, I will not talk about either who said it or who was being talked about, thank you very much!). It is in this vein that I tend to think of HKSCS as one of the least standard of the encoding standards -- it is a poorly defined (across versions) set atop an uncertain and ill-defined base, sometimes acting as a glyph standard (in suggesting alternate preferred forms for glyphs) and other times as an encoding standard. I honestly don't know what can be expected of it.

As a "repertoire fence", it clearly has a bunch of holes in it.

But when people come and make suggestions/bug reports in Vista such as:

Some HKSCS 2004 characters , for example listed in Unicode codepoint below:

u+216C1
u+36C8

are sorted by a wrong order according to the stroke-count.

I agree with them -- those code points (U+216c1 in Extension B CJK and U+36c8 in Extension A CJK), as well as several others like them, are not in the stroke count tables for Traditional Chinese in Vista, since those tables are based on Taiwan's CNS 11643 and have 54,450 ideographs in them (and an unspecified number of other characters in Unicode that need no special ordering).

So how is one supposed to define where they go when the ordering is based on stroke count followed by position within CNS 11643, for ideographs that are not included in CNS 11643?
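A minimal sketch of that ordering rule, and of where it breaks down. Everything here is illustrative: the two tiny tables, the ordinal "positions" standing in for CNS 11643 order, and the helper names are all assumptions, not the actual Vista data or code.

```python
# Illustrative data only: the real tables cover tens of thousands of ideographs.
STROKES = {"一": 1, "二": 2, "人": 2, "三": 3}
# Hypothetical ordinal positions standing in for order within CNS 11643.
CNS_POSITION = {"一": 0, "人": 1, "二": 2, "三": 3}

def sort_key(ch):
    """Return (stroke count, CNS position), or None if there is no table entry."""
    if ch in STROKES and ch in CNS_POSITION:
        return (STROKES[ch], CNS_POSITION[ch])
    return None

def partition(chars):
    """Sort the characters the tables know; list the rest separately."""
    placed = sorted((c for c in chars if sort_key(c) is not None), key=sort_key)
    unplaced = [c for c in chars if sort_key(c) is None]
    return placed, unplaced

placed, unplaced = partition(["三", "\u36c8", "二", "\U000216c1", "人", "一"])
print(placed)
print([f"U+{ord(c):04X}" for c in unplaced])
```

The two-stroke characters tie on stroke count and are separated only by the second element of the key. The two code points from the bug report have no key at all, so the rule says nothing about where they belong.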

Which is not to say I would want or expect a standard in Taiwan to add characters to their standard that are not used in country. This is the point I find most annoying about the sorting data from the GB standards coming from PRC -- the fact that they order everything, including code points they do not use or recognize in any real sense as useful to people in country.

If those two code points were added based on my own belief about the stroke count, or the PRC count (based on Simplified Chinese), or even the significantly more relevant Hong Kong stroke count data, what happens in the next version if CNS 11643 adds ideographs that sit in the positions now set aside for the Hong Kong characters? Either the Hong Kong characters have to be moved, or the sort in Taiwan ends up a bit off.

Ick.

What is truly needed is the repertoire fence for the Hong Kong data. That, an approved source for the stroke count data, and an official way to "break ties" for characters with equal stroke counts are what is needed to support a sort for Hong Kong.

And HKSCS is not providing that, at the moment. At least not in a consumable form....

 

This post brought to you by 𡛁 (U+216c1, a Unicode CJK Extension B ideograph)


# Peter Karlsson on 4 Jun 2007 3:25 AM:

The biggest problem I see with Big5-HKSCS is that it seems to be expected to replace Big5 whenever used, so that you can either use Big5-HKSCS on your system, or “regular” Big5, but never both. At least it seems that if you install the HKSCS updates on Windows, it will replace the regular version.

Both forms are registered with IANA, as distinct encodings. This causes problems on several occasions, such as in http://my.opera.com/community/forums/findpost.pl?id=2049828 where the user expects to see HKSCS but sees regular Big5. Do you have any thoughts on this issue?

# SDiZ on 4 Jun 2007 4:57 AM:

The Big5s are "宏碁", "神通", "佳佳", "零壹" and "大眾". AFAIK, only "宏碁" and "神通" are still alive -- haven't heard any news from other three since mid-1990s.

Most HKSCS characters were not created in Hong Kong. Most of them are traditional Chinese characters used in the 1800s.

And the "break-tie" rules for stroke count. okay... Did anyone tell you Chinese dictionaries are never, ever sorted by stroke count? (nor were they sorted phonetically until recently) They are sorted by radicals (see http://en.wikipedia.org/wiki/List_of_Kangxi_radicals).

# Michael S. Kaplan on 4 Jun 2007 7:43 AM:

Peter, the Big-5 change with HKSCS refers to the pre-Vista version -- I am referring to Vista and beyond....

