What do they mean when they say 'GB18030 Characters' ?

by Michael S. Kaplan, published on 2007/02/28 04:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/28/1772891.aspx


(This could probably get turned into a series with various terms....)

A very common question that comes up in internationalization circles is some variation on:

How can I get GB18030 characters to test in my application?

As questions go, it sounds like they are thinking more about repertoire than about text actually encoded as GB18030. Like their own little repertoire fence!

So the easy answer is to point out that GB18030 is a PRC standard that is completely tied to Unicode, and thus any character that is a "Unicode character" is also a "GB18030 character". As fences go, it is not very limiting, after all....

That is also almost certainly not the answer they are looking for, since they weren't really asking the question in a very good way.
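To see how un-limiting that fence is, here is a minimal sketch using Python's built-in `gb18030` codec. The sample characters and the helper name `gb18030_round_trip` are my own picks for illustration, and Python's codec may not match the latest edition of the standard for a handful of code points, so treat this as a sanity check rather than a compliance test.

```python
# Hypothetical sample set: one character from each "kind" discussed in this post.
SAMPLES = {
    "ASCII letter": "A",
    "common Han ideograph": "\u4e2d",
    "Extension B ideograph": "\U00020000",
    "Phags-pa letter KA": "\ua840",
}


def gb18030_round_trip(text):
    """Encode text as GB18030 and report whether decoding restores it."""
    encoded = text.encode("gb18030")
    return encoded, encoded.decode("gb18030") == text


for label, ch in SAMPLES.items():
    encoded, ok = gb18030_round_trip(ch)
    print(f"{label}: U+{ord(ch):04X} -> {encoded.hex()} "
          f"({len(encoded)} bytes), round trip ok: {ok}")
```

Every one of these survives the round trip; the only thing that changes from character to character is how many bytes (one, two, or four) the GB18030 form takes.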

Let's take a step back and try and figure out what they might really be trying to ask.

Now it is all well and good to think that China cares tremendously about every single code point in Unicode from U+0001 to U+10ffff, but in practice we know that is not true. We know that they have priorities, just like we all do.

There are actually two meanings that are most commonly intended, and every single one of the 81 emails I have seen over the last year with the phrase "GB18030 characters" referred to one or both of these definitions.

Definition #1: "GB18030 Characters" is an alias for Unicode Extension B Characters. (Warning -- that link is to a 13 MB PDF!)

This definition is from people who are looking to make sure that their application supports supplementary characters, like these ones in the Supplementary Ideographic Plane (more info on this name here). These characters are important to China, even though many of them are not actually used in China other than in very rare contexts, if at all.

This is not a foolish question, you know. I mean, this is a set of characters that Windows 2000, Microsoft Jet, SQL Server <= 2000, and lots of other products don't support. So if someone is asking about how to get characters to test in their application then this is a good thing. :-)
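If Definition #1 is what someone means, test data is easy to generate. The sketch below assumes the Unicode block boundaries for CJK Unified Ideographs Extension B (U+20000 through U+2A6DF); the function names and the sampling step are hypothetical choices of mine. It also shows why these characters trip up older products: each one needs two UTF-16 code units (a surrogate pair).

```python
EXT_B_FIRST = 0x20000  # start of the CJK Unified Ideographs Extension B block
EXT_B_LAST = 0x2A6DF   # end of the block


def extension_b_samples(count, step=0x1000):
    """Return up to `count` Extension B characters, spaced `step` code points apart."""
    code_points = range(EXT_B_FIRST, EXT_B_LAST + 1, step)
    return [chr(cp) for cp in code_points][:count]


def utf16_code_units(ch):
    """Number of 16-bit code units needed to store `ch` in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


for ch in extension_b_samples(5):
    print(f"U+{ord(ch):05X}: {utf16_code_units(ch)} UTF-16 code units, "
          f"{len(ch.encode('gb18030'))} GB18030 bytes")
```

Paste the resulting strings into the application under test and check that they survive storage, sorting, and length calculations without being split in the middle of a pair.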

Definition #2: "GB18030 Characters" is an alias for Chinese Minority Scripts, e.g. Mongolian, Tibetan, Uighur (which means the Arabic script), Yi, Phags-Pa, New Tai Lue, Tai Le, etc.

This definition is from people who are looking to make sure that their application supports the various scripts used throughout different minority languages in China. Now these characters are also important to China, if for no other reason than they want to be able to show the native speakers of the languages that use them that the scripts are important to China. There are several parts of the world where this phenomenon occurs, and in a future post I will talk about some of the side effects of it....

In any case, this is not a foolish question either, as the majority of these characters, in addition to not being supported in Windows 2000, Microsoft Jet, SQL Server <= 2000 and so on, are also not supported in Windows XP, Server 2003, or even the .NET Framework 1.0 or 1.1 (they are not supported/supportable in the .NET Framework 2.0/3.0 either, unless you are running on Windows Vista).
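For Definition #2, a similar sketch can pull a few characters from each script. The block ranges below are the main Unicode blocks for the scripts listed above (some of these scripts also use characters outside these blocks, which I am ignoring here); the dictionary and function names are hypothetical, and what counts as "assigned" depends on the Unicode version bundled with your Python.

```python
import unicodedata

# Main Unicode block for each script named in Definition #2.
MINORITY_SCRIPT_BLOCKS = {
    "Arabic (used for Uighur)": (0x0600, 0x06FF),
    "Tibetan": (0x0F00, 0x0FFF),
    "Mongolian": (0x1800, 0x18AF),
    "Tai Le": (0x1950, 0x197F),
    "New Tai Lue": (0x1980, 0x19DF),
    "Yi Syllables": (0xA000, 0xA48F),
    "Phags-pa": (0xA840, 0xA87F),
}


def sample_characters(first, last, count=3):
    """Return up to `count` assigned characters from the range first..last."""
    found = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        if unicodedata.category(ch) != "Cn":  # "Cn" means unassigned
            found.append(ch)
            if len(found) == count:
                break
    return found


for script, (first, last) in MINORITY_SCRIPT_BLOCKS.items():
    for ch in sample_characters(first, last):
        print(f"{script}: U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}")
```

Keep in mind that getting these code points into the application is only the first step; whether they shape and render correctly depends on the platform support described above.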

So why do people ask the way that they do?

Well, usually they have GB18030 compliance of their product on their minds, and thus they want to know how to make sure that they are in fact compliant. The easiest way to do that is to try out some of these characters.

Since the question is not foolish (even though it is ill-formed), I always try to find out what they are actually asking, so I can point them the right way....

 

This post brought to you by ꡀ (U+a840, a.k.a. PHAGS-PA LETTER KA)


iamduyu on 28 Feb 2007 6:04 AM:

GB=Guo Biao=national standard

Daniel on 28 Feb 2007 10:36 AM:

The Chinese government has a test suite for GB18030 compatibility. It covers conversion between GB18030 and UTF-8 for some unassigned code points.

Michael S. Kaplan on 28 Feb 2007 7:25 PM:

Yes, but that doesn't help with this definition when people ask the question....

Yuhong Bao on 30 Jul 2010 1:03 PM:

Definition #1 is I think basically GB18030-2000.

Definition #2 is I think basically GB18030-2005.

