Thinking beyond the BMP of Unicode

by Michael S. Kaplan, published on 2005/05/12 02:01 -04:00, original URI:

A few days ago, CornedBee left a comment to my post Raymond's Chinese dictionary:

> by some reports over 60,000 of the over 70,000 ideographs in Unicode/GB18030

Suddenly the 65000 characters in a Windows WCHAR or a VC++ wchar_t seem so little ... when will it be expanded to 32 bits? (Or is that HCHAR - for huge character?)

When thinking beyond the BMP (Basic Multilingual Plane) of Unicode, I think Dr. International said it best back in the column put up in August 2000, entitled The real deal about surrogates.

Of course this was published before there were even officially assigned supplementary characters, though -- see the August 2002 post entitled Windows XP and Unicode Surrogate Code Points, CJK Extensions A and B for the update.

(Although note that the Doctor's first column officially disagrees with my UCS-2 vs. UTF-16 (not quite Kramer vs. Kramer) post, though since both are people trying (as I described in my post) to put words to something that is really outside of anything related to the standard, it is not so much a contradiction as spinning the problem two different ways. I was able to get supplementary character script working with an MSKLC keyboard on NT 4.0 if I installed the fonts -- does that mean I can say NT 4.0 supports UTF-16? <grin>)

In any case, the Doctor's original post deals with the specific question about whether new data types (and presumably new APIs) will be needed:

> There are many people who are wondering why a move to UCS-4/UTF-32 (which uses 32 bits per character) is not being considered. The most common reason for people to make this suggestion is their concern that attempts to handle these extra characters via surrogates seems so much like the methods that DBCS required to support a mix of 8-bit and 16-bit characters. The need for functions such as IsDBCSLeadByte, etc. makes DBCS string handling very difficult, as many people can attest to. Surrogates, however, are always made up of 16-bit values, and both high and low surrogates are in specific ranges. This makes surrogates much easier to detect and it makes string handling routines much easier to implement. Given the heavy investment that the Windows platform and COM have in UCS-2/UTF-16, moving to UCS-4/UTF-32 would require a rearchitecture of Windows that while not as large as the move from Win16 to Win32, could certainly provide pain to programmers on the same order of magnitude. In short:
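To make the "specific ranges" point concrete, here is a minimal sketch in portable C of what surrogate detection might look like. The helper names (is_high_surrogate, is_low_surrogate, combine_surrogates, count_code_points) are made up for illustration and are not Windows APIs; the only facts assumed are the standard UTF-16 ranges, U+D800..U+DBFF for high surrogates and U+DC00..U+DFFF for low surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates occupy U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trail) surrogates occupy U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a well-formed surrogate pair into one supplementary code point. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Count code points in a buffer of len UTF-16 code units.
   A well-formed pair counts once; an unpaired surrogate counts as one. */
static size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1]))
            i++; /* skip the trail half of the pair */
        n++;
    }
    return n;
}
```

Unlike a DBCS lead byte, whose meaning depends on the code page, each 16-bit value here can be classified by looking at that value alone, since the two ranges overlap neither each other nor any ordinary character.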

Now note that moving the platform to UTF-32 is not the same as conversion. Conversion can be crucial both for the sake of standards that use these other encoding forms and for interoperability with other platforms with different defaults. The work to convert to and from both UTF-8 and UTF-32 is potentially interesting; the former has been around since NT 4.0 SP3, and the latter is being released with the Whidbey (2.0) version of the .NET Framework (and can be seen right now in the Beta 2 release).
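As a rough illustration of how small the conversion problem is compared to a platform-wide change, here is a sketch of a UTF-16 to UTF-32 converter in portable C. The function name utf16_to_utf32 is hypothetical, and the choice to replace unpaired surrogates with U+FFFD is an assumption of this sketch; it does not describe how the Windows or .NET converters behave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert len UTF-16 code units in src to UTF-32 code points in dst.
   dst must have room for len elements (the worst case has no pairs).
   Unpaired surrogates are replaced with U+FFFD (a choice made for this
   sketch). Returns the number of code points written. */
static size_t utf16_to_utf32(const uint16_t *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t u = src[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* Well-formed pair: consume both code units. */
            uint32_t lo = src[++i];
            dst[out++] = 0x10000u + ((u - 0xD800u) << 10) + (lo - 0xDC00u);
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            /* Lone high or low surrogate. */
            dst[out++] = 0xFFFDu;
        } else {
            /* Ordinary BMP character: the value carries over unchanged. */
            dst[out++] = u;
        }
    }
    return out;
}
```

The point is that conversion can live at the boundary where data enters or leaves, while the strings in between stay UTF-16.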

Such a move is also not the same as targeted features that use UTF-32, such as the WM_UNICHAR notification that Murray Sargent has often spoken about at Unicode conferences. The main difference here is that these items are character-based rather than string-based.

But nobody wants to take all of the string APIs, long plagued with needing "A" and "W" versions, and add yet another "H" version to them (to use CornedBee's nomenclature). There are enough developers in Windows to probably take us if we tried to claim that was a requirement, so we're probably lucky that it is not!


This post brought to you by "C" (U+0043, a.k.a. LATIN CAPITAL LETTER C)

# James Todd on 12 May 2005 9:59 AM:

I just wanted to throw out this relevant MSDN link about surrogate pairs.

Thanks for the information, Michael!


# Michael S. Kaplan on 12 May 2005 12:06 PM:

Hey James,

Yep, that is an interesting link, though I need to add a clarification to it about both Windows 2000 and XP....

Basically, anything that turns on Uniscribe will turn on the surrogate pair detection stuff.

In XP, that means either the Complex Script support or the East Asian language support will turn this on.

For Windows 2000, any of the "languages your system supports" that fall under one of these two categories will give the same result -- this means Japanese, Korean, Traditional Chinese, Simplified Chinese, Thai, Vietnamese, Hebrew, Arabic, or Indic.

It all boils down to language groups, interestingly enough. (cf: ).

# Qflash on 15 May 2005 2:38 AM:


referenced by

2008/09/08 UCS-2 to UTF-16, Part 1: Getting the obvious out of the way
