The basics of supplementary

by Michael S. Kaplan, published on 2005/09/24 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/09/24/472543.aspx


I thought I would explain a bit more about how surrogates work in Unicode, since it does not seem very well described in a whole lot of places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):

Ok, it is all as clear as mud now, right? :-)

The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.

(Other questions, such as why high surrogates have lower numbers than low surrogates, are covered in other posts.)

Let's see if we can't do something about that....

(Warning: some MATH content ahead!)

We start with the Basic Multilingual Plane -- it is the code points from U+0000 to U+FFFF. Some of these code points are assigned, and a large subset of those are assigned characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 10000 hex or just 2^16 code points. Whatever you find easiest, conceptually.

Now what happens with those high surrogate code points is that the block of 1024 of them is divided into 16 blocks of 64 each. And each one of those blocks is used for a plane:
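A minimal sketch of that division in C, assuming the high surrogates run from U+D800 to U+DBFF and that the first block of 64 belongs to Plane 1, the next to Plane 2, and so on. The helper names are hypothetical and not taken from any SDK:

```c
#include <stdint.h>

/* Hypothetical helper: returns the plane (1..16) that a high surrogate
   selects, or 0 if the value is not a high surrogate at all. */
static int plane_of_high_surrogate(uint16_t high)
{
    if (high < 0xD800 || high > 0xDBFF)
        return 0;
    /* 1024 high surrogates / 16 planes = 64 high surrogates per plane */
    return (high - 0xD800) / 64 + 1;
}

/* Hypothetical helper: first high surrogate of the block of 64 that is
   used for the given plane (1..16), or 0 for any other plane number. */
static uint16_t first_high_surrogate_of_plane(int plane)
{
    if (plane < 1 || plane > 16)
        return 0;
    return (uint16_t)(0xD800 + (plane - 1) * 64);
}
```

Under these assumptions U+D800 through U+D83F map to Plane 1, U+D840 starts Plane 2, and U+DBC0 through U+DBFF cover Plane 16.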

By convention, U+[##]FFFE and U+[##]FFFF of each plane are set aside and reserved, never to be assigned. This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!

Now the way things are numbered, each high surrogate is used, serially, combining with every possible one of the 1024 low surrogates before moving on to the next high surrogate. Thus for supplementary characters you see the following order:

(I skipped some spaces in there for obvious reasons!)
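The ordering described above can be sketched in C as follows. This is only an illustration with hypothetical function names: it assumes valid input and does no checking, and it steps through the pairs the way the text describes, exhausting the low surrogates before moving to the next high surrogate:

```c
#include <stdint.h>

/* Hypothetical helper: code point for a high/low surrogate pair.
   Assumes the pair is valid; no range checking is done here. */
static uint32_t pair_to_code_point(uint16_t high, uint16_t low)
{
    return 0x10000u
         + ((uint32_t)(high - 0xD800) << 10)
         + (uint32_t)(low - 0xDC00);
}

/* Hypothetical helper: advances a pair to the next pair in code point
   order. Returns 1 on success, 0 once U+DBFF U+DFFF has been reached. */
static int next_pair(uint16_t *high, uint16_t *low)
{
    if (*low < 0xDFFF) {        /* still low surrogates left for this high */
        (*low)++;
        return 1;
    }
    if (*high < 0xDBFF) {       /* roll over to the next high surrogate */
        (*high)++;
        *low = 0xDC00;
        return 1;
    }
    return 0;                   /* already at the last pair */
}
```

With these helpers, U+D800 U+DC00 comes out as U+10000, U+D800 U+DFFF as U+103FF, the following pair U+D801 U+DC00 as U+10400, and the last pair U+DBFF U+DFFF as U+10FFFF.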

This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).

When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):

So when you combine the BMP's 2^16 code points with the 16 planes of 64 * 1024 (which is also 2^16 code points!), you get 17 * 2^16 or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!
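For anyone who wants to check that arithmetic, here is a small C sketch. The constant and function names are made up for this illustration:

```c
#include <stdint.h>

enum {
    CODE_POINTS_PER_PLANE     = 1 << 16, /* 65,536 in the BMP and every other plane */
    HIGH_SURROGATES_PER_PLANE = 64,      /* 1024 high surrogates / 16 planes */
    LOW_SURROGATES            = 1024,
    SUPPLEMENTARY_PLANES      = 16
};

/* Hypothetical helper: the BMP plus 16 supplementary planes, each of which
   is reached through 64 high surrogates times 1024 low surrogates. */
static uint32_t total_code_points(void)
{
    uint32_t per_supplementary_plane =
        (uint32_t)HIGH_SURROGATES_PER_PLANE * LOW_SURROGATES; /* also 65,536 */
    return (uint32_t)CODE_POINTS_PER_PLANE
         + (uint32_t)SUPPLEMENTARY_PLANES * per_supplementary_plane;
}
```

The result is 1,114,112, which is 0x110000, one past the last code point U+10FFFF.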

Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.

And for a more reality-based set of links, if you look ahead to Windows Vista three macros have been added to the winnls.h that comes with the Vista SDK:

I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)
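A comment further down mentions IS_HIGH_SURROGATE, IS_LOW_SURROGATE and IS_SURROGATE_PAIR by name. As a rough, portable sketch of what checks like these amount to (these are hypothetical lowercase functions written for illustration, not the SDK's own definitions):

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not copied from winnls.h. */

/* Nonzero if the UTF-16 code unit is in the high surrogate range. */
static int is_high_surrogate(uint16_t wch)
{
    return wch >= 0xD800 && wch <= 0xDBFF;
}

/* Nonzero if the UTF-16 code unit is in the low surrogate range. */
static int is_low_surrogate(uint16_t wch)
{
    return wch >= 0xDC00 && wch <= 0xDFFF;
}

/* Nonzero if the two code units, in this order, form a surrogate pair. */
static int is_surrogate_pair(uint16_t hs, uint16_t ls)
{
    return is_high_surrogate(hs) && is_low_surrogate(ls);
}
```

Note that order matters: a low surrogate followed by a high surrogate is not a pair.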

(On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)

Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....

 

This post brought to you by all of the supplementary planes of Unicode


# Gabe on Saturday, September 24, 2005 7:56 AM:

OK, so now I'm only a little confused.

Why can't I use the term 'Surrogate Character' to refer to a character which is encoded as a surrogate pair?

Why didn't they use the high range of available surrogate code points for the high-surrogates and the low range for the low-surrogates? Are they intentionally trying to confuse us?

Why did they have to use the term 'Basic Multilingual Plane' (giving us the ambiguous BMP acronym) instead of perhaps General Multilingual Plane or Basic Polylingual Plane?

# Michael S. Kaplan on Saturday, September 24, 2005 9:28 AM:

Hi Gabe --

For the first question, see http://blogs.msdn.com/michkap/archive/2005/07/27/444101.aspx

For the second, see http://blogs.msdn.com/michkap/archive/2005/07/31/445850.aspx

For the third, it is not ambiguous to Unicode people. :-)

# Simon Montagu on Sunday, September 25, 2005 3:27 AM:

Thank you for casting so much light on a murky area. This post is so good it seems churlish to go pointing out typos, but the last 3 lines of the table of supplementary characters should be:

# U+dbff U+dffd -> U+10fffd
# U+dbff U+dffe -> U+10fffe
# U+dbff U+dfff -> U+10ffff

# Michael S. Kaplan on Sunday, September 25, 2005 3:30 AM:

Not at all churlish, Simon -- and now fixed.... :-)

# Maurits on Monday, September 26, 2005 1:29 PM:

Stupid question time...

1) OK, so this whole "surrogate code point" thing is UTF-16's way of encoding supplementary codepoints > U+FFFF? And this is one of the "private use" ranges, so there's no way to know the desired character properties of code points in this range?

2) IS_SURROGATE_PAIR(wcH, wcL) == IS_HIGH_SURROGATE(wcH) && IS_LOW_SURROGATE(wcL)

3) Why did Microsoft standardize on UTF-16 for the .NET framework? Wouldn't it be more space-efficient to standardize on UTF-8 for Western European locales, and UTF-16 for East Asian locales? Or would that cause interop problems for network communications across locale boundaries? Even given the relative ease of switching between UTF-8 and UTF-16 on the fly?

4) It's kind of strange that 32 bits isn't enough. So UTF-32 really isn't a "flat map" to the Unicode code point system, because of U+10XXXX... Guess we need a UTF-33 encoding? ;)

# Maurits on Monday, September 26, 2005 1:47 PM:

Oh, I see... I was confusing U+10000 with U+100000

There are seventeen planes (0x0 through 0x10) and only 0xf and 0x10 are specifically private-use. 0x0 is the basic plane but there are well-established characters in other planes, for example OLD ITALIC LETTER A:
http://www.fileformat.info/info/unicode/char/010300/index.htm

So considering the CharNext interview question:
The UTF-16 way of encoding this particular character is with a surrogate pair. So, alas:

It is not sufficient to unilaterally skip all surrogate pairs (as this character, among others, would be skipped)

So the two feasible options are:
* to unilaterally return the first element of all surrogate pairs
* to come up with further logic to dictate when to return the first element, and when to skip

"Unilaterally return" is a pretty attractive strategy at this point. :) This would assume that all private-use supplementary characters (in Plane F and Plane... um... "G") are "spacing" characters, which seems a fair assumption.

And naturally, UTF-33 would not be enough... we'd need UTF-32 + 5 bits to handle the seventeen planes, to wit, UTF-37

# Michael S. Kaplan on Monday, September 26, 2005 2:47 PM:

UTF-32 encoded the same info as UTF-16, but in one flat plane.

# Maurits on Monday, September 26, 2005 2:51 PM:

> This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).

Hmmm... like this?

/* Given a surrogate pair, returns the supplementary code point */
#define SUPPLEMENTARY_CODE_POINT(H, L) \
( \
    /* optional checking */ \
    0xd800 <= (H) && (H) < 0xdc00 && \
    0xdc00 <= (L) && (L) < 0xe000 ? \
    /* UTF-16 -> Unicode code point decoding */ \
    /* Note in this case | does not work for +, as the high part can reach bit 16 */ \
    0x10000 + ((H) - 0xd800) * 0x0400 + ((L) - 0xdc00) \
    /* invalid input - TODO: go "boom" - return null for now */ \
    : 0 \
)

/* Given a supplementary code point, returns the "high" surrogate pair element */
#define SURROGATE_PAIR_HIGH(U) \
( \
    /* optional checking */ \
    0x10000 <= (U) && (U) < 0x110000 ? \
    /* Unicode code point -> UTF-16 encoding */ \
    /* The shifted value is below 0x400, so | would work as well as + here */ \
    ((((U) - 0x10000) >> 0xa) + 0xd800) \
    /* invalid input - TODO: go "boom" - return null for now */ \
    : 0 \
)

/* Given a supplementary code point, returns the "low" surrogate pair element */
#define SURROGATE_PAIR_LOW(U) \
( \
    /* optional checking */ \
    0x10000 <= (U) && (U) < 0x110000 ? \
    /* Unicode code point -> UTF-16 encoding */ \
    /* The masked value is below 0x400, so | would work as well as + here */ \
    (((U) - 0x10000) & 0x03ff) + 0xdc00 \
    /* invalid input - TODO: go "boom" - return null for now */ \
    : 0 \
)

/* Given a supplementary code point, returns the high and low surrogate code pair as an unsigned int */
/* Note: U is evaluated more than once, so do not pass an expression with side effects */
#define SURROGATE_PAIR(U) \
( \
    (unsigned int)SURROGATE_PAIR_HIGH(U) << 0x10 | SURROGATE_PAIR_LOW(U) \
)

# Michael S. Kaplan on Monday, September 26, 2005 2:54 PM:

I would not tend to go boom -- easier to just return the original value if the return was not available....
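One possible reading of that suggestion, sketched as C functions rather than macros. The function names are hypothetical, and "the original value" is assumed here to mean the first argument passed in; other interpretations are just as plausible:

```c
#include <stdint.h>

/* Hypothetical: returns the supplementary code point for a valid pair;
   otherwise hands back `high` unchanged instead of signalling an error. */
static uint32_t code_point_or_original(uint16_t high, uint16_t low)
{
    if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF)
        return 0x10000u
             + ((uint32_t)(high - 0xD800) << 10)
             + (uint32_t)(low - 0xDC00);
    return high;
}

/* Hypothetical: high surrogate for a supplementary code point;
   any other input (for example a BMP code point) is returned as-is. */
static uint32_t high_surrogate_or_original(uint32_t cp)
{
    if (cp >= 0x10000u && cp <= 0x10FFFFu)
        return 0xD800u + ((cp - 0x10000u) >> 10);
    return cp;
}

/* Hypothetical: low surrogate for a supplementary code point;
   any other input is returned as-is. */
static uint32_t low_surrogate_or_original(uint32_t cp)
{
    if (cp >= 0x10000u && cp <= 0x10FFFFu)
        return 0xDC00u + ((cp - 0x10000u) & 0x3FFu);
    return cp;
}
```

For example, U+10300 (OLD ITALIC LETTER A, mentioned above) splits into U+D800 U+DF00 and back again, while U+0041 passes through every function untouched.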

# Maurits on Monday, September 26, 2005 3:28 PM:

>UTF-32 encoded the same info as UTF-16, but in one flat plane.

I should really think before I post.

Ah... one plane is 0000-FFFF - sixteen bits
Seventeen planes - need five bits to determine the plane...

That's only 21 bits. So UTF-32 is fine with supplementary planes.

In fact, the first eleven bits of every UTF-32 code point are always zero... so we're only using one 2**12'th of the address space, even with all the reserved planes and whatnot...

Ah, room to breathe :)
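The bit counting above is easy to confirm with a few lines of C. The helper name is hypothetical, and the sketch assumes U+10FFFF is the largest code point:

```c
#include <stdint.h>

/* Hypothetical helper: number of bits needed to represent `value`
   (0 for the value 0). */
static int bits_needed(uint32_t value)
{
    int bits = 0;
    while (value != 0) {
        bits++;
        value >>= 1;
    }
    return bits;
}
```

bits_needed(0x10FFFF) gives 21, which leaves 32 - 21 = 11 leading bits that are always zero in UTF-32.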

# Michael S. Kaplan on Monday, September 26, 2005 5:08 PM:

Hey, no worries. There are some people who do not even think after they post, let alone before. So you are still one up on many of them.
