The basics of supplementary
by Michael S. Kaplan, published on 2005/09/24 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/09/24/472543.aspx
I thought I would explain a bit more about how surrogates work in Unicode, since it does not seem to be described very well in many places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):
- Basic Multilingual Plane. Plane 0, abbreviated as BMP.
- High-Surrogate Code Point. A Unicode code point in the range U+D800 to U+DBFF. (See definition D25 in Section 3.8, Surrogates.)
- High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D25a in Section 3.8, Surrogates.)
- Leading Surrogate. Synonym for high-surrogate code unit.
- Low-Surrogate Code Point. A Unicode code point in the range U+DC00 to U+DFFF. (See definition D26 in Section 3.8, Surrogates.)
- Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D26a in Section 3.8, Surrogates.)
- Plane. A range of 65,536 (10000₁₆) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (10000₁₆). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (10₁₆) is U+100000..U+10FFFF. (Note that ISO/IEC 10646 uses hexadecimal notation for the plane numbers; for example, Plane B instead of Plane 11.) (See Basic Multilingual Plane and supplementary planes.)
- Private Use. Refers to designated code points in the Unicode Standard or other character encoding standards whose interpretations are not specified in those standards and whose use may be determined by private agreement among cooperating users.
- Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D12 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.
- Reserved. Refers to undesignated code points, which are set aside for future standardization. (See Section 2.4, Code Points and Characters.)
- Supplementary Character. A Unicode encoded character having a supplementary code point.
- Supplementary Code Point. A Unicode code point between U+10000 and U+10FFFF.
- Supplementary Ideographic Plane. Plane 2, abbreviated as SIP.
- Supplementary Multilingual Plane. Plane 1, abbreviated as SMP.
- Supplementary Special-purpose Plane. Plane 14, abbreviated as SSP.
- Supplementary Planes. Planes 1 through 16, consisting of the supplementary code points.
- Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term. [I talk about this issue here]
- Surrogate Code Point. A Unicode code point in the range U+D800 through U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.
- Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D27 in Section 3.8, Surrogates.)
- Trailing Surrogate. Synonym for low-surrogate code unit.
- Unassigned. Code points that either are reserved for future use or are never to be used.
- Unassigned Character. Synonym for not assigned to an abstract character. This refers to surrogate code points, noncharacters, and reserved code points. (See Section 2.4, Code Points and Characters.)
- Unassigned Code Point. (See undesignated code point.)
- Undesignated Code Point. Synonym for reserved code point. These code points are reserved for future assignment and have no other designated normative function in the standard. (See Section 2.4, Code Points and Characters.)
- Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D28 in Section 3.9, Unicode Encoding Forms.)
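Several of these definitions are plain range checks, so they can be sketched in C. This is only an illustration of the ranges quoted above; the function names are made up for this sketch and do not come from the glossary or from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Surrogate code point: U+D800..U+DFFF (high and low ranges together). */
static inline bool is_surrogate_code_point(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}

/* Supplementary code point: U+10000..U+10FFFF (planes 1 through 16). */
static inline bool is_supplementary_code_point(uint32_t cp)
{
    return cp >= 0x10000u && cp <= 0x10FFFFu;
}

/* Unicode scalar value: any code point except the surrogates,
   that is 0..D7FF and E000..10FFFF. */
static inline bool is_unicode_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFFu && !is_surrogate_code_point(cp);
}

/* Private-use code point: the three ranges listed in the glossary entry. */
static inline bool is_private_use_code_point(uint32_t cp)
{
    return (cp >= 0xE000u && cp <= 0xF8FFu)
        || (cp >= 0xF0000u && cp <= 0xFFFFDu)
        || (cp >= 0x100000u && cp <= 0x10FFFDu);
}
```

Note that the last two code points of planes 15 and 16 fall outside the private-use ranges, which is why those ranges end in FFFD.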
Ok, it is all as clear as mud now, right? :-)
The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.
(Other questions, like why high surrogates have lower numbers than low surrogates, are covered in other posts.)
Let's see if we can't do something about that....
(Warning: some MATH content ahead!)
We start with the Basic Multilingual Plane -- it is the code points from U+0000 to U+FFFF. Some of these code points are assigned, and a large subset of those are assigned characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 10000₁₆ or just 2^16 code points. Whatever you find easiest, conceptually.
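Because every plane holds exactly 2^16 code points, finding the plane of a code point is a single division (or a 16-bit shift). A minimal sketch, with hypothetical helper names and no validation of the input:

```c
#include <stdint.h>

/* Number of code points in one plane: 0x10000 == 65,536 == 2^16. */
#define PLANE_SIZE 0x10000u

/* Plane number of a code point. Assumes cp <= 0x10FFFF (not checked). */
static inline unsigned plane_of(uint32_t cp)
{
    return (unsigned)(cp / PLANE_SIZE);
}

/* First code point of a plane. Assumes plane is in 0..16 (not checked). */
static inline uint32_t plane_first(unsigned plane)
{
    return (uint32_t)plane * PLANE_SIZE;
}

/* Last code point of a plane. Assumes plane is in 0..16 (not checked). */
static inline uint32_t plane_last(unsigned plane)
{
    return plane_first(plane) + (PLANE_SIZE - 1u);
}
```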
Now what happens with those high surrogate code points is that the block of 1024 of them is divided into 16 blocks of 64 each. And each one of those blocks is used for a plane:
- U+d800 - U+d83f (Plane 1, Supplementary Multilingual Plane)
- U+d840 - U+d87f (Plane 2, Supplementary Ideographic Plane)
- U+d880 - U+d8bf (Plane 3, Reserved)
- U+d8c0 - U+d8ff (Plane 4, Reserved)
- U+d900 - U+d93f (Plane 5, Reserved)
- U+d940 - U+d97f (Plane 6, Reserved)
- U+d980 - U+d9bf (Plane 7, Reserved)
- U+d9c0 - U+d9ff (Plane 8, Reserved)
- U+da00 - U+da3f (Plane 9, Reserved)
- U+da40 - U+da7f (Plane 10, Reserved)
- U+da80 - U+dabf (Plane 11, Reserved)
- U+dac0 - U+daff (Plane 12, Reserved)
- U+db00 - U+db3f (Plane 13, Reserved)
- U+db40 - U+db7f (Plane 14, Supplementary Special-purpose Plane)
- U+db80 - U+dbbf (Plane 15, Supplementary Private Use Area A)
- U+dbc0 - U+dbff (Plane 16, Supplementary Private Use Area B)
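The table above follows a simple rule: each block of 64 high surrogates maps to the next plane, starting at Plane 1. Here is a small sketch of that rule; the helper names are invented for illustration and the inputs are assumed to be in range.

```c
#include <stdint.h>

/* Plane addressed by a high surrogate.
   Assumes 0xD800 <= hi <= 0xDBFF (not checked). */
static inline unsigned plane_of_high_surrogate(uint16_t hi)
{
    /* 64 == 2^6 high surrogates per plane; planes start at 1. */
    return ((unsigned)(hi - 0xD800u) >> 6) + 1u;
}

/* First high surrogate used for a supplementary plane.
   Assumes 1 <= plane <= 16 (not checked). */
static inline uint16_t first_high_surrogate_of_plane(unsigned plane)
{
    return (uint16_t)(0xD800u + ((plane - 1u) << 6));
}
```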
By convention, U+[##]FFFE and U+[##]FFFF of each plane are set aside and reserved, never to be assigned. This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!
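Spotting those two reserved values at the end of each plane takes one mask on the low 16 bits. The sketch below covers only the U+[##]FFFE and U+[##]FFFF values mentioned here (Unicode has other noncharacters that it ignores), and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF. */
static inline bool is_plane_end_noncharacter(uint32_t cp)
{
    /* Masking with 0xFFFE ignores the lowest bit, so both FFFE and FFFF match. */
    return cp <= 0x10FFFFu && (cp & 0xFFFEu) == 0xFFFEu;
}
```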
Now the way things are numbered, each high surrogate is used, serially, combining with every possible one of the 1024 low surrogates before moving on to the next high surrogate. Thus for supplementary characters you see the following order:
- U+d800 U+dc00 -> U+10000
- U+d800 U+dc01 -> U+10001
- U+d800 U+dc02 -> U+10002
- ...
- U+d800 U+dffd -> U+103fd
- U+d800 U+dffe -> U+103fe
- U+d800 U+dfff -> U+103ff
- U+d801 U+dc00 -> U+10400
- U+d801 U+dc01 -> U+10401
- U+d801 U+dc02 -> U+10402
- ...
- U+dbff U+dffd -> U+10fffd
- U+dbff U+dffe -> U+10fffe
- U+dbff U+dfff -> U+10ffff
(I skipped some spaces in there for obvious reasons!)
This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
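To make the conversion concrete, here is one way it might look as small C functions rather than macros. This is a sketch under the assumption that the inputs are already known to be valid (a real high/low pair, or a code point in U+10000..U+10FFFF); the function names are invented for illustration.

```c
#include <stdint.h>

/* Surrogate pair -> supplementary code point.
   Assumes hi is in D800..DBFF and lo is in DC00..DFFF (not checked). */
static inline uint32_t surrogates_to_code_point(uint16_t hi, uint16_t lo)
{
    return 0x10000u
         + (((uint32_t)hi - 0xD800u) << 10)   /* 1024 low surrogates per high */
         + ((uint32_t)lo - 0xDC00u);
}

/* Supplementary code point -> high surrogate.
   Assumes 0x10000 <= cp <= 0x10FFFF (not checked). */
static inline uint16_t high_surrogate_of(uint32_t cp)
{
    return (uint16_t)(0xD800u + ((cp - 0x10000u) >> 10));    /* top 10 bits */
}

/* Supplementary code point -> low surrogate.
   Assumes 0x10000 <= cp <= 0x10FFFF (not checked). */
static inline uint16_t low_surrogate_of(uint32_t cp)
{
    return (uint16_t)(0xDC00u + ((cp - 0x10000u) & 0x3FFu)); /* bottom 10 bits */
}
```

For example, U+1D11E (in the Musical Symbols block) comes out as the pair D834 DD1E, and feeding that pair back in returns U+1D11E.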
When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):
- U+d800 -- contains Aegean Numbers, Linear B Syllabary, Linear B Ideograms, Ancient Greek Numbers, Old Italic, Gothic, Ugaritic, and Old Persian.
- U+d801 -- contains Deseret, Shavian, Osmanya
- U+d802 -- contains Cypriot
- U+d834 -- contains Byzantine Musical Symbols, Musical Symbols
- U+d835 -- contains Math Alphanumerics
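Each high surrogate covers a run of exactly 1024 code points, which is why whole script blocks tend to sit under a single one. A rough sketch of that mapping, using hypothetical helper names and assuming the argument really is a high surrogate:

```c
#include <stdbool.h>
#include <stdint.h>

/* First code point reachable through a given high surrogate.
   Assumes 0xD800 <= hi <= 0xDBFF (not checked). */
static inline uint32_t first_code_point_for(uint16_t hi)
{
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10);
}

/* Last code point reachable through a given high surrogate. */
static inline uint32_t last_code_point_for(uint16_t hi)
{
    return first_code_point_for(hi) + 0x3FFu;
}

/* True if cp is encoded with hi as its leading surrogate. */
static inline bool high_surrogate_covers(uint16_t hi, uint32_t cp)
{
    return cp >= first_code_point_for(hi) && cp <= last_code_point_for(hi);
}
```

So U+d834 spans U+1D000..U+1D3FF and U+d835 starts at U+1D400, matching the list above.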
So when you combine the BMP's 2^16 code points with the 16 planes of 64 * 1024 (which is also 2^16 code points!), you get 17 * 2^16 or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!
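Spelled out step by step, using only the counts already given in this post, the arithmetic is:

```latex
\begin{align*}
\text{code points per supplementary plane} &= 64 \times 1024 = 2^{6} \cdot 2^{10} = 2^{16} = 65{,}536 \\
\text{all surrogate pairs} &= 1024 \times 1024 = 2^{20} = 16 \cdot 2^{16} = 1{,}048{,}576 \\
\text{total code points} &= 2^{16} + 16 \cdot 2^{16} = 17 \cdot 2^{16} = 1{,}114{,}112
\end{align*}
```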
Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.
And for a more reality-based set of links, if you look ahead to Windows Vista, three macros have been added to the winnls.h that comes with the Vista SDK:
- IS_HIGH_SURROGATE
- IS_LOW_SURROGATE
- IS_SURROGATE_PAIR
I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)
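For readers without the SDK at hand, here is a portable sketch of what checks like these can look like, plus one use for them: counting code points in a UTF-16 buffer. The SKETCH_ names and the count_code_points function are invented for this example so that they do not collide with the real winnls.h macros, and nothing here should be taken as the header's actual text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for surrogate range checks (not the winnls.h macros). */
#define SKETCH_IS_HIGH_SURROGATE(u) ((u) >= 0xD800u && (u) <= 0xDBFFu)
#define SKETCH_IS_LOW_SURROGATE(u)  ((u) >= 0xDC00u && (u) <= 0xDFFFu)
#define SKETCH_IS_SURROGATE_PAIR(h, l) \
    (SKETCH_IS_HIGH_SURROGATE(h) && SKETCH_IS_LOW_SURROGATE(l))

/* Count code points in a UTF-16 buffer of len code units.
   A well-formed surrogate pair counts once; an unpaired surrogate
   is counted as one code point on its own. */
static inline size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        if (i + 1 < len && SKETCH_IS_SURROGATE_PAIR(s[i], s[i + 1]))
            i += 2;   /* skip both halves of the pair */
        else
            i += 1;   /* BMP code unit or lone surrogate */
        ++count;
    }
    return count;
}
```

With this sketch, the four code units 0041 D800 DF00 0042 count as three code points, since D800 DF00 is the pair for U+10300.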
(On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)
Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....
This post brought to you by all of the supplementary planes of Unicode
# Gabe on Saturday, September 24, 2005 7:56 AM:
OK, so now I'm only a little confused.
Why can't I use the term 'Surrogate Character' to refer to a character which is encoded as a surrogate pair?
Why didn't they use the high range of available surrogate code points for the high-surrogates and the low range for the low-surrogates? Are they intentionally trying to confuse us?
Why did they have to use the term 'Basic Multilingual Plane' (giving us the ambiguous BMP acronym) instead of perhaps General Multilingual Plane or Basic Polylingual Plane?
# Michael S. Kaplan on Saturday, September 24, 2005 9:28 AM:
# Simon Montagu on Sunday, September 25, 2005 3:27 AM:
Thank you for casting so much light on a murky area. This post is so good it seems churlish to go pointing out typos, but the last 3 lines of the table of supplementary characters should be:
# U+dbff U+dffd -> U+10fffd
# U+dbff U+dffe -> U+10fffe
# U+dbff U+dfff -> U+10ffff
# Michael S. Kaplan on Sunday, September 25, 2005 3:30 AM:
Not at all churlish, Simon -- and now fixed.... :-)
# Maurits on Monday, September 26, 2005 1:29 PM:
Stupid question time...
1) OK, so this whole "surrogate code point" thing is UTF-16's way of encoding supplementary codepoints > U+FFFF? And this is one of the "private use" ranges, so there's no way to know the desired character properties of code points in this range?
2) IS_SURROGATE_PAIR(wcH, wcL) == IS_HIGH_SURROGATE(wcH) && IS_LOW_SURROGATE(wcL)
3) Why did Microsoft standardize on UTF-16 for the .NET framework? Wouldn't it be more space-efficient to standardize on UTF-8 for Western European locales, and UTF-16 for East Asian locales? Or would that cause interop problems for network communications across locale boundaries? Even given the relative ease of switching between UTF-8 and UTF-16 on the fly?
4) It's kind of strange that 32 bits isn't enough. So UTF-32 really isn't a "flat map" to the Unicode code point system, because of U+10XXXX... Guess we need a UTF-33 encoding? ;)
# Maurits on Monday, September 26, 2005 1:47 PM:
Oh, I see... I was confusing U+10000 with U+100000
There are seventeen planes (0x0 through 0x10) and only 0xf and 0x10 are specifically private-use. 0x0 is the basic plane but there are well-established characters in other planes, for example OLD ITALIC LETTER A:
http://www.fileformat.info/info/unicode/char/010300/index.htm
So considering the CharNext interview question:
The UTF-16 way of encoding this particular character is with a surrogate pair. So, alas:
It is not sufficient to unilaterally skip all surrogate pairs (as this character, among others, would be skipped)
So the two feasible options are:
* to unilaterally return the first element of all surrogate pairs
* to come up with further logic to dictate when to return the first element, and when to skip
"Unilaterally return" is a pretty attractive strategy at this point. :) This would assume that all private-use supplementary characters (in Plane F and Plane... um... "G") are "spacing" characters, which seems a fair assumption.
And naturally, UTF-33 would not be enough... we'd need UTF-32 + 5 bits to handle the seventeen planes, to wit, UTF-37
# Michael S. Kaplan on Monday, September 26, 2005 2:47 PM:
UTF-32 encoded the same info as UTF-16, but in one flat plane.
# Maurits on Monday, September 26, 2005 2:51 PM:
> This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
Hmmm... like this?
/* Given a surrogate pair, returns the supplementary code point */
#define SUPPLEMENTARY_CODE_POINT(H, L) \
    ( \
        /* optional checking */ \
        0xd800 <= (H) && (H) < 0xdc00 && \
        0xdc00 <= (L) && (L) < 0xe000 ? \
        /* UTF16 -> Unicode code point (un)encoding */ \
        /* Note: + is required here; the 0x10000 offset can overlap the shifted bits */ \
        0x10000 + ((H) - 0xd800) * 0x0400 + ((L) - 0xdc00) \
        /* invalid input - TODO: go "boom" - return null for now */ \
        : 0 \
    )

/* Given a supplementary code point, returns the "high" surrogate pair element */
#define SURROGATE_PAIR_HIGH(U) \
    ( \
        /* optional checking */ \
        0x10000 <= (U) && (U) < 0x110000 ? \
        /* Unicode code point -> UTF16 encoding */ \
        /* Note: | would also work in place of + here, since the low 10 bits of 0xd800 are zero */ \
        ((((U) - 0x10000) >> 0xa) + 0xd800) \
        /* invalid input - TODO: go "boom" - return null for now */ \
        : 0 \
    )

/* Given a supplementary code point, returns the "low" surrogate pair element */
#define SURROGATE_PAIR_LOW(U) \
    ( \
        /* optional checking */ \
        0x10000 <= (U) && (U) < 0x110000 ? \
        /* Unicode code point -> UTF16 encoding */ \
        /* Note: | would also work in place of + here, since the low 10 bits of 0xdc00 are zero */ \
        ((((U) - 0x10000) & 0x03ff) + 0xdc00) \
        /* invalid input - TODO: go "boom" - return null for now */ \
        : 0 \
    )

/* Given a supplementary code point, returns the high and low surrogate code pair as an unsigned int */
/* (assumes unsigned int is at least 32 bits wide) */
#define SURROGATE_PAIR(U) \
    ( \
        ((unsigned int)SURROGATE_PAIR_HIGH(U) << 0x10) | (unsigned int)SURROGATE_PAIR_LOW(U) \
    )
# Michael S. Kaplan on Monday, September 26, 2005 2:54 PM:
I would not tend to go boom -- easier to just return the original value if the return was not available....
# Maurits on Monday, September 26, 2005 3:28 PM:
>UTF-32 encoded the same info as UTF-16, but in one flat plane.
I should really think before I post.
Ah... one plane is 0000-FFFF - sixteen bits
Seventeen planes - need five bits to determine the plane...
That's only 21 bits. So UTF-32 is fine with supplementary planes.
In fact, the first eleven bits of every UTF-32 code point are always zero... so we're only using one 2**12'th of the address space, even with all the reserved planes and whatnot...
Ah, room to breathe :)
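The bit counting in this comment can be written out as a short worked calculation. It uses only the figures already in the thread (17 planes of 2^16 code points each, stored in 32-bit units), and it shows that the fraction of the 32-bit space actually used falls between the two nearest powers of two:

```latex
\begin{align*}
\lceil \log_2 17 \rceil &= 5 \quad \text{(bits needed to select one of 17 planes)} \\
5 + 16 &= 21 \quad \text{(bits per code point, leaving } 32 - 21 = 11 \text{ leading zero bits)} \\
\frac{17 \cdot 2^{16}}{2^{32}} &= \frac{17}{2^{16}} \approx \frac{1}{3855},
\qquad 2^{-12} < \frac{17}{2^{16}} < 2^{-11}
\end{align*}
```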
# Michael S. Kaplan on Monday, September 26, 2005 5:08 PM:
Hey, no worries. There are some people who do not even think after they post, let alone before. So you are still one up on many of them.