Not technically wrong

by Michael S. Kaplan, published on 2006/07/06 03:01 -04:00, original URI:

Sergei asks via the Contacting Michael... link:

Hello, Michael!

I have a question for you about Unicode.

The MSDN article Surrogates and Supplementary Characters contains some (I think) inaccurate information:

Naturally, most code points beyond the BMP do not yet have characters assigned to them, but this gives Unicode the potential to define 1,114,112 characters (that is, 2^16 * 17 characters) within the code point ranges U+0000 to U+10FFFF.


Using the surrogate mechanism, UTF-16 can support all 1,114,112 potential Unicode characters.

It is not possible to define 1,114,112 characters because of the values like D800-DFFF, FFFE, FFFF. (Maybe some others?)

And is U+0000 a character?

Thank you for your blog


Well, this is one of those weird areas, I guess. :-)

I mean, U+0000 is NULL, and even if it is not literally defined by Unicode, it is obviously defined in the world of software with NULL-terminated strings. :-)

And in all 17 planes U+[[#]#]fffe and U+[[#]#]ffff are reserved to allow (among other things) for them to be used internally by software as a sentinel of some sort.
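For a sense of scale, here is a minimal Python sketch that lists the code points matching that U+[[#]#]fffe / U+[[#]#]ffff pattern. The names PLANES and plane_end_reserved are made up for illustration, and the sketch only covers the two code points at the end of each plane, not any other reserved ranges.

```python
# Enumerate the last two code points of every plane (U+nFFFE and U+nFFFF).
# PLANES and plane_end_reserved are illustrative names, not a standard API.
PLANES = 17

def plane_end_reserved():
    """Return the code points U+nFFFE and U+nFFFF for planes 0 through 16."""
    return [(plane << 16) | low
            for plane in range(PLANES)
            for low in (0xFFFE, 0xFFFF)]

reserved = plane_end_reserved()
print(len(reserved))              # 34
print(f"U+{reserved[0]:04X}")     # U+FFFE
print(f"U+{reserved[-1]:04X}")    # U+10FFFF
```

So that pattern accounts for 34 code points out of the whole range.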

And high surrogate code units have a clear semantic, as do the low surrogate ones.

Now these purposes may be (in fact, in my mind they are) less noble than an assignment that serves a useful linguistic purpose like a letter in an alphabet.

They may even be less noble than the vast array of symbols that exist in Unicode (though this point is perhaps more debatable!).

But either way, it is clear that using surrogate pairs to represent supplementary characters provides a mechanism that makes 1,114,112 different character assignments representable within UTF-16.
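The arithmetic behind that number can be sketched in a few lines of Python. This is only an illustration, counting the way the quoted article does (every BMP code point plus every surrogate pair combination); to_surrogate_pair and the constant names are hypothetical helpers, not part of any library.

```python
# Surrogate pair arithmetic, assuming the standard UTF-16 ranges:
# high surrogates D800..DBFF, low surrogates DC00..DFFF (1,024 values each).
HIGH_START = 0xD800
LOW_START = 0xDC00

def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into two code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                       # 20-bit value
    high = HIGH_START + (offset >> 10)          # top 10 bits
    low = LOW_START + (offset & 0x3FF)          # bottom 10 bits
    return high, low

bmp_code_points = 0x10000                       # 65,536 single code units
supplementary = 1024 * 1024                     # 1,048,576 high/low combinations
total = bmp_code_points + supplementary

print(total)                                    # 1114112
print(total == 2**16 * 17)                      # True
print([hex(u) for u in to_surrogate_pair(0x10113)])   # ['0xd800', '0xdd13']
```

The 1,024 * 1,024 pairs cover exactly the 16 supplementary planes, which together with the BMP gives the 17 planes in the quoted figure.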

What other things Unicode does with them later to whittle down availability within this huge collection is another matter entirely. :-)


This post brought to you by 𐄓 (U+10113, a.k.a. AEGEAN NUMBER FORTY)

Adam on 6 Jul 2006 8:34 AM:

