There is no such thing as a surrogate character (dammit!)

by Michael S. Kaplan, published on 2005/07/27 19:47 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/27/444101.aspx

The title of this post, including the parenthetical note, is something that people associated with the Unicode Standard have to tell people all the time (of course generally people only say that parenthetical note to themselves, and really only because they have to say it so many times!).

The issue is clear in both the Unicode Glossary:

Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.

and the Unicode FAQ:

Q: Are surrogate characters the same as supplementary characters?

A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).
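
To make the two ranges concrete, here is a minimal Python sketch of the distinction the FAQ draws, together with the standard UTF-16 arithmetic that turns one supplementary code point into a pair of surrogate code points. The function names are hypothetical, chosen for illustration only; they do not come from any library.

```python
# Ranges as given in the Unicode FAQ answer quoted above.
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
SUPPLEMENTARY_FIRST, SUPPLEMENTARY_LAST = 0x10000, 0x10FFFF


def is_surrogate_code_point(cp):
    """True for a code point reserved for UTF-16 pairs (never a character)."""
    return SURROGATE_FIRST <= cp <= SURROGATE_LAST


def is_supplementary_code_point(cp):
    """True for a code point outside the BMP (may be a real character)."""
    return SUPPLEMENTARY_FIRST <= cp <= SUPPLEMENTARY_LAST


def to_surrogate_pair(cp):
    """Split one supplementary code point into (high, low) surrogates."""
    if not is_supplementary_code_point(cp):
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = cp - 0x10000              # 20 bits of payload
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
```

The point of the sketch: the pair is a UTF-16 representation of one supplementary character. Neither half is a character by itself.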

In fact, if you look to the Unicode Roadmap, each plane has its own name:

Plane 0: BMP (Basic Multilingual Plane)
Plane 1: SMP (Supplementary Multilingual Plane)
Plane 2: SIP (Supplementary Ideographic Plane)
Plane 14: SSP (Supplementary Special-purpose Plane)

They are supplementary characters, one and all. They are not surrogate characters. Truly.
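
As a small illustration of the plane list, the plane of a code point is simply its value divided by 0x10000, so anything in plane 1 or higher is supplementary. This is a sketch only: the helper names are made up for the example, and the lookup table covers just the four planes named above.

```python
# Only the planes named in the list above; other planes are left unnamed here.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP"}


def plane_of(cp):
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16  # each plane holds 0x10000 code points


def is_supplementary(cp):
    """Supplementary means 'not in plane 0', nothing more."""
    return plane_of(cp) > 0


for cp in (0x0041, 0xD800, 0x1D11E, 0x20000, 0xE0001):
    plane = plane_of(cp)
    name = PLANE_NAMES.get(plane, "unnamed in this sketch")
    print(f"U+{cp:04X}: plane {plane} ({name}), supplementary={is_supplementary(cp)}")
```

Note that U+D800 lands in plane 0: surrogate code points live in the BMP, which is one more reason they cannot be the same thing as supplementary characters.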

This is easy, right?

Of course, even the clearest intention is not always communicated properly, which is why the documentation for the Char.IsSurrogate method has text like "Indicates whether a Unicode character is categorized as a surrogate character" and the Windows CE docs say "For sorting, all surrogate pairs are treated as two Unicode code points. Surrogates are sorted after other Unicode code points, but before the PUA (private user area). Sorting for a standalone surrogate character (that is, either the high or low character is missing) is not supported." I do mind the not-entirely-accurate statement about the collation, but I will talk about that another day!

I do not mind the surrogate character usage like that in the previous paragraph so much, as it is a more benign error -- when people say surrogate character in this context, they mean to say surrogate code point. It is a harmless error, and a lone surrogate even shows up as a NULL glyph as if it were a character of some sort, so we can just fix the documentation language at some point (hopefully soon, but I will not lose sleep if they do not).

The real problem case is when they try to equate the term surrogate character with the term surrogate pair, and then compound it by naming a method that way, like the XmlWriter.WriteSurrogateCharEntity method, whose documentation, in addition to the evil method name, says things like:

When overridden in a derived class, generates and writes the surrogate character entity for the surrogate character pair.

This is a bit harder to fix (not the doc portion, but the method name, which obviously cannot be removed).
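
For the curious, here is roughly what such a method has to do, as a hedged Python sketch: take a surrogate pair, recombine it into the single supplementary code point it represents, and write a numeric character reference for that one code point. The function name is hypothetical and the hexadecimal output format is an assumption made for the example; this is not the XmlWriter implementation.

```python
def char_ref_from_surrogate_pair(high, low):
    """Build a numeric character reference from a UTF-16 surrogate pair.

    Hypothetical helper for illustration; the hex form of the reference
    is an assumption, not taken from any particular XML writer.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value must be a high surrogate (D800..DBFF)")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value must be a low surrogate (DC00..DFFF)")
    # Undo the UTF-16 split: 10 bits from each half, plus the 0x10000 bias.
    cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return f"&#x{cp:X};"


print(char_ref_from_surrogate_pair(0xD834, 0xDD1E))
```

What comes out refers to one supplementary character. At no point does the function handle anything that could be called a surrogate character, since the inputs are two code units and the output is one supplementary code point.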

But we'll figure something out. Eventually.

Until then, please remember what the title of this post is telling you -- there is no such thing as a surrogate character!

This post brought to you by U+D800, the first surrogate code point -- not a surrogate character!
(This code point has come to terms with his lack of character-ness, but has mentioned that the fact that no one else has may put him into therapy)

# Michael on 29 Jul 2005 3:43 AM:

The evil like Char.IsSurrogate comes from the fact that the Char managed type, as well as wchar_t (under Windows), really represents just two bytes in the UTF-16 encoding, not a Unicode character. I always mentally translate it, and then Utf16CodePoint.IsSurrogate does make sense :)

# Michael S. Kaplan on 29 Jul 2005 11:13 AM:

Well, that is the slightly more benign use (IMHO) -- both types are defined for MS platforms as using UTF-16, and it is just asking if the thing in the Char or WCHAR is a surrogate....

# Ben Bryant on 29 Jul 2005 11:37 AM:

Well put. I'll try to always keep this in mind. But looking at XmlWriter WriteSurrogateCharEntity it seems to be named to be consistent with WriteCharEntity which I think is actually incorrect in the first place -- it is a "Numeric Character Reference", not an "Entity Reference" and certainly leaving off the term "Reference" or at least "Ref" is bad usage; it should be WriteCharRef and WriteSupplementaryCharRef for use with UTF-16. They also have WriteEntityRef which is actually an "Entity Reference," so they apparently thought leaving the "Ref" part off would make it a Numeric Character Reference! This whole API shows a real confusion in terms, in particular a craziness around the term "Entity." Dare Obasanjo oversaw this right? He should explain it. But coming back to your point, the Char in the method name may be intending to refer to the output (which is actually a supplementary character) and not the surrogate pair that is input judging from the correlation with WriteCharEntity.


