by Michael S. Kaplan, published on 2005/07/27 19:47 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/27/444101.aspx
The title of this post, including the parenthetical note, is something that people associated with the Unicode Standard have to tell people all the time (of course generally people only say that parenthetical note to themselves, and really only because they have to say it so many times!).
The issue is clear in both the Unicode Glossary:
Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.
and the Unicode FAQ:
Q: Are surrogate characters the same as supplementary characters?
A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.
There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).
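To make the two ranges concrete, here is a minimal sketch in Python (my choice of language, not something the post uses). The range constants come from the FAQ text quoted above; the helper names are hypothetical and only for illustration. It shows how a single supplementary code point is split into a pair of surrogate code points when it is represented in UTF-16.

```python
# Ranges as quoted in the Unicode FAQ answer above.
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
SUPPLEMENTARY_FIRST, SUPPLEMENTARY_LAST = 0x10000, 0x10FFFF


def is_surrogate_code_point(cp):
    """True for code points reserved for UTF-16 surrogate pairs."""
    return SURROGATE_FIRST <= cp <= SURROGATE_LAST


def is_supplementary_code_point(cp):
    """True for code points beyond the Basic Multilingual Plane."""
    return SUPPLEMENTARY_FIRST <= cp <= SUPPLEMENTARY_LAST


def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not is_supplementary_code_point(cp):
        raise ValueError(f"U+{cp:04X} is not a supplementary code point")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
print(f"U+1D11E -> {high:04X} {low:04X}")  # U+1D11E -> D834 DD1E
```

The character here is U+1D11E, a supplementary character. D834 and DD1E are merely the two surrogate code points that UTF-16 uses to carry it; neither one is a character in its own right.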
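In fact, if you look to the Unicode Roadmap, each plane has its own name, such as the Supplementary Multilingual Plane (Plane 1), the Supplementary Ideographic Plane (Plane 2), and the Supplementary Special-purpose Plane (Plane 14).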
They are supplementary characters, one and all. They are not surrogate characters. Truly.
This is easy, right?
Of course, even the clearest intention will not always be communicated properly, which is why the documentation for the Char.IsSurrogate method has text like "Indicates whether a Unicode character is categorized as a surrogate character", and why the Windows CE docs say "For sorting, all surrogate pairs are treated as two Unicode code points. Surrogates are sorted after other Unicode code points, but before the PUA (private user area). Sorting for a standalone surrogate character (that is, either the high or low character is missing) is not supported.". I do mind the not-entirely-accurate statement about the collation, but I will talk about that another day!
I do not mind the surrogate character usage like that in the previous paragraph so much, as it is a more benign error -- when people say surrogate character in this context, they mean to say surrogate code point. It is a harmless error (a lone surrogate even shows up as a NULL glyph, as if it were a character of some sort), and we can just fix the documentation language at some point (hopefully soon, but I will not lose sleep if they do not).
The real problem case is when they try to equate the term surrogate character with the term surrogate pair, and then compound it by naming a method that way. Take the XmlWriter.WriteSurrogateCharEntity method, whose documentation, in addition to the evil method name, says things like:
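A small sketch of why "surrogate code point" is the right term, again assuming Python (whose strings happen to allow a lone surrogate to be stored). It uses only the standard `unicodedata` module and the built-in UTF-8 codec: the lone code point reports the general category Cs (Surrogate) and cannot be encoded, while a real supplementary character behaves like any other character.

```python
import unicodedata

lone = "\ud800"  # U+D800, the first surrogate code point, on its own

# General category "Cs" means Surrogate: not a letter, symbol, mark, etc.
print(unicodedata.category(lone))  # Cs

# A lone surrogate is not a Unicode scalar value, so strict UTF-8 refuses it.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# A supplementary character, by contrast, encodes without trouble.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
print(unicodedata.category(clef), clef.encode("utf-8").hex())  # So f09d849e
```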
When overridden in a derived class, generates and writes the surrogate character entity for the surrogate character pair.
This is a bit harder to fix (not the doc. portion, but the method name, which obviously cannot be removed).
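What such a method is really describing is a character entity for the one supplementary character that a surrogate pair represents. A rough Python sketch of that idea follows; the function name `surrogate_pair_entity` is hypothetical, and this is not a reproduction of the .NET method's implementation, argument order, or exact output format.

```python
def surrogate_pair_entity(high, low):
    """Hypothetical helper: format a surrogate pair as one hex character reference."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"{high:04X} is not a high surrogate code point")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"{low:04X} is not a low surrogate code point")
    # Recombine the two 10-bit halves into the single supplementary code point.
    cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return f"&#x{cp:X};"


print(surrogate_pair_entity(0xD834, 0xDD1E))  # &#x1D11E;
```

Note that the result names exactly one code point, U+1D11E. The entity refers to a supplementary character, and the pair is only the UTF-16 route to it.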
But we'll figure something out. Eventually.
Until then, please remember what the title of this post is telling you -- there is no such thing as a surrogate character!
This post brought to you by U+D800, the first surrogate code point -- not a surrogate character!
(This code point has come to terms with his lack of character-ness, but has mentioned that the fact that no one else has may put him into therapy)
referenced by
2007/11/12 I'm simply saying that life^H^H^H^Hcharacters, uh... find a way
2007/10/23 If you would wait till I *FINISHED* what I was trying to say, you punk... (aka Premature validation)
2007/10/03 If it ain't UTF-16 then it ain't having no surrogate pairs, baby!
2006/02/06 Maybe there is such a thing as a surrogateS character (dammit!)
2005/09/24 The basics of supplementary
2005/07/31 Why my syndication links were broken....