Maybe there is such a thing as a surrogateS character (dammit!)

by Michael S. Kaplan, published on 2006/02/06 10:01 -05:00, original URI:

(No, the title of this post does not contain a typo!)

I have a regular reader of this blog who is a 12-year-old young man named Dean.

He has an interesting take on my post "There is no such thing as a surrogate character (dammit!)".

Although he did not really follow all of the Unicode take on the evilness of the term "surrogate character," he pointed out that the real problem was not that "there is no such thing as a surrogate character" at all.

He suggested that we should allow people to call these characters that are made up of two surrogate code units by a simple term:

surrogateS character

(the emphasis is mine)

When he first suggested it, I went back through previous mails from Dean, which convinced me that his age claim was genuine (up to and including his delight that I used the word "dammit" in a post title!).

It struck me as a brilliant compromise, one that deals with the natural tendency people seem to have to call these entities "surrogate characters" by shifting the battlefield so that the language mavens, the grammar police, and the wordinistas can start battling for us!

And to be honest, Dean suggested that some of these mavens could perhaps help the cause, citing this post and several Language Log posts on the language maven issue.

Why not have these busybodies do some work for us, just for a change? :-)

Clearly there are two surrogate code units there, so calling the two of them a surrogate character is an obvious pluralization mismatch.

What do you think?

In my opinion, a touchdown (with the extra point), a field goal, and a safety for Dean, 12 points that the Seahawks could have used to win the Super Bowl yesterday! :-(


This post brought to you by "𐠠" (U+10820, a.k.a. U+d802 U+dc20, a.k.a. CYPRIOT SYLLABLE PI, a proud surrogates character!)
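For anyone who wants to check the arithmetic behind that sign-off, here is a minimal Python sketch of how a supplementary code point is split into its two UTF-16 surrogate code units. The helper name `to_surrogate_pair` is made up for illustration; the cross-check at the end uses Python's built-in `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its (high, low) UTF-16 surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10820)  # CYPRIOT SYLLABLE PI
print(f"U+{high:04X} U+{low:04X}")      # prints "U+D802 U+DC20"

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U00010820".encode("utf-16-be")
print(encoded.hex())                    # prints "d802dc20"
```

One character, two code units: which is the whole case for the plural.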

# Chris on 6 Feb 2006 1:24 PM:

As the Guinness Guys Say...


# Ben Bryant on 6 Feb 2006 1:24 PM:

Only if you aren't referring to a single 16-bit code point. But very nice touch!

# Maurits [MSFT] on 6 Feb 2006 7:28 PM:

A character is a character.  Surrogates are an artifact of trying to squeeze seventeen planes of data into sixteen bits* -- the Waterbed Theory of Complexity says that if you oversimplify in one area you get horribly confusing behavior elsewhere.

Surrogate-ness is not a quality of characters per se, but a quality of their UTF-16 representation.  There are alternative encodings.

* Yes, I know...

# Michael S. Kaplan on 6 Feb 2006 8:06 PM:

Yes, but people continue to call them surrogate characters, so why not correct their improper use of the language? :-)

# Steve on 7 Feb 2006 8:05 AM:

Correct me if I'm wrong (I usually am), but isn't it true that the word 'surrogate' only refers to the UTF-16 representation? So, for example, the two UTF-8 code units required to represent a character like 'Ĉ', U+0108, (random example) are not actually called surrogates, even though the principle is the same.

In which case, there is no real ambiguity to the term 'surrogates character' because it could only possibly be referring to the UTF-16 representation of a character...

Hehe, pedantry is fun... :o)

# Michael S. Kaplan on 7 Feb 2006 10:29 AM:

Hi Steve,

You are correct -- surrogates characters are a UTF-16-only phenomenon (both high and low surrogates are illegal in UTF-8 and UTF-32).
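A rough way to see this for yourself, sketched in Python. The `can_encode` helper is a hypothetical name; the behavior shown is that of Python 3's built-in strict codecs, which refuse to encode a lone surrogate, and is offered as an illustration rather than a statement about any other library.

```python
def can_encode(text, encoding):
    """Return True if `text` encodes cleanly with the strict error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


lone_high = "\ud802"          # a high surrogate with no partner
pi = "\U00010820"             # CYPRIOT SYLLABLE PI
c_circumflex = "\u0108"       # Steve's example, 'Ĉ'

# Lone surrogates are rejected outside of UTF-16 pairs.
print(can_encode(lone_high, "utf-8"))      # prints "False"
print(can_encode(lone_high, "utf-32-be"))  # prints "False"

# The character itself encodes fine, with no surrogate values involved.
print(pi.encode("utf-8").hex())            # prints "f090a0a0" (four code units)
print(pi.encode("utf-32-be").hex())        # prints "00010820" (one code unit)
print(c_circumflex.encode("utf-8").hex())  # prints "c488" (two code units, not surrogates)
```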


