Why do the high surrogates have the low numbers?

by Michael S. Kaplan, published on 2005/07/31 20:50 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/31/445850.aspx


This is a question with a true 'drive on a parkway, but park on a driveway' feel to it, but one that I have been asked by many people.

If you look at the surrogate range and its definition in Section 3.8 of the Unicode Standard:

3.8 Surrogates

D25 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.

D25a High-surrogate code unit: A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair.

D26 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.

D26a Low-surrogate code unit: A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair.

• High-surrogate and low-surrogate code points are designated only for that use.

you may not find the conformance definitions to be too terribly useful here -- they confirm what we already knew. So what is the story?

Well, a lot of it has to do with the way human beings try to equate what we understand to what we do not.

I remember trying to explain our collation weighting system to someone, and the way we gave the items that come earlier 'less weight' so that they come first. He was confused because he was thinking about it like an indicator that went from 0 to 100 -- the items you wanted to have first would thus be 'heavier' so they would sink to the bottom of the list.

Now this person was not 'wrong' conceptually; it was just that his model did not match ours. :-)

So it is with the high and low surrogates. The high ones, which come first any time you have a legal surrogate pair, are assigned first. Since they are assigned earlier in the range of possible code points, they have lower numbers (0xD800 is a lower number than 0xDC00 in any computer language I have ever heard of). But no one was really thinking about the low/high surrogate distinction in terms of code point values; they were thinking of the 'high that comes first' instead.
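To make the numbers concrete, here is a minimal Python sketch of the standard UTF-16 pairing arithmetic for a supplementary code point. The function name to_surrogates and the constant names are made up for illustration and are not taken from any library. It shows the 'high' surrogate landing in the numerically lower range while coming first in the pair, and it also shows that the high surrogate carries the upper bits of the value being encoded.

```python
# Illustrative sketch only: to_surrogates, HIGH_START and LOW_START are
# hypothetical names chosen for this example.
HIGH_START = 0xD800  # first high-surrogate (leading) code unit
LOW_START = 0xDC00   # first low-surrogate (trailing) code unit


def to_surrogates(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000             # a 20-bit value
    high = HIGH_START + (offset >> 10)        # upper 10 bits, comes first
    low = LOW_START + (offset & 0x3FF)        # lower 10 bits, comes second
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```

Whatever supplementary code point goes in, the first value of the pair falls in 0xD800 to 0xDBFF and the second in 0xDC00 to 0xDFFF, so the unit called 'high' is always the smaller number.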

In case you are still rebelling against the conceptual disconnect, keep in mind that people say "WE'RE #1" to mean that they have a higher ranking than #2 and #3 and so on, despite the fact that the numbers are lower. That may help people to see that we each have our own assumptions about ranking and ordering that do not always use the same model....


