It's a good thing the UTC has access to a thesaurus!

by Michael S. Kaplan, published on 2011/09/07, original URI: http://blogs.msdn.com/b/michkap/archive/2011/09/07/10207073.aspx


Unicode is a complex standard.

Not everyone finds it easy or exciting to read (there are parts I have read that I would recommend to insomniacs who fail to respond to strong drugs!), but part of documenting a complex standard is sucking it up and trying to capture the issue anyway.

It can be quite easy to trip over the text sometimes, though....

Like the other day, when Andrew pointed out an issue that he had found confusing:

Actually, whether surrogate pairs are valid in UTF-8 has been confusing me in the past days, because some documents tell exactly what you say, however, some documents still mention the similar thing in UTF-8, which made me think that it looks still valid in UTF-8. If I misunderstood anything in these documents, please help correct me. Thanks.

UNICODE 6.0 Spec:
3.8 Surrogates
D75 ...

Sometimes high-surrogate code units are referred to as leading surrogates. Low surrogate code units are then referred to as trailing surrogates. This is analogous to usage in UTF-8, which has leading bytes and trailing bytes.

[Andrew:] My understanding is no matter how they are called, such a code point sequence is possible in UTF-8.

MSC10-C. Character Encoding - UTF8 Related Issues:
Broken Surrogates

Encoding of individual or out of order surrogate halves should not be permitted. Broken surrogates are invalid in Unicode and introduce ambiguity when they appear in Unicode data. Broken surrogates are often signs of bad data transmission. They can also indicate internal bugs in an application or intentional efforts to find security vulnerabilities. 

[Andrew:] Broken surrogates are invalid. Does it imply that normal surrogate pairs are valid?
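
For readers who want "individual or out of order surrogate halves" made concrete, here is a minimal sketch at the UTF-16 code unit level. The function name `find_broken_surrogates` is made up for illustration, and the input is assumed to be a plain list of 16-bit integers. It only shows what "broken" means; it does not by itself answer the question about UTF-8.

```python
def find_broken_surrogates(units):
    """Return the indices of unpaired surrogate code units.

    `units` is assumed to be a list of 16-bit integers (UTF-16 code units).
    A high surrogate (0xD800-0xDBFF) is well-formed only when it is
    immediately followed by a low surrogate (0xDC00-0xDFFF).
    """
    broken = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: accept it only with a low surrogate right after.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the whole pair
                continue
            broken.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate before it.
            broken.append(i)
        i += 1
    return broken


well_formed = find_broken_surrogates([0xD83D, 0xDE00])   # proper pair
out_of_order = find_broken_surrogates([0xDE00, 0xD83D])  # halves swapped
lone_half = find_broken_surrogates([0x41, 0xD83D, 0x42]) # single half
```

A proper pair yields an empty list, while the swapped and lone cases report the offending positions.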

Corrigendum #1: UTF-8 Shortest Form:
D36

(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-8 sequences shall not be generated by a conformant process.

[Andrew:] The surrogate pair here is named as “an irregular UTF-8 code unit sequence”, not “invalid” or “illegal”. Does it mean it’s still valid?
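
To see what the six-byte sequence in D36 (c) looks like next to the ordinary encoding, here is a small sketch. The helper names `utf8_3byte` and `strictly_decodable` are hypothetical, and U+10400 is picked only as an example of a supplementary character. The last two checks report what Python's built-in strict UTF-8 decoder does, which is one implementation's behavior and not a reading of the standard's terminology.

```python
def utf8_3byte(code_unit):
    """Apply the generic three-byte UTF-8 bit pattern to a 16-bit value."""
    return bytes([
        0xE0 | (code_unit >> 12),
        0x80 | ((code_unit >> 6) & 0x3F),
        0x80 | (code_unit & 0x3F),
    ])


def strictly_decodable(data):
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U00010400"  # example supplementary character (an assumption)

# Split the character into its UTF-16 surrogate pair.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[0:2], "big")
low = int.from_bytes(utf16[2:4], "big")

# Six-byte form: each surrogate pushed through the three-byte pattern.
six_byte = utf8_3byte(high) + utf8_3byte(low)

# Four-byte form: the code point encoded directly.
four_byte = ch.encode("utf-8")

print(six_byte.hex(" "), strictly_decodable(six_byte))
print(four_byte.hex(" "), strictly_decodable(four_byte))
```

In this sketch the same character comes out as `ed a0 81 ed b0 80` one way and `f0 90 90 80` the other, and only the four-byte form gets through the strict decoder.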

It is easy to see where an honest reading can lead to confusion, especially since the reader may be one person with a specific question in mind, while the standard itself was written by many people over many years, with a lot of different concerns guiding them.

Irregular?

Invalid?

Illegal?

Thankfully the members of the Unicode Technical Committee are well-read, and have access not only to a thesaurus but also to their own glossary (though one can't find those particular terms there, many other terms that could be confusing are defined in it).

I'll leave the deconstruction of the arguments here as an exercise for the reader, to see who wants to try and tackle it....

If no one does then eventually I'll do it!


