If it ain't UTF-16 then it ain't having no surrogate pairs, baby!

by Michael S. Kaplan, published on 2007/10/03 10:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/03/5259246.aspx

The other day, Ramanathan asked:


I have the following surrogate character  𠄃  that can be encoded as &#55360;&#56579; or as &#x20103;. If it's encoded as &#55360;&#56579; it fails to parse using .NET XmlReader class. It also does not work in the ActiveXObject("Msxml2.DOMDocument"). However it works with the .NET XmlDocument class. (code snippet and error shown below)

However the &#x20103; encoding works well with all the Xml parsers. Is there something wrong with the &#55360;&#56579; encoding? Is the &#x20103; style of encoding the recommended way to encode surrogate characters?


           string s = "<ROOT>" + AntiXss.XmlEncode("abc 𠄃 def") + "</ROOT>";
           XmlDocument xmlDoc = new XmlDocument();
           xmlDoc.LoadXml(s);  // load the markup; without this call FirstChild would be null
           string text = xmlDoc.FirstChild.InnerText;
           System.Console.WriteLine("Element data using XmlDom: {0}", text);
           XmlReader reader = XmlReader.Create(new StringReader(s));
           bool b = reader.Read();
           text = reader.ReadElementContentAsString();
           System.Console.WriteLine("Element data using XmlReader: {0}", text);

     Element data using XmlDom: abc ?? def

Unhandled Exception: System.Xml.XmlException: '?', hexadecimal value 0xD840, is an invalid character. Line 1, position 9.
   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
   at System.Xml.XmlTextReaderImpl.ThrowInvalidChar(Int32 pos, Char invChar)
   at System.Xml.XmlTextReaderImpl.ParseNumericCharRefInline(Int32 startPos, Boolean expand, BufferBuilder internalSubsetBuilder, Int32& charCount, EntityType&entityType)
   at System.Xml.XmlTextReaderImpl.ParseCharRefInline(Int32 startPos, Int32& charCount, EntityType& entityType)
   at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
   at System.Xml.XmlTextReaderImpl.ParseText()
   at System.Xml.XmlTextReaderImpl.ParseElementContent()
   at System.Xml.XmlTextReaderImpl.Read()
   at System.Xml.XmlReader.SetupReadElementContentAsXxx(String methodName)
   at System.Xml.XmlReader.ReadElementContentAsString()
   at xss.XmlReaderIssue() in c:\ramu\code\cs\xss.cs:line 70
   at xss.Main(String[] args) in c:\ramu\code\cs\xss.cs:line 16

Regular readers might know what is going on here from that very first part if they keep in mind that the default encoding for XmlDocument is UTF-8....

Well, that and they look at the character in question:

&#55360;&#56579; or &#x20103;

or the piece of the exception text:

'?', hexadecimal value 0xD840, is an invalid character.

In UTF-8, you cannot use surrogate pairs as they are illegal everywhere other than UTF-16.

You have to use either the UTF-32 NCR (&#x20103;) or the UTF-8 byte sequence for the character (F0 A0 84 83).
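
As a rough illustration of how those two forms relate to the surrogate pair in the question, here is a minimal C# sketch that combines the pair into its code point and then prints the hexadecimal NCR and the UTF-8 bytes. The class and method names (`SupplementaryDemo`, `ToCodePoint`, `ToNcr`, `ToUtf8Hex`) are invented for this example; only the `char`, `Encoding`, and `BitConverter` calls are standard .NET.

```csharp
using System;
using System.Text;

public static class SupplementaryDemo
{
    // Combine a UTF-16 surrogate pair into the single code point it stands for.
    public static int ToCodePoint(char high, char low)
    {
        return char.ConvertToUtf32(high, low);
    }

    // Format a code point as a hexadecimal numeric character reference.
    public static string ToNcr(int codePoint)
    {
        return "&#x" + codePoint.ToString("X") + ";";
    }

    // Show the UTF-8 bytes of a code point as space-separated hex.
    public static string ToUtf8Hex(int codePoint)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        // 55360 = 0xD840 and 56579 = 0xDD03, the two halves from the question.
        int codePoint = ToCodePoint('\uD840', '\uDD03');
        Console.WriteLine("U+{0:X}", codePoint);   // U+20103
        Console.WriteLine(ToNcr(codePoint));       // &#x20103;
        Console.WriteLine(ToUtf8Hex(codePoint));   // F0 A0 84 83
    }
}
```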

Looking at past posts like "There is no such thing as a surrogate character (dammit!)" you can get more info on supplementary characters like U+20103. And definitely be sure to keep surrogate pairs like U+d840 U+dd03 as a UTF-16 only thing....

But one important issue to not lose sight of here is that the original bug is in the page creation. The line of code was:

           string s = "<ROOT>" + AntiXss.XmlEncode("abc 𠄃 def") + "</ROOT>";

And if that is saved to a file then whatever saves it needs to do the proper encoding. Writing the character out as a pair of surrogate NCRs when the page is UTF-8 is the bug; using an NCR value for the code point is a workaround....
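
For illustration only, here is one way the encoding step could emit a single NCR per code point instead of one per UTF-16 code unit. `NcrEncoder.Encode` is a hypothetical helper written for this sketch, not the AntiXss implementation; it assumes ASCII-only output is wanted, escapes only `<`, `>` and `&`, and does not validate control characters.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class NcrEncoder
{
    // Hypothetical helper: escapes markup characters and writes every
    // non-ASCII character as one hexadecimal NCR per code point.
    public static string Encode(string text)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsSurrogate(c))
            {
                // ConvertToUtf32 throws ArgumentException for an unpaired
                // surrogate, which has no valid representation in XML.
                int codePoint = char.ConvertToUtf32(text, i);
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else if (c == '<') sb.Append("&lt;");
            else if (c == '>') sb.Append("&gt;");
            else if (c == '&') sb.Append("&amp;");
            else if (c > 0x7F) sb.Append("&#x").Append(((int)c).ToString("X")).Append(';');
            else sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string s = "<ROOT>" + Encode("abc \U00020103 def") + "</ROOT>";
        Console.WriteLine(s); // <ROOT>abc &#x20103; def</ROOT>

        // Feed the result to the same XmlReader calls used in the question.
        using (XmlReader reader = XmlReader.Create(new StringReader(s)))
        {
            reader.Read();
            string text = reader.ReadElementContentAsString();
            Console.WriteLine(text == "abc \U00020103 def");
        }
    }
}
```

The sketch throws on an unpaired surrogate rather than guessing at a replacement, since there is nothing valid it could write in that case.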


This post brought to you by 𠄃 (U+20103, a.k.a. U+d840 U+dd03, a CJK Extension B ideograph)

# John Cowan on 3 Oct 2007 9:29 PM:

You've conflated two issues here.  On the one hand, XML character references never permit surrogates in *any* character encoding: they match up with Unicode code points, and &#xD800; to &#xDFFF; are illegal, period, even if the encoding is UTF-16.

On the other hand, actual surrogate code units (as distinct from references) are as you say permitted in the UTF-16 encoding family only; they are not allowed in UTF-8 or the UTF-32 family.
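
The rule about character references described in this comment corresponds to the Char production in the XML 1.0 specification, which lists the code points a document (and therefore a character reference) may contain. A minimal sketch of that check follows; the class and method names (`XmlCharCheck`, `IsXmlChar`) are invented for the example.

```csharp
using System;

public static class XmlCharCheck
{
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // The surrogate range D800-DFFF falls in the gap between the ranges.
    public static bool IsXmlChar(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void Main()
    {
        Console.WriteLine(IsXmlChar(0xD840));  // False: &#55360; is not a legal reference
        Console.WriteLine(IsXmlChar(0xDD03));  // False: &#56579; is not a legal reference
        Console.WriteLine(IsXmlChar(0x20103)); // True:  &#x20103; is fine
    }
}
```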
