Validation of Unicode text is growing up

by Michael S. Kaplan, published on 2006/12/05 11:17 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/05/1212917.aspx


The other day, in response to my post How are the file names encoded?, Matt Selz commented:

If NTFS allows any unsigned short, would it be more accurate to say that NTFS does not do any encoding?  Should one instead say it is the Win32 subsystem which encodes and decodes characters as UTF-16, and then stores them in and reads them from the raw NTFS file name buffer? 

While this would technically be more accurate not just in this case, but in many others such as the definition of identifiers and the data store in .NET's System.String, it obviously does not sound very good. So it is unlikely that all of those descriptions would change.

However, there are more and more processes that work from a higher standard, as I have already mentioned.

People sometimes find themselves running into problems with the stricter definitions, like in Michael A.'s case:

Hi all

I have a customer who is trying to write data to an XML file by using the Dataset method WriteXml.

This throws an exception :

Invalid high surrogate character (0xDCBA). A high surrogate character must have a value from range (0xD800 - 0xDBFF)
   at System.Xml.XmlTextEncoder.Write(String text)
   at System.Xml.XmlTextWriter.WriteString(String text)
   at System.Data.DataTextWriter.WriteString(String text)
   at System.Data.XmlDataTreeWriter.XmlDataRowWriter(DataRow row, String encodedTableName)
   at System.Data.XmlDataTreeWriter.Save(XmlWriter xw, Boolean writeSchema)
   at System.Data.DataSet.WriteXml(String fileName, XmlWriteMode mode)
   at LDAP_udtræk_V2.Logic.Export.ExportMain.exportToXMLFile(String path) in C:\Documents and Settings\...


To me this seems to be the correct behavior since if we translate the hex to decimal we get:

  Invalid high surrogate character ( 56506 ). A high surrogate character must have a value from range ( 55296 - 56319 )

which indeed shows that the value is out of range.

The data is... ...likely to be encrypted.

So, is there a possibility that the customer could get this error by accident because the data for some reason contains values that are out of range, whether because of the encryption or for any other reason?

I’m not too good with how Unicode and surrogate pairs work, so that is why I’m here.

Any ideas appreciated.

(Don't even get me started on the exception text that talks about a "surrogate character", please -- been there, done that, seen the movie!)

Clearly this is a case where some extra validation is going on in System.Xml, which is noticing that the less stringent System.String, which does not validate for "legal" Unicode, has passed on something that it considers to be crap. Since it is easy to imagine an encryption process creating data that is not a valid Unicode string by these additional rules....
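
To make the stricter rule concrete, here is a minimal sketch in Python of the kind of check a validating writer might apply to a sequence of UTF-16 code units. It is not System.Xml's actual implementation, and the function names are made up for illustration. Note that 0xDCBA from the exception above falls in the low surrogate range (0xDC00 - 0xDFFF); the complaint arises because it sits where a high surrogate, or no surrogate at all, was expected.

```python
# Surrogate ranges as defined for UTF-16.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF   # high (lead) surrogates
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF     # low (trail) surrogates


def is_high_surrogate(unit):
    return HIGH_FIRST <= unit <= HIGH_LAST


def is_low_surrogate(unit):
    return LOW_FIRST <= unit <= LOW_LAST


def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a valid pair.

    `units` is a sequence of 16-bit integers (hypothetical input format).
    """
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high_surrogate(unit):
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                i += 2  # well-formed pair, skip both halves
                continue
            bad.append(i)  # high surrogate with no low surrogate after it
        elif is_low_surrogate(unit):
            bad.append(i)  # low surrogate with no high surrogate before it
        i += 1
    return bad
```

Calling `find_unpaired_surrogates([0x0041, 0xDCBA, 0x0042])` reports index 1, while a proper pair such as `[0xD83D, 0xDE00]` reports nothing. A string type that accepts any 16-bit value never runs this kind of check, which is why the problem only surfaces later.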

So while I doubt the old definitions would be dumbed down in the documentation, I expect more and more of the processes that involve issues like international standards will have stricter rules applied to them. Those people relying on the more forgiving implementation pieces like System.String should beware!

 

This post brought to you by U+DBFF, the last high surrogate code point -- not a surrogate character!
(This code point has come to terms with his lack of character-ness, but has mentioned that the fact that no one else has may put him into therapy)


# Adam on 5 Dec 2006 4:47 PM:

"Since it is easy to imagine an encryption process creating data that is not a valid Unicode string..."

Well, seeing as encryption algorithms tend to produce binary data, I'd have thought this was pretty obvious. Not only that, but not even all *valid* unicode strings are representable in XML. Any unicode string containing the characters U+0000 - U+0008, U+000b, U+000c or U+000e - U+001f is not allowed in XML.
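
The ranges listed above follow from the Char production in XML 1.0, which also excludes the surrogate code points and U+FFFE/U+FFFF. A rough Python sketch of that check follows; the function names are hypothetical, and this is only an illustration of the rule, not any particular XML library's code.

```python
def is_xml10_char(cp):
    """True if code point `cp` matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_illegal_xml10_char(text):
    """Return the index of the first character XML 1.0 forbids, or None."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index
    return None
```

For example, `first_illegal_xml10_char("ab\x0bc")` points at index 2 (the U+000B), while a string containing only tabs, newlines, and ordinary text passes.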

# Maurits [MSFT] on 5 Dec 2006 7:02 PM:

Base64 is a common way to make such strings fit in XML.

# Adam on 6 Dec 2006 4:30 AM:

Base-64? In UTF-16? You're only using 6 out of every 16 bits!!! For Moore's sake, *never* store binary data as base64'd utf-16!

At least *try* something along the lines of base32k:

http://lists.xml.org/archives/xml-dev/200307/msg00505.html

http://lists.xml.org/archives/xml-dev/200307/msg00507.html

It may not be a published standard of any kind, but it shouldn't be that hard to write a robust codec for it and release it under the BSD license (or similar) for whoever needs to interoperate with you.
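
The "6 out of every 16 bits" figure is easy to check. The Python sketch below (the helper name is made up for illustration) measures how many bytes a binary payload occupies once it has been Base64-encoded and the resulting text is stored as UTF-16.

```python
import base64


def utf16_base64_size(payload):
    """Bytes needed to hold `payload` as Base64 text encoded in UTF-16."""
    text = base64.b64encode(payload).decode("ascii")
    return len(text.encode("utf-16-le"))  # little-endian, no byte order mark


original = bytes(300)                      # 300 bytes of arbitrary binary data
stored = utf16_base64_size(original)       # 400 Base64 characters, 2 bytes each
print(len(original), stored)               # prints "300 800"
```

Every 3 input bytes become 4 Base64 characters, and each character takes 2 bytes in UTF-16, so the stored form is 8/3 the size of the original. Stored as UTF-8 or ASCII instead, the same Base64 text would cost only 4/3.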

# Maurits [MSFT] on 6 Dec 2006 11:50 AM:

> You're only using 6 out of every 16 bits

It's still better than my other suggestion of

<?xml ...?>

<binary_data>

   <bit value="on" />

   <bit value="off" />

   <bit value="on" />

   <bit value="on" />

</binary_data>

http://channel9.msdn.com/ShowPost.aspx?PostID=105819

# Adam on 6 Dec 2006 12:17 PM:

*has heart attack*

ZOMG! That's stunningly beautiful in a mushroom-cloud kind of way! :)

