by Michael S. Kaplan, published on 2006/06/05 09:54 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/05/617896.aspx
DanS12345 asked via 'Suggest a topic' the following question:
The documentation for SQL Server 2005 says that xml is stored in UTF-16 encoding. In fact, the docs make a point that xml is stored differently than NCHARs, which use UCS-2. This does not seem to be the case. I have done some experiments that seem to indicate that xml containing characters outside the BMP is not properly handled; SQL Server seems to handle xml the same way it handles regular CHARs and NCHARs. Can you clarify what is going on?
Thanks,
Dan
There are some xml tags in the following, I hope they make it through.
Here is an example using &#x20000;, a Unicode character from plane 2. I've tried this with many other non-BMP characters, and also loading from a file with the character code in it instead of literal text with an entity, and the results are the same. This uses XQuery to measure the length of the string inside the A tag.
DECLARE @xml xml
SET @xml = '<A>&#x20000;</A>'
SELECT @xml.query('string-length(.)') AS RETURNS
--------
RETURNS
2
i.e. .query sees &#x20000; as 2 characters, not 1. This is how the SQL Server 2005 documentation says that NCHARs, i.e. text stored as UCS-2, will behave.
Just to confirm the treatment of UTF outside of SQL Server I used some other XML tools to see what results they produced.
The XQuery that measures string-length, when run through Saxon, sees only 1 character, as expected.
Likewise, if <A>&#x20000;</A> is run through the XSLT program shown below under Microsoft MSXML (4 or 6) or Saxon 8, only 1 character is seen, as expected.
<?xml version='1.0'?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="A">
<xsl:value-of select="string-length(.)"/>
</xsl:template>
</xsl:stylesheet>
However, the same XSLT program run through .NET's System.Xml.Xsl (1.1 or 2.0) sees 2 characters, which seems incorrect.
Here is another XQuery, this time to extract the &#x20000; character. It says that the character is not allowed.
DECLARE @xml xml
SET @xml = N'<A>&#x20000;</A>'
SELECT @xml.query('<A>{substring(.,1,1)}</A>')
--------
Msg 6309, Level 16, State 1, Line 3
XML well-formedness check: the data for node 'NoName' contains a character (0xD840) which is not allowed in XML.
Here is the same thing with a BMP character, which works as expected.
DECLARE @xml xml
SET @xml = N'<A>A</A>'
SELECT @xml.query('<A>{substring(.,1,1)}</A>')
----------
<A>A</A>
And one more example, doing a comparison. This comparison returns false, which is not the right answer.
DECLARE @xml xml
SET @xml = N'<A>&#x20000;</A>'
SELECT @xml.query('substring(.,1,1) eq "&#x20000;"')
-------
false
I can sort of understand this last example; it implies that the parameter of the .query function is being treated as an ordinary string rather than as xml. But that would not be a correct way to handle this.
So what's going on here: is SQL Server really using UTF-16 for xml, is it just storing xml as a plain ol' NCHAR, or is there some other fine detail about managing UTF that is not obvious?
There is a bit of confusion here -- though not on Dan's part. It is on the part of a large and complex project (SQL Server) that has some components with full support of UTF-16, some with no support of UTF-16, and most of them somewhere in between those two extremes.
I'll try and separate some of this out now. :-)
A claim of UTF-16 support is not the same thing as a claim that NCRs (numeric character references) for supplementary characters are supported.
What it means is that U+20000, if stored in UTF-16 (which means that underneath it will be the surrogate pair U+D840 U+DC00), can be stored. There is no implied promise that strings will be parsed to convert NCRs into characters just because they are being stored (in truth the parsing will sometimes happen and sometimes not, depending on the component).
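To make the storage half of that concrete, here is a minimal T-SQL sketch of my own (not from Dan's repro) that builds the character out of its two code units with NCHAR() -- 55360 and 56320 are just 0xD840 and 0xDC00 in decimal:
DECLARE @s nvarchar(10)
-- assemble U+20000 from its UTF-16 surrogate pair
SET @s = NCHAR(55360) + NCHAR(56320)   -- 0xD840, 0xDC00
SELECT @s AS STORED                    -- the pair comes back intact, i.e. U+20000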
The reason NCRs exist in both HTML and XML is to allow either the parsing or the storage of text outside of the encoding being used -- in this case to store UTF-32 inside of text that is not encoded in UTF-32.
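Dan's own first repro shows this off nicely, in fact: the NCR is plain ASCII, so it can travel even through a non-Unicode varchar literal that could never hold the character directly, and the xml parser expands it at parse time (a sketch restating his example):
DECLARE @xml xml
-- no N prefix needed; the NCR itself is plain ASCII
SET @xml = '<A>&#x20000;</A>'
SELECT @xml   -- by now the parser has converted the NCR to the surrogate pair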
On the other hand, look at the error message that came back in one of those cases:
Msg 6309, Level 16, State 1, Line 3
XML well-formedness check: the data for node 'NoName' contains a character (0xD840) which is not allowed in XML.
which of course makes it clear that in some cases it does indeed do the parsing of NCRs.
I believe the trouble here is that XML in SQL Server is often processed as UTF-8 rather than UTF-16, and it is illegal to have surrogate code units inside of UTF-8. However, since you put it in as UTF-16 text (albeit via a UTF-32-style NCR), you managed to find a bug -- a point where it converted the NCR to UTF-16 but then converted it all to UTF-8 as-is.
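If that guess is right, the error comes from splitting the pair, not from the character itself; a variation on Dan's query that keeps both UTF-16 code units together ought to pass the well-formedness check (a sketch, assuming the code-unit-based counting described below):
DECLARE @xml xml
SET @xml = N'<A>&#x20000;</A>'
-- take two "characters" -- really the two halves of the surrogate pair --
-- so that no lone surrogate ends up in the result
SELECT @xml.query('<A>{substring(.,1,2)}</A>')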
I probably would not have made the claim in the first place that SQL Server 2000 or SQL Server 2005 supports much beyond UCS-2, beyond some minor support hooks for UTF-16 that happen to be there due to certain components being a bit further along in their evolution.
The length issue is much the same -- the length in NCHARs of a string consisting of a single supplementary character is expected to be 2, not 1. This is true whether SQL Server supports UTF-16 or not (posts like this one discuss the length issue in more detail).
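For example (a sketch in plain NCHAR terms, reusing the NCHAR() pair from the earlier sketch), LEN() counts UCS-2 code units and DATALENGTH() counts bytes, so a lone supplementary character comes out as 2 and 4 respectively:
DECLARE @s nvarchar(10)
SET @s = NCHAR(55360) + NCHAR(56320)  -- U+20000 as its surrogate pair
SELECT LEN(@s) AS UNITS,              -- 2: counted in UCS-2 code units
       DATALENGTH(@s) AS BYTES        -- 4: two bytes per code unit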
In the end, I would call SQL Server more "surrogate neutral" than "surrogate supporting". :-)
Though this, so many years after the surrogate mechanism was first added to Unicode in version 2.1, is bad enough for me to put this post on the "Unicode Lame List". It is time to get this fixed, I think....
This post brought to you by ㄛ (U+311b, BOPOMOFO LETTER O)
(A character listed in Unihan as being derived from U+20000, and there is some slight resemblance!)