by Michael S. Kaplan, published on 2008/12/04 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/12/04/9173687.aspx
Previous blogs in this series of blogs on this Blog:
Now that I have gone out of my way to try to get people to stop thinking only about surrogate pairs, I am now going to talk about a case study that is about someone thinking only about surrogate pairs. :-)
But first a brief foray into something that I wouldn't usually be caught dead talking about....
Given the current economy, lots if people are looking to lower their overall TCO (total cost of ownership).
It tends to make SQL Server look a bit more compelling than Oracle in a lot of cases, not because SQL Server is faster and not just because the SQL Server solutions cost less, but also because the SQL Server DBAs do. :-)
Ok, enough of that -- enough flame bait of that kind, foray over. And onto the actual issue -- questions about migrating from Oracle to SQL Server that came up the other day:
Let us assume that our Customer is having 70,000 Unique characters in oracle . When we migrate this data to SQLServer we have ability to store 65536 unique characters in Unicode data types.
What would happen to remaining characters which are above 65536 Limit ? and how the remaining 4436 (70000-65536) will be stored in SQLServer?
Do we need to change any settings in SQLServer to achieve this (or) If we store any characters which fall beyond 65536 range in Unicode column like Nchar, Nvarchar and Ntext it will be automatically stored as surrogate pairs?
One final question
When we store Characters which is beyond 65536 range they are stored as surrogate pairs which will take up 4 bytes for each character.
So if I have a column with Nchar(4000) I will be able to store maximum of 2000 characters if all characters are beyond 65536 range, and I will be able to store 4000 characters if all the characters are within 65536 range.
Please confirm if my understanding is right.
That's a lot of questions packed into one question, isn't it? :-)
Now first of all, it is important to focus on the irony of being so worried about needing to use up to four bytes per Unicode code point rather than just two when one is looking at converting from a system that uses one, two, three, and four bytes per Unicode code point.
An irony made all the more delicious by the fact that so many Oracle installations use CESU-8 rather than UTF-8 (which Oracles calls UTF8 rather than AL32UTF8), which means that they are usually looking at converting from a system that uses one, two, three, or six bytes per Unicode code point....
Ok, delicious irony savoring over, now back to work.
Now obviously we can't be unhappy about storing supplementary characters (what the customer is thinking of as "the remaining characters which are above the 65336 limit" for a few reasons.
I mean, first of all that it is unlikely that they are using every Unicode code point from U+0001 to U+ffff, given the fact that so many of those code points are not actually usable, meaning that if they truly are using "70,0000 unique characters" then a lot more of them are supplementary characters than just 4,436 of them.
But also we have to realize that UTF-16 storage is going to not be worse than UTF-8 storage, the fact that they are stored in UTF-16 as surrogate pairs which will take up four bytes -- just as they do in Oracle (unless they use the old Oracle format that will take up SIX bytes in their CESU-8 encoding).
Data that really does run the gamut of Unicode is gonna take up the same kind of space, probably more, in UTF-8. the breakdown is like this:
Kind of puts it all in perspective, doesn't it? :-)
Moving on to the question about additional settings, the support is automatic for Unicode code points; no change have to be made, and it will work in any version of SQL Server that supports the Unicode data types (SQL Server 7.0 and later).
Easy!
And I think I have readily dismissed the storage concerns implicit in the last question -- any time the buffers are too small in SQL Server they are even too smaller in Oracle, and possibly too smallest in their most common Unicode encoding....
This blog brought to you by U+10ffff, a non-character holding the position of the biggest theoretical legal character in Unicode
referenced by
2009/06/29 UCS-2 to UTF-16, Part 11: Turning it up to Eleven!
2009/06/10 UCS-2 to UTF-16, Part 10: Variation[ Selector] on a theme...
2008/12/16 UCS-2 to UTF-16, Part 9: The torrents of breaking CharNext/CharPrev
2008/12/09 UCS-2 to UTF-16, Part 8: It's the end of the string as we know it (and I feel ellipses)