Once more into the UCS-2 breach, SQL Server marches

by Michael S. Kaplan, published on 2011/07/01 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/07/01/10182228.aspx

It is probably important for me to write fewer blogs in advance.

They are simply too much of a time sink.

I'll now try to avoid presenting some largely fictional numbers to prove my point. If I falter in this task, just pretend I'm a politician and you're on my staff, it'll go smoother.

Now the average blog that I finish writing some time between 4pm the night before and 6:51am the morning of? It gets reviewed a little, but I don't make very many changes.

But the ones that are written much earlier? I'll tweak them and change them delete stuff and add new stuff, trying to "get it right" in almost crazed Charlie Chaplin-esque way.

Though in the end I don't think they read better. In fact, all that over-editing comes out reading slightly worse.

And to add insult to injury, I tend to think of these older pieces as if they are somehow less interesting, more forgettable. Easier to move anytime something interesting comes along.

All that despite the fact that they too might have been the product of some inspiration.


Anyhow, just a couple of days ago, a Mike G who I don't believe to be either Mike G.1 or Mike G.2 asked me over in the Suggestion Box:

I noticed while combing over the SQL Server 2008R2 documentation that nvarchar and nchar only support the UCS-2 subset of UTF-16 (ref: Using Unicode Data) (i.e this would indicate to me that searching by strings or expecting string semantics for strings with surrogate pairs might have bugs), This would seem to me to be a critical oversight.

Now I wouldn't want to judge SQL Server's Unicode support by that Using Unicode Data article, since it does seem to mostly be talk about about a time before supplementary characters even existed as a notion, let alone before any were defined.

There actually are a few parts of SQL Server that readily support the semantic of supplementary characters as specific linguistic elements..

Like the *_BIN2 collations I mentioned in SQL Server has its own version of .NET "ordinal" comparisons and elsewhere.

And in several of the collations first added to SQL Server that I first talked about in On changing the world, or at least the way people order things in it.

Now the support isn't all encompassing, mind you.

I mean, what is first and foremost missing is the case information for scripts like Deseret.

And it may have a somewhat negative impact on the LDS church's opinion of this Blog for me to say it, but the fact that that case insensitive sorts will generally be lacking support for Deseret case folding may not be accurately cast as a "critical oversight".

The sentence before last might have held out Deseret as an example but in Unicode as it stands it is the only cased script currently encoded in Unicode. Though perhaps other parts of the SMP will reverse that trend in some future version of Unicode.

For now I'm inclined to say that despite SQL Server being principally on the UCS-2 side of the UCS-2 vs. UTF-16 debate, it is for most purposes "UTF-16 enough" and the fact that developers can use the .Net Framework to support all of Unicode allow databases to fully support UTF-16 and all of the supplementary characters in it improves the situation tremendously.

Unless you are one of the SQL admins that turns off SQLCLR support, I mean. :-)

You know, in reading over today's blog in the minutes before it is time to go live, I can't say I'm terribly impressed.

Maybe the ones written further in advance are superior.

Or maybe they both suck, a little bit....

Mike G. on 1 Jul 2011 9:20 AM:

Thanks for the timely response, I wasn't expecting something so soon. My concern is mostly related to Azure, where the admins do turn of SQLCLR support, and quite a bit more as well. Given the localization push on the horizon being aware of how SQL Server handles comparisons such as these will be critical, we're not worried about LDS so much as the PRC.

Michael S. Kaplan on 1 Jul 2011 9:59 AM:

My pleasure. :-)

In that case, the new SQL Server colations for Simplified Chinese that are based on government standards for pronunciation and stroke count/order should give you what you are principally looking for. the extra work to not truncate strings in a way that would split surrogate pairs is the only extra care you will need to take....

𐐔𐐲𐑀 𐐏𐐭𐐯𐑊 on 5 Jul 2011 2:41 PM:

I suppose you know the LDS Church no longer actively uses or promotes the Deseret Alphabet, and so their opinion of your Blog is probably safe.

Michael S. Kaplan on 5 Jul 2011 5:48 PM:

Ah, but times do change. Plus there are historical documents to worry about....

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day