The myth of cross-product compatibility

by Michael S. Kaplan, published on 2006/08/26 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/26/712848.aspx


Editorial note: there will be a certain type of Drunkard's Walk feel to this post, but that is because the navigation is actually controlled by a specific customer's attempt to understand behavior in SQL Server. The timeline will be a little abbreviated, but I'll try and hit all of the high points....

The other day, when I was talking about Decimal vs. hexadecimal LCIDs, backcompat, and being weird, I made a statement about the irony of finding a bug that could have been found in many different COLLATIONS supported in SQL Server in a Unicode-only locale and a binary sort -- one of the specific collations that has nothing to add to either collation or code page behavior.

In fairness, I should take a bit of that back. The truth is that Unicode only, binary collations have one thing to add to the mix that other collations do not.

They can add a bug that adds inconsistency, makes SQLCLR integration a bit harder, and is very poorly documented, to boot!

To see the problem, let's take the following query and see which collations we are talking about (based on all the lessons we learned from this post and comments thereof!):

SELECT
    name,
    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
    description
FROM
    ::fn_helpcollations()
WHERE
    COLLATIONPROPERTY(name, 'CodePage') = 0 AND
    name LIKE '%_BIN'

This will return just three rows:

Divehi_90_BIN           0 0x00000465 0x00000000 Divehi-90, binary sort
Indic_General_90_BIN    0 0x00000439 0x00000000 Indic-General-90, binary sort
Syriac_90_BIN           0 0x0000045A 0x00000000 Syriac-90, binary sort

Hmmm... where is the Georgian? I mean, Georgian is a Unicode-only locale!

Ok, we will try a slightly different query:

SELECT
    name,
    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
    description
FROM
    ::fn_helpcollations()
WHERE 
    name LIKE 'Georgian%' AND
    name LIKE '%_BIN'

and run it. We get back just one row:

Georgian_Modern_Sort_BIN 1252 0x00010437 0x00000000 Georgian-Modern-Sort, binary sort

It thinks the code page is 1252? That is nothing like CultureInfo("ka-GE").TextInfo.ANSICodePage, at all! Somehow SQL Server has made Georgian a locale with a system code page that does not support any of the Georgian characters that are needed in the language.

So we already know that the sorting doesn't match -- now we know sometimes the codepages don't match, either?

Let's table that for a moment; we'll come back to it with the explanation.

It gets worse. That code page value of 0? That happens to be the number behind CP_ACP, the default system code page. Let's try that out with the following script:

use master
IF DB_ID (N'sql_test') IS NOT NULL
    DROP DATABASE sql_test;
GO

CREATE DATABASE sql_test;
GO

use sql_test

CREATE TABLE sql_cp (
    colD nvarchar (50) COLLATE Divehi_90_BIN NULL,
    colA nvarchar (50) COLLATE Arabic_BIN NULL
)

GO

use sql_test
INSERT INTO
    sql_cp (colD, colA)
VALUES (
    'ابةتثجحخدذرزسشصضطظعغ',
    'ابةتثجحخدذرزسشصضطظعغ')

SELECT * FROM sql_cp;

If you really did pass 0 to MultiByteToWideChar on a system with an Arabic system code page, it would convert the string via cp1256. What does the query return?

????????????????????    ابةتثجحخدذرزسشصضطظعغ

Hmmm.... it must have changed the meaning of 0 to mean the default database codepage. Which in this case is based on the server collation, which happens to be one of the Latin1_General collations.

Ok, we'll just call that one unexpected rather than a bug. But the fact that Georgian is no longer a Unicode only locale when both Windows and the .NET Framework think it is remains a problem.

What about other Unicode only locales like Armenian? Well, they are slightly worse off; since they have no collations of their own (no code page requirement and no unique sort beyond the default table means no unique collation in SQL Server), they have lost their identity, in a way.

And finally we know why this bug in Georgian exists -- because the Traditional Georgian sort is in the default table (along with lots of other Unicode only locales) and since it has the CodePage of 1252, the modern sort has to as well.

But this means that a whole lot of SQL Server collations do not match their analagous cultures in the .NET Framework or locales in Windows. And that is even before we get into the whole rogue version of the sorting tables issue I've mentioned in the past, or the newer Danish/Norwegian problems or the potential upcoming problem with Swedish/Finnish.

This is hardly the only reason for the changes that happened with WinFS, if you will notice it was not really mentioned at all. But these differences had a lot to do with the integration challanges, which had still not been fully worked at the point that the project's energies were redirected.

It is a problem that we in GIFT had to deal with ourselves with a managed child that was engineered to try to fix old problems but then which later had to find itself integrated back with its unmanaged parent for custom locales to work. And we did not even have the excuse of different teams since it was often the very same people. But we decided to dig in and solve the problems -- because that integration is so crucial.

What the next version of SQL Server needs here is a fix to the whole problem. A way to consistently represent the locale, codepage, and collation data that it consumes from Windows and then later on tries to use in concert with the operating system and the .NET Framework....

 

This post brought to you by (U+0f8a, a.k.a. TIBETAN SIGN GRU CAN RGYINGS)


no comments

go to newer or older post, or back to index or month or day