Not all in sync quite yet (aka SQL and the CLR and Windows and .NET)

by Michael S. Kaplan, published on 2007/10/25 15:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/25/5686727.aspx


Regular reader Jan Kučera asks over in the Suggestion Box:

Hello,

Yet another possible suggestion from my thoughts... If one has (yes, I am the one :)) a central database for multi-cultural data (that means, in one column the entries could be in English, Hebrew, Chinese and so on) and expects it to be quered from multiple cultures, what collation should one choose for the database?

I guess that - to honor the "sort in locale, which user expects - the user's locale" rule - on should use COLLATE in the SELECT statement. But how do I get the SQL value for COLLATE from CultureInfo instance?

And last one, how about specifying the collation in LINQ to SQL? I haven't seen any LINQ example using collate keyword so I'm getting afraid this was quite forgotten in LINQ, wasn't it?

Thank you for sharing any of your thoughts,

    Jan

Jan's guess is correct, though I must admit that he is also correct that SQL Server is not going out of its way to make the mapping easy!

But I'll try to help out here a bit if I can, as it ends up being easier than it looks....

We'll start with a SQL query, the one from my old post String.Compare is for sissies (not for people who want SQLCLR consistency), which by the way when I was looking at stats the other day is inexplicably one of my often-read posts! :-)

The query:

SELECT
    name,
    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
    description
FROM ::fn_helpcollations()

Now this will give you the huge master list of all the collations, expanded for the various styles but contracted to make sure identical sorts with identical code pages have only a single collation to represent them.

Obviously one could add a WHERE clause to translate CultureInfo.CompareInfo.LCID to a SQL collation:

WHERE COLLATIONPROPERTY(name, 'LCID') = 1036

and then you will get all of the French collations, for example.

Now at this point the collations that have been folded together intrude.

Let's tale the following three LCIDs (which use the same sort in SQL Server 2005 and earlier though not in Vista as I discuss in The disunification of Norwegian and Danish sorting):

If we run a query looking for any of them:

SELECT
    name,
    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
    description
FROM ::fn_helpcollations()
WHERE
    COLLATIONPROPERTY(name, 'LCID') = 1030 OR
    COLLATIONPROPERTY(name, 'LCID') = 1044 OR
    COLLATIONPROPERTY(name, 'LCID') = 2068

Then in SQL Server 2005 it will return 20 rows:

Danish_Norwegian_BIN 1252 0x00000406 0x00000000 Danish-Norwegian, binary sort
Danish_Norwegian_BIN2 1252 0x00000406 0x00000000 Danish-Norwegian, binary code point comparison sort
Danish_Norwegian_CI_AI 1252 0x00000406 0x00030003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CI_AI_WS 1252 0x00000406 0x00010003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CI_AI_KS 1252 0x00000406 0x00020003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CI_AI_KS_WS 1252 0x00000406 0x00000003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_CI_AS 1252 0x00000406 0x00030001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CI_AS_WS 1252 0x00000406 0x00010001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CI_AS_KS 1252 0x00000406 0x00020001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CI_AS_KS_WS 1252 0x00000406 0x00000001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_CS_AI 1252 0x00000406 0x00030002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CS_AI_WS 1252 0x00000406 0x00010002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CS_AI_KS 1252 0x00000406 0x00020002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CS_AI_KS_WS 1252 0x00000406 0x00000002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_CS_AS 1252 0x00000406 0x00030000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CS_AS_WS 1252 0x00000406 0x00010000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CS_AS_KS 1252 0x00000406 0x00020000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CS_AS_KS_WS 1252 0x00000406 0x00000000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive
SQL_Danish_Pref_CP1_CI_AS 1252 0x00000406 0x00030001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 183 on Code Page 1252 for non-Unicode Data
SQL_EBCDIC277_CP1_CS_AS 1252 0x00000406 0x00030000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 212 on Code Page 1252 for non-Unicode Data

Notice that none of them fit under anything other than 0x0406, Danish.

So how do you get to this collation from the other two LCIDs?

Unfortunately, there is no programmatic way out there and the only place it happens to show up in documentation is a table in the documentation topic entitled Collation Settings in Setup, which maps locales from Windows 2000/XP/Server 2003 to the ideal collation to use in SQL Server (you can ignore the various flag settings and use the ones you prefer, with one sxception -- US English should use Latin1_General, NOT SQL_Latin1_General_CP1_CI_AS, because as I have pointed out time and time again, SQL comaptibility collations really suck. :-) 

Going the other direction is easier, since you can always create a CultureInfo or a CompareInfo using the LCID that is available to every collation.

To answer that last question, as LINQ to SQL: .NET Language-Integrated Query for Relational Data points out:

SQL uses collations to determine equality and ordering of strings. These can be specified on a SQL Server Instance, a database, a table column, or an expression.

The translations of the functions implemented so far do not change the collation or specify a different collation on the translated expressions. So if the default collation is case-insensitive, functions like CompareTo or IndexOf can give results that differ from what the (case sensitive) .NET functions would give.

The methods StartsWith(str)/EndsWith(str) assume the argument str is a constant or an expression that is evaluated on the client. That is, it is currently not possible to use a column for str.

There does not appear to be any inbuilt solution in LINQ, unfortunately.

I'll continue on with the attempting to solve the problem via description as we wait for the documenting of the workarounds...

 

This post was brought to you by a bunch of Unicode characters....


# Michael S. Kaplan on 26 Oct 2007 5:34 AM:

Now I believe there are hooks here like GetCultureInfo(string, string) (ref) and such, but I am not sure how much of that is hooked up on the SQLCLR side yet?

# Jan Kučera on 31 Oct 2007 4:52 AM:

Hello Michael,

thank you for getting to it, it explains a lot to me. Well I'm completely unfamiliar with SQLCLR so I'll have to check this technology out.

So, to ensure my thoughts: if there is no culture prefered over another one in queries, it completely does not matter what the server's default/database collation is set to?

If I use COLLATE, does the server store indices for later use? Or could I somehow specify for which cultures it should build the indices?

     Thanks, Jan

# Michael S. Kaplan on 31 Oct 2007 8:06 AM:

I talk about how to index specific collations separately in this post....

# Michael S. Kaplan on 31 Oct 2007 11:39 AM:

As for the server/db collation, there are specific benefits to choosing them carefully, but that is a much larger topic, one that goes beyond the scope of what I was covering here. Another day....


Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/03/27 The disunification of Norwegian and Danish sorting ( SQL Server 2008 Edition!)

2007/12/11 In SQL Server, there is the rest of Unicode (aka the SiaO Incompleteness Theorem)

2007/11/26 When yesterday's workaround becomes tomorrow's potential solution...

go to newer or older post, or back to index or month or day