String.Compare is for sissies (not for people who want SQLCLR consistency)

by Michael S. Kaplan, published on 2005/06/09 11:25 -07:00, original URI:

Yesterday at the TechEd booth in which I was sitting, someone was asking me how to get the comparisons in the .NET Framework to be consistent with the ones in SQL Server 2005.

This is a lot harder than it has to be, unfortunately!

First we will start with the constants used by Windows from winnls.h in the Platform SDK:

#define NORM_IGNORECASE       0x00000001  // ignore case
#define NORM_IGNORENONSPACE   0x00000002 
// ignore nonspacing chars
#define NORM_IGNORESYMBOLS    0x00000004 
// ignore symbols
#define NORM_IGNOREKANATYPE   0x00010000  // ignore kanatype
#define NORM_IGNOREWIDTH      0x00020000 
// ignore width
#define SORT_STRINGSORT       0x00001000  // use string sort method

These definitions almost match the ones used by the .NET Framework, in the CompareOptions enumeration:

CompareOptions.IgnoreCase              1   (0x00000001)
CompareOptions.IgnoreKanaType          8   (0x00000008)
CompareOptions.IgnoreNonSpace          2   (0x00000002)
CompareOptions.IgnoreSymbols           4   (0x00000004)
CompareOptions.IgnoreWidth            16   (0x00000010)
CompareOptions.None                    0   (0x00000000)
CompareOptions.Ordinal        1073741824   (0x40000000)
CompareOptions.StringSort      536870912   (0x20000000)

You can see where the differences are: NORM_IGNOREWIDTH/CompareOptions.IgnoreWidth, NORM_IGNOREKANATYPE/CompareOptions.IgnoreKanaType, and SORT_STRINGSORT/CompareOptions.StringSort. Let us keep these two differences in mind, they will become important soon.

In both Shiloh (SQL Server 2000) and Yukon (SQL Server 2005), the very useful COLLATIONPROPERTY has two very cool attributes you can grab:

Thus if I run the following query:

    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    COLLATIONPROPERTY(name, 'ComparisonStyle') as ComparisonStyle,
FROM ::fn_helpcollations()

You get back a nice big table that has the exact information for the code page used in WideCharToMultiByte/MultiByteToWideChar calls, and the LCID/dwCmpFlags values to use in a CompareString call (excerpt here, I do not want top duplicate the whole 1029-row table Yukon returns!):

Collation name       CP   LCID Flags   Description
Albanian_BIN         1250 1052 0       Albanian, binary sort
Albanian_BIN2        1250 1052 0       Albanian, binary code point comparison sort
Albanian_CI_AI       1250 1052 196611  Albanian, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Albanian_CI_AI_WS    1250 1052 65539   Albanian, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Albanian_CI_AI_KS    1250 1052 131075  Albanian, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Albanian_CI_AI_KS_WS 1250 1052 3       Albanian, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Albanian_CI_AS       1250 1052 196609  Albanian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Albanian_CI_AS_WS    1250 1052 65537   Albanian, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Albanian_CI_AS_KS    1250 1052 131073  Albanian, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Albanian_CI_AS_KS_WS 1250 1052 1       Albanian, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Albanian_CS_AI       1250 1052 196610  Albanian, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Albanian_CS_AI_WS    1250 1052 65538   Albanian, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Albanian_CS_AI_KS    1250 1052 131074  Albanian, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Albanian_CS_AI_KS_WS 1250 1052 2       Albanian, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Albanian_CS_AS       1250 1052 196608  Albanian, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Albanian_CS_AS_WS    1250 1052 65536   Albanian, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Albanian_CS_AS_KS    1250 1052 131072  Albanian, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Albanian_CS_AS_KS_WS 1250 1052 0       Albanian, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive

I could not find an easy T-SQL way to make the flag values show up as hexidecimal. But the mappings would be:

Collation name       Flags     
Albanian_BIN         0x00000000
Albanian_BIN2        0x00000000
Albanian_CI_AI       0x00030003
Albanian_CI_AI_WS    0x00010003
Albanian_CI_AI_KS    0x00020003
Albanian_CI_AI_KS_WS 0x00000003
Albanian_CI_AS       0x00030001
Albanian_CI_AS_WS    0x00010001
Albanian_CI_AS_KS    0x00020001
Albanian_CI_AS_KS_WS 0x00000001
Albanian_CS_AI       0x00030002
Albanian_CS_AI_WS    0x00010002
Albanian_CS_AI_KS    0x00020002
Albanian_CS_AI_KS_WS 0x00000002
Albanian_CS_AS       0x00030000
Albanian_CS_AS_WS    0x00010000
Albanian_CS_AS_KS    0x00020000
Albanian_CS_AS_KS_WS 0x00000000

Updated query to give the hex values for LCID and ComparisonStyle directly (thank you, James Todd!):

    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
FROM ::fn_helpcollations()

So, if you take that flag value, convert all of the 0x00020000 to 0x00000010 (and 0x00010000 to 0x00000008!), you then have a value you can plug into the CompareInfo.Compare method.

But none of this will work with String.Compare!

If you want behavior between SQL Server and the CLR to have any chance of parity, you must use the methods off of CompareInfo.

Parity will seldom be 100%, because all of the SQL Server 2000 collations are based on the tables that shipped with Windows 2000 (just after Beta 1, before Beta 2), and Whidbey actually uses the same tables as Windows Server 2003. You will get much closer with the *_90 collations that were added in Yukon, though.

A lot of dancing around and converting and munging to get parity -- too much, I think. There is definitely room for people to improve the experience here in future releases!


This post brought to you by "วพ" (U+01fe, a.k.a. LATIN CAPITAL LETTER O WITH STROKE AND ACUTE)

# James Todd on Thursday, June 09, 2005 1:21 PM:

Hi Michael,

To convert the comparison style to hexadecimal, convert it into a 4-byte binary, like so:

COLLATIONPROPERTY(name, 'CodePage') as CodePage,
CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
FROM ::fn_helpcollations()

That gets me results like I expect:

Albanian_BIN 1250 1052 0x00000000 Albanian, binary sort

Albanian_CI_AI 1250 1052 0x00030003 Albanian, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive


# Maurits on Thursday, June 09, 2005 3:37 PM:

involves adding a custom system message:

SQL is:
execute sp_addmessage 50001, 1, '0x%x'

[123 in hex] = formatmessage(50001, 123)

execute sp_dropmessage 50001

output is:
123 in hex

# Michael S. Kaplan on Thursday, June 09, 2005 4:12 PM:

James, awesome -- that is just what I was looking for. Thanks!

# James Todd on Thursday, June 09, 2005 8:04 PM:

You're quite welcome Michael, I'm happy to help. And thank you for this interesting topic!

# Maurits on Friday, June 10, 2005 12:47 PM:

Just to clarify, I wasn't trying to trump James' SQL solution, which is flat-out better than mine... didn't think of CAST()-ing to binary. I simple didn't see before I posted. A moderation threading issue, perhaps? :)

# Michael S. Kaplan on Friday, June 10, 2005 4:43 PM:

Maurits, no worries. James posted his solution "anonymously", so it did not become visible until after I unmoderated it. I did not assume any ill intent. :-)

# Richard Mitchell on Wednesday, June 29, 2005 5:39 AM:

I am having an interesting problem sorting this out ( excuse the pun ). When using a binary collation for example _BIN, _BIN2. These two sort orders are slightly different however the flags that would be passed to CompareInfo.Compare are identical...
Albanian_BIN 1250 1052 0 Albanian, binary sort
Albanian_BIN2 1250 1052 0 Albanian, binary code point comparison sort.

I'm especially having problems sorting for
example the strings 100,10-2,102 consistently
in SQL_Latin1_CP1_CI_AS, even when ordering by Latin1_General_BIN it doesn't work without setting the flag for StringSort on the CompareInfo.Compare() call.

# Michael S. Kaplan on Wednesday, June 29, 2005 7:41 AM:

Hi Richard --

There is no way to make things work exactly with binary collations, though _BIN is *almost* identical to an Ordinal sort. There is no collation in .NET that is the same as _BIN2 (I'll post about that another day). Note that the flag settings are those that would be passed to the Win32 CompareString function, which does not support binary comparisons -- which is why the Flags values are 0 in this case....

For the old SQL collations, there is no way to match completely, since the order is a pre-existing one from versions of SQL Server before 7.0.

# Anonymous on Tuesday, July 05, 2005 12:51 PM:

Since these immortal words were spoken by the voice of Tim Blaney to Ally Sheedy, I think every...

referenced by

2007/10/25 Not all in sync quite yet (aka SQL and the CLR and Windows and .NET)

2006/08/26 The myth of cross-product compatibility

2006/03/18 Just in case you thought I was blathering. :-)

2005/11/29 SQL Server's cultural sensitivities

2005/11/08 SQL Compatibility collations are a bit too retro for me

2005/11/07 SQL Server and the CLR have different ways of getting the money

2005/10/09 Extending collation support in SQL Server and Jet, Part 3 (THAT CLASS)

2005/09/18 Extending collation support in SQL Server and Jet, Part 2 (generating sort keys)

2005/07/08 Real developers use CompareInfo's Compare (Part 1)

2005/07/06 Coming soon

2005/07/05 'Need more input, Stephanie!'

2005/06/16 More on locales in SQL Server

2005/06/10 TechEd Orlando: Day 5

2005/06/10 TechEd Orlando: Day 4

go to newer or older post, or back to index or month or day