SQL Server has its own version of .NET "ordinal" comparisons

by Michael S. Kaplan, published on 2005/01/23 08:55 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/23/358946.aspx

In my prior post about Comparison confusion: INVARIANT vs. ORDINAL, I talked about the meaning of ordinal comparisons. They are the same as what the C Runtime calls a lexicographic comparison in functions like strcmp and wcscmp -- basically a binary comparison.

SQL Server has its own version of this in its collation mechanisms, known as Binary collations. Of the three, I have to say SQL Server has the best name. So lets give them 10 points here.

There is a topic in MSDN that discusses the differences between the different types of collations in SQL Server >= 2000, but for now I will just stick to the binary collations (I'll cover the other aspects of the SQL Server feature another day). In that topic, a binary collation is defined as follows:

In binary collations, comparisons and sorting are based strictly on the bit pattern of the characters. This is the fastest option. Because uppercase characters are stored with different bit patterns than their corresponding lowercase characters, and accented characters have different bit patterns than characters without accents, binary sort orders are always case-sensitive and accent sensitive. Binary collations also ignore dictionary sequences that have been defined for specific languages. They simply order the characters based on the relative value of the bit patterns that represent each character. While the bit patterns defined for Latin characters, such as 'A' or 'z', are such that binary sorting yields the correct results, the bit patterns for some extended characters in some code pages may be different than the ordering sequence defined in dictionaries for the language associated with a collation. This can lead to occasional ordering and comparison results that are different than what a speaker of the language might expect.

SQL Server also gets credit for not confusing matters further by having an "ignore case" version of the binary comparisons such as what is in the CRT stricmp function or in the upcoming version of the .NET Framework's OrdinalIgnoreCase comparison, which is useful but is probably confusing when you look at definitions like the above for binary orderings. So lets give SQL Server 10 more points.

In looking at a list of the binary collations:

Albanian_BIN
Arabic_BIN
Chinese_PRC_BIN
Chinese_PRC_Stroke_BIN
Chinese_Taiwan_Bopomofo_BIN
Chinese_Taiwan_Stroke_BIN
Cyrillic_General_BIN
Croatian_BIN
Czech_BIN

and so on, a few questions will pop in to the minds of people who have been following the conversation so far:

If it is a binary ordering that is not based on language characteristics, then why are there separate entries for different types of sorts in the same language, e.g. Chinese_PRC_BIN and Chinese_PRC_Stroke_BIN?
If it is a binary ordering that is not based on language characteristics, then why are there separate entries for different languages at all?

Taking those questions out of order, the answer to #2 is that the language groups are important since they help define code page information, which is crucial both for non-Unicode columns and for situations where Unicode columns have to be converted to some code page. The groups match the collation choices for the various non-binary collations in SQL Server. Since this is a valid (though slightly confusing) setting, we'll only take 2 points off for it. Validity goes a long way for me....

I indirectly explained #1 as well -- there is no difference between Chinese_PRC_BIN and Chinese_PRC_Stroke_BIN. They are both there because in the user interface in SQL Server you choose the language and then have the option of choosing binary in the same way that you can choose case sensitivity. But they should give identical results. This is obviously uglier since it serves no real purpose other than to e expose a small flaw in the collation model, so we'll need to take 5 points off for this one (since there are only as few). Obscurity goes a long way for me, too. :-)

One new wrinkle: starting with the next version of SQL Server (code name "Yukon"), which has a Beta 3 coming out soon, there is a new type of binary collation which has a _BIN2 suffix on the name rather than the _BIN suffix. The difference between the two can perhaps be gleaned from the respective descriptions:

Latin1_General_BIN Latin1-General, binary sort
Latin1_General_BIN2 Latin1-General, binary code point comparison sort

Now SQL Server (like Windows and most other Microsoft products) uses UTF-16 for its Unicode support, which means that supplementary characters are treated as surrogate pairs. And it is a little endian architecture, meaning the bytes would look reversed in a binary ordering that moved one 16-bit WORD at a time.

Somewhere between those two issues would be my guess for the difference between a "binary sort" and a "binary code point comparison" sort. Or maybe both. Or neither, even. I honestly am not sure. Let's look at the Books Online information:

Using Binary Collations

The following considerations will help you to decide whether old or new binary collations are appropriate for your Microsoft SQL Server implementation. Support for both BIN and BIN2 collations will continue in future SQL Server releases.

Binary collations sort data based on the sequence of coded values defined in a particular code page. A binary collation in SQL Server defines the language locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity.

Previous binary collations in SQL Server performed an incomplete code-point-to-code-point comparison for Unicode data, in that older SQL Server binary collations compared the first character as WCHAR, followed by a byte-by-byte comparison. For backward compatibility reasons, existing binary collation semantics will not be changed.

Guidelines for Using Binary Collations
If your Microsoft SQL Server 2005 applications interact with older versions of SQL Server that use binary collations, continue to use binary. Binary collations may be a more suitable choice for mixed environments.

Guidelines for Using BIN2 Collations
Binary collations in this release of SQL Server include a new set of pure code-point comparison collations. Customers can choose to migrate to the new binary collations to take advantage of true code-point comparisons, and they should utilize the new binary collations for development of new applications. The new BIN2 suffix identifies collation names that implement the new code-point collation semantics. In addition, a new comparison flag is added corresponding to BIN2 for the new binary sort. Advantages include simpler application development and clearer semantics.

Ok, so my guess about supplementary characters was (apparently) wrong, but the byte-based guess was right. I would give them 10 points for doing this work and documenting it, then subtract 7 for the confusion factor, which seems especially high. For those who are keeping score this means SQL Server is up 16 just on binary collations!

Note that neither the lexicographic comparisons of the C Runtime nor the ordinal comparions of the .NET Framework explain anything about what they are doing here. That certainly hurts both of them. But looking at the source for both of them, they are doing the right thing in both cases, and it probably would not have occurred to me that they might be wrong if SQL Server did not point out that they used to be. So I guess we can forgive everyone their trespasses here.

Except for the word "lexicographic" -- that just seems unforgivable, to me. Perhaps I am oversensitive, but I do not think that word means what they think it means....

This post brought to you by "இ" (U+0b87, a.k.a. TAMIL LETTER I)

# Michael Kaplan on 23 Jan 2005 8:10 PM:

Well, think about the one word that they have to use to tell people what both the ordinal and the lexicographic comparisons --

They say it is a binary comparison. Since that is SQL Server's name anyway, they have the one name that is most likely to be intuitive!

# Peter Ibbotson on 27 Jan 2005 3:24 AM:

Hmm... We reckon that something the SQL server team did with binary collation is borked over the 2000->2005 step, in particular there are some semi obscure problems with space(0x20) vs nul(0x0) (in trailing characters of a string)

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2011/07/01 Once more into the UCS-2 breach, SQL Server marches

2008/11/01 What's the shape of the sort?

2006/12/07 SQL and the CLR: Part 1 (the things we can make work well)

2005/09/14 Extending collation support in SQL Server and Jet, Part 1 (the broad strokes)

go to newer or older post, or back to index or month or day