Collation != Case (a.k.a. Collation <> Case)

by Michael S. Kaplan, published on 2006/01/11 09:51 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/11/511557.aspx

Ian Young asked the following question recently:

There seems to be a discrepancy in SQL Server in the way case is treated between the UPPER(), LOWER() functions and case-sensitive comparisons.
I would have thought that the following should produce no rows (where Numbers is a suitably large table of integers):

SELECT n, c,
UNICODE(LOWER(c)) AS ln, LOWER(c) AS lc,
UNICODE(UPPER(c)) AS un, UPPER(c) AS uc FROM (
SELECT n, NCHAR(n) COLLATE Latin1_General_CS_AS AS c
FROM Numbers
WHERE n BETWEEN 32 AND 65533
) AS A
WHERE UNICODE(LOWER(c)) <> UNICODE(UPPER(c))
AND LOWER(c) = UPPER(c)

That is, we're looking for Unicode characters where conversion to upper- and lowercase yield different characters (according to their Unicode codepoint), but compare as equal with a case-sensitive, accent-sensitive comparison.
The result is actually (SQL Server 2000, 2005) 176 rows consisting of a bunch of characters (U+04D0 -- U+04F9) from the Cyrillic block and many of the precomposed polytonic Greek characters in the Greek Extended block (U+1F00 -- U+1FFB).
This result is the same irrespective of whether the accent-sensitive or -insensitive collation is used. It also (only tested on SQL Server 2000) gets the same results for Greek, Cyrillic_General, Ukrainian, and Macedonian collations.

Am I missing something here?

A suitably large table of integers:

CREATE TABLE Numbers (n int PRIMARY KEY)
INSERT INTO Numbers
SELECT D0.i + D1.i*10 + D2.i*100 + D3.i*1000 + D4.i*10000 FROM
(SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS D0,
(SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS D1,
(SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS D2,
(SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS D3,
(SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS D4

Ian is correct, there are a large number of characters that fall into this category.

The reason for this is that there are many differences between the operations involved with converting the case of a function and doing case insensitive comparisons.

For example:

The case weight in collation is used in many languages that do not have case but have some other common tertiary distinction that needs to be represented, for example final forms in Hebrew.
Other times, the case weight in collation is used even in languages that have case but also have the need to support other tertiary distinctions, like the Greek final sigma being not exactly the capital simga and not exactly the small one.
In Thai and Indic languages, the case weight is also used in some cases to show tertiary distinctions.
There are many 'one way' case conversions in Unicode that would not be candidates for an OS casing table (which generally needs data that is round-tripable) but which are just fine in collation.
There are also many complex casing operations that cannot be done in the simple casing support of the OS (like the Sharp S, discussed previously).

So in summary, the difference between the operations that Ian noted is an intentional one for linguistic reasons, though I think he wins an award for the most extensive investigation of the issue. :-)

This post brought to you by "ς" (U+03c2, GREEK SMALL LETTER FINAL SIGMA)

# Maurits [MSFT] on 11 Jan 2006 12:39 PM:

This is based on binary operations rather than having to convert to decimal...

create table #nybble (n int)
create table #byte (b int)
create table #word (w int)

insert into #nybble values (0)
insert into #nybble select n + 1 from #nybble
insert into #nybble select n + 2 from #nybble
insert into #nybble select n + 4 from #nybble
insert into #nybble select n + 8 from #nybble

insert into #byte select (n1.n * 16) | n2.n from #nybble n1, #nybble n2
insert into #word select (b1.b * 256) | b2.b from #byte b1, #byte b2

--select * from #word order by w

drop table #nybble
drop table #byte
drop table #word

# Abhinaba Basu [MSFT] on 12 Jan 2006 1:04 AM:

We've been bitten by this badly in the development of Visual Studio team system.

The issue is that source control needs files/folders stored in a Yukon server to be case insensitive because ultimately they are paths on Windows file system which is not case sensitive. The problem is that on Yukon you cannot say something like OrdinalIgnoreCase and we need to use case-insensitive collation sequence which depends on server locale. So the problem is on Turkish locale on my client I can have folders i and İ. However when this gets checked into a Yukon server running Turkish windows it fails as in the server collation both are same!!

# Michael S. Kaplan on 12 Jan 2006 1:17 AM:

Yes, this is a good example of an implementation that is not doing things 100%correctly -- definitely a topic I will be talking more about on another day....

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2010/09/26 If case conversion were harder, people would do it less

2008/02/24 The idea has to do more than just make sense to me (aka How S-Sharp are *you* feeling today?)

2007/10/25 Jokes that aren't really all that funny in the end (aka At least SQL Server isn't on our case)

2007/08/24 Every character has a story #28: U+1e9e (CAPITAL SHARP S)

2007/07/31 If this post really describes a bug, would I actually put it in the WYNN column?

2007/04/10 When methods use collation to 'disturb the peace' we charge them with being 'out of sorts'

2006/12/07 SQL and the CLR: Part 1 (the things we can make work well)

2006/08/08 Collation != case, still

2006/06/02 What the @!#$% is the TERTIARY_WEIGHTS() function for?

2006/03/15 Casing and IgnoreCase are still not the same thing....

go to newer or older post, or back to index or month or day