In SQL Server, the distance between A and Z is wider than you might think!

by Michael S. Kaplan, published on 2007/12/05 10:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/12/05/6660685.aspx


The other day George asked me via the Contact link:

I use SQL Server quite a bit, but I just hit some confusing behavior reported by a customer. It isn't a localization problem but it does involve sorting or at least matching.

I think I have tracked down the problem, but I'm not sure. Is it possible that

    LIKE '[A-Z]'

and

    LIKE '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'

would return different results?

Do you know what is going on here?

Indeed I do, George. And it is an internationalization problem, and a collation one, too!

If you look at the Windows table (which are slightly ahead of the SQL Server ones), there are about 1,229 letters in the Latin script, all but 20 of which are between A and Z, inclusive.

Now if you take the SQL Server case the numbers look slightly less scary (something like 979 letters in the Latin script, all but 15 of which are between A and Z, inclusive) but either way it puts that expression with the explicit ABCDEFGHIJKLMNOPQRSTUVWXYZ to shame, doesn't it? :-)

Now obviously if you are not using Unicode columns then you aren't really missing out on either 1,209 or 984 characters, but even then there will be some letters missed.

I'll get more info some other interesting nuances about these expressions another day....

 

This post brought to you by Z (U+005a, LATIN CAPITAL LETTER Z)


# Ian Hamilton on 6 Dec 2007 6:46 AM:

I may be a bit thick but I can't see how or why [A-Z] should be different to [ABC....XYZ]. If A = Unicode U+41 and Z = Uncode U+5a, then that to me means 26 characters and it makes no sense to interpret it any other way. I am no expert but to me it just does not make any sense for it to be otherwise

# Michael S. Kaplan on 6 Dec 2007 8:40 AM:

It does not work by saying "these discrete code units by their code points" -- it works by saying "everything that sorts between these letters" -- and all of these letters.

And all of those various "A" like, "B" like, "C" like, and son on characters sort between.....

Not sure if I'd call that something that makes you thick, by any means -- but there are two different possible ways for this to work, and it works in one way and you are expecting it to work in the other. That's all. :-)


referenced by

2007/12/11 In SQL Server, there is the rest of Unicode (aka the SiaO Incompleteness Theorem)

2007/12/10 In SQL Server, different collations implies different ranges (aka Not every table has its THORN)

2007/12/09 In SQL Server, the alphabet does not end at Z!

2007/12/08 In SQL Server, your ranges also need to ACCENT-uate the positives!

2007/12/07 In SQL Server, the wrong range can make you seem insensitive to one's width! (aka Do my V's look fat?)

2007/12/06 In SQL Server, A-Z, A-z, a-Z, and a-z may not mean the same thing!

go to newer or older post, or back to index or month or day