Wild[card] thing, You make my CHAR sing

by Michael S. Kaplan, published on 2006/10/04 03:11 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/04/788429.aspx


How do those lyrics go again?

Wild[card] thing
You make my CHAR sing.
You make everything query
Wild[card] thing.

Wild[card] thing, your results might move me.
But I want to know for sure.
C'mon I'll run with you one more time.
Your results move me.

Or something like that?

Time to get serious now....

Documentation such as that for the LIKE Predicate in SQL Server's T-SQL queries is always something that one can easily look up if needed (though most developers who use SQL Server tend to just sort of learn these things, and they don't feel the need to look them up too often!).

In any case, an interesting point came up (while I was looking at another issue) that I have never seen any documentation about and that several people I asked honestly had no idea about either. First a bit from the above Books Online topic, where wildcards are explained:

Wildcard Description Example
% (percent) Matches zero or more of any character. 'comp%r' matches 'comp' followed by zero or more of any characters, ending in an r.
_ (underscore) Matches any single character. 'comp_ter' matches 'comp' followed by exactly one of any character, followed by 'ter'.
[ ] (square brackets) Matches any single character within the specified range or set. For example [a-z] specifies a range; [aeiou] specifies the set of vowels. 'comp[a-z]re' matches 'comp' followed by a single character in the range of lowercase a through lowercase z, followed by 're'.

'comp[ao]' matches 'comp' followed by a single character that must be either a lowercase a or a lowercase o.

[^ ] (caret) Matches any single character that is not within the specified range or set. For example, [^a-z] specifies a range that excludes lowercase a through lowercase z; [^aeiou] specifies a set that excludes lowercase vowels. 'comp[^u]' matches 'comp' followed by any single character that is not a lowercase u.

Now perhaps if you are a regular reader and remember reading When good SQL queries have trouble...., you may have some idea where I am going with this. :-)

Anyway, let's take a bit of SQL and try some things out. For the purposes of these queries, keep in mind that Traditional Spanish considers the letters ch to be a single unique sort element between c and d....

use master
IF DB_ID (N'dbTest') IS NOT NULL DROP DATABASE dbTest;
CREATE DATABASE dbTest;
GO
use dbTest
CREATE TABLE tblTest (
    colTest varchar (4) COLLATE Latin1_General_CS_AS_KS_WS NULL
)
GO
INSERT INTO tblTest (colTest) VALUES ('chap')

SELECT * FROM tblTest;

SELECT
* FROM tblTest WHERE colTest LIKE '_ap';

SELECT
* FROM tblTest WHERE colTest LIKE '_hap';

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '_ap';

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '_hap'

So if you just go ahead and run this whole script, here will be the results of the five queries:

There you have it! When SQL Server documentation claims that "_ (underscore) matches any single character", what it clearly means is not exactly a single CHARACTER at all. It means a single SORT ELEMENT!

(I have talked about sort elements in previous posts such as this one and this one.)

Of course there are problems here, since both the square brackets and caret style wildcards also refer to single characters but will only refer to sort elements in what they do not specify; everything specified has to be a single UTF-16 code unit, in most cases. Looking at some more queries:

SELECT * FROM tblTest WHERE colTest LIKE '[b-d]ap';            /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest LIKE '[b-d]hap';           /* Returns 1 row  */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[b-d]ap';    /* Returns 1 row  */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[b-d]hap';   /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[ch-d]ap';   /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest LIKE '[^ch]ap';            /* Returns 0 rows */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[^ch]ap';    /* Returns 1 row  */

SELECT
* FROM tblTest WHERE colTest
    COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[^ch-d]hap'; /* Returns 1 row  */

SELECT * FROM tblTest WHERE colTest
   
COLLATE Traditional_Spanish_CS_AS_KS_WS LIKE '[^cdh]hap';  /* Returns 0 rows */

Clearly in all of these cases, just as in the ones before, it is assuming that what you type into the expression is individual UTF-16 code units, even though the engine will usually (maybe always?) treat compressions as single characters later on. This makes the whole definition of "single character" pretty fluid since it means different things depending on whether you are looking at the query beforehand or the engine running the query later.

So has anyone noticed this before and I (and those people writing docs) just haven't ever run across it?

I mean, is that really possible?

Or is this really a pseudo-stealth "feature" created by SQL Server's implementation of linguistic collation that nobody ever noticed before?

And either way, how on earth can this be "fixed" in the future, or even documented? I mean, given the need to avoid breaking existing (albeit undocumented) behavior?

This post brought to you by  C (U+0043, a.k.a. LATIN CAPITAL LETTER C)


# Michael S. Kaplan on 4 Oct 2006 7:16 PM:

I am starting to wonder if people who use SQL Server actually read this blog! :-(

Is there nobody who is either unbelieving of how dumb I am for not knowing this or shocked that they did not know it themselves?

# Eric Duran on 5 Oct 2006 3:02 PM:

I wasn't aware that Traditional Spanish was implementing the deprecated "ch" character. My bet is that the Royal Spanish Academy, foreseeing that our lovely "ch" would make using computers harder, just dropped the whole double characters at all (we used to have ll, rr and ch as single chars).  This was many years ago (10-12, I would say).

Interesting that you pointed out this, but that's yet another reason for not using the Traditional Spanish collation.

# Michael S. Kaplan on 5 Oct 2006 5:33 PM:

Hi Eric,

There are many other languages that include compressions that are valid -- this just happened to be an easy one to show. So one cannot avoid the problem unless one is also avoiding language support!

# Dean Harding on 5 Oct 2006 8:45 PM:

Well, I use SQL Server fairly often. But I tend to stay away from the LIKE operator whenever possible. It's not very index-friendly (at least, it's easy to USE in a non-index friendly way)...

# bobo on 27 Nov 2006 12:18 PM:

you said:

"Attention all database developers out there! I have been somewhat dismayed at getting so few comments"

I did not realize that because someone put something in their blog that was of interest to them, that the rest of the world was obligated to post a comment on the matter.

# Michael S. Kaplan on 27 Nov 2006 12:41 PM:

Obligated?

Not at all.

But usually if I post something that interests people, they comment about it. I was surprised at the lack of interest!

# Mike Dimmick on 27 Nov 2006 12:58 PM:

I don't think the underscore wildcard is actually used all that often. When you're searching for something in the database, you'll tend to use the % wildcard - typically at the end of the search term, if you're trying to take advantage of indexes. Still, the search results would presumably be different for 'c%' vs 'ch%'?

For any serious use of keyword matching, you would probably build a full-text search index, whose behaviour might be different (since the work is farmed out to an MSSearch process, IIRC).

# Luis Lobo Borobia on 30 Nov 2006 1:50 PM:

Wow. This is something i didn't know, and I'm a spanish talker. May be because, to keep ourself out of trouble we always (in our company) install the english version of sqlserver.

By the way, ch, now is no longer used as a single letter for sorting, it is put between cg and ci starting words. But it is a letter in the alphabet by its own. This is from the Real Spanish Academy.

(in spanish)

http://www.rae.es/rae/gestores/gespub000006.nsf/(voAnexos)/arch2A2805E472038A72C125717700366F42/$FILE/Cuestiones%20para%20el%20FAQ%20de%20consultas.htm#ap31

# Brian Gonsalves on 6 Dec 2006 6:34 PM:

I have one for you:

Identical data in identical tables on (supposidly) identical SQL servers.

server A:

select cv_id from cvspec where cv_id like 'cv1048__'

cv_id    

--------

CV1048  

CV10481  

CV104810

ServerB:

select cv_id from cvspec where cv_id like 'cv1048__'

cv_id    

--------

CV104810

Any idea what controls this behaviour?


referenced by

2008/10/06 UCS-2 to UTF-16, Part 4: Talking about the ask

2008/07/25 Let's save some time and call them all IRregular expression engines

2007/12/11 In SQL Server, there is the rest of Unicode (aka the SiaO Incompleteness Theorem)

2007/01/24 Sometimes a WCHAR really *is* just a character....

2007/01/15 With SQL Server (and SQL itself) comes the illogic of 'trailing spaces' (and the myth of fixed width)

2006/11/27 Any database developers reading this?

go to newer or older post, or back to index or month or day