In SQL Server, there is the rest of Unicode (aka the SiaO Incompleteness Theorem)

by Michael S. Kaplan, published on 2007/12/11 10:01 -05:00, original URI:

Previously in this series, I have talked about many things. Like how the distance between A and Z is wider than you might think. how A-Z, A-z, a-Z, and a-z may not mean the same thing, how the wrong range can make you seem insensitive to one's width! (aka Do my V's look fat?), now your ranges also need to ACCENT-uate the positives, how the alphabet does not end at Z, and how different collations implies different ranges (aka Not every table has its THORN).

Amazingly, I have done this all while staying within the 1,229 characters in the default table for Windows (and somewhat fewer for SQL Server).

With Unicode about to cross the 100,000 assigned count of code points, that really is saying something!

The bottom line is that there are a whole bunch of code points in Unicode representing the letters, numbers, punctuation, and symbols of other languages throughout the world, more than a few of which are used by one or more of the collations in SQL Server.

Now if the previous five posts have not convinced you that ranges in LIKE expressions in SQL Server's T-SQL, then the simple fact that the many other alphabets in the world are entirely missed by these various ranges that you might come up with to try to work around the various issues....

Now it is true that in SQL Server 2005 you can use CLR integration to use either managed regular expression support (ref: Regular Expressions Make Pattern Matching And Data Extraction Easier) or the CharUnicodeInfo class.

Now as I pointed out in No way to get that script info I was looking for earlier, regular expressions in the CLR don't have the greatest Unicode version support story in terms of being up to the latest version, though as regular expression may be a more natural way to solve problems working with databases than writing code that would use something like the CharUnicodeInfo class (which for all its flaws has the advantage of supporting Unicode 5.0!).

Though if you do want to go down either road, keep in kind the lessons of Not all in sync quite yet (aka SQL and the CLR and Windows and .NET) since SQL Server is not in sync with the components you will be using.

Plus, even if they were consistent in terms of repertoire, the fact remains that SQL Server (like Windows) tends to sport many similar things near each other even if they are conceptually further away due to their intended use.

So the advantage to staying in the world of T-SQL is that your results will be consistent, if not complete -- whereas these other solutions will be closer to complete, though not always consistent.

Sort of an SiaO Incompleteness Theorem for you to chew on!

Of course there are also many other problems staying in the T-SQL world that I have talked about in the past in posts like Wild[card] thing, You make my CHAR sing and With SQL Server (and SQL itself) comes the illogic of 'trailing spaces' (and the myth of fixed width).

I guess I could easily just cop out, look at the whole mess, and tell you that you are screwed. If I were "Doctor International" I could maybe prescribe you some Valium to help you through the nervous breakdown that all of this randomosity might inspire.

But instead, let's just take a step back, and you can work to base what you are trying to do on the scenario, on what you want to accomplish.

And lay off the clever expressions to fix the bugs since (a) they aren't and (b) they won't, in the long run....


This post brought to you by  C (U+0043, a.k.a. LATIN CAPITAL LETTER C)

no comments

go to newer or older post, or back to index or month or day