With SQL Server (and SQL itself) comes the illogic of 'trailing spaces' (and the myth of fixed width)

by Michael S. Kaplan, published on 2007/01/15 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/15/1469885.aspx


There is a problem with the notion of both trailing spaces and fixed width in SQL Server, when you are using Windows collations.

I bring this up since I have a meeting this week where people will be talking about what are the differences between what Windows provides in comparisons and what SQL Server needs. And it is easiest when one is going into a meeting to discuss to technical issues if one understand the underlying technical information behind people's requirements.... :-)

In theory (if you are a regular reader here) you might even know what the problem is. Because even though I have never talked about it, I have talked about all of the underlying pieces that can lead one to understanding that there is indeed a problem.

Now the theory behind the trailing space behavior is actually based on a wider behavior in the SQL standard that goes beyond Microsoft SQL Server, and it is mainly there to help defined behavior of comparisons between CHAR and VARCHAR (or between NCHAR and NVARCHAR). Since CHAR/NCHAR is presumed to have a fixed length, in theory any comparison between CHAR//NCHAR columns of the same length are supposed to be "easier" due to that fixed length that they share.

But with Windows-based collations, it is the language-independent text elements and the language-dependent sort elements (both discussed previously in posts like this one and this one) that help to make up the actual elements that are compared. For our purpose let's assume that sort elements contain the set of all text elements and extend them with language specific behavior....

Thus on all collations U+00e5 (å, LATIN SMALL LETTER A WITH RING ABOVE) is equal to U+0061 U+030a (å, a.k.a. LATIN SMALL LETTER A + COMBINING RING ABOVE), even though they are not same length. This is a Unicode equivalence that is should always be true.

And on some but not all collations U+00c6 (Æ, a.k.a. LATIN CAPITAL LETTER AE) is treated as equal to U+0041 U+0045 (AE, a.k.a. LATIN CAPITAL LETTER A + LATIN CAPITAL LETTER E), even though they are not the same length. This is not a Unicode equivalence but one defined in individual collations (in this example most of them).

(for the most part text elements only refer to Unicode columns, but sort elements refer to both Unicode and non-Unicode comparisons....)

And since neither SQL Server nor the SQL Standard defines the notion of either text elements or sort elements, the whole conversation about comparisons between fixed length text columns is meaningless in the context of using Windows collations, which are based on the notion of comparing unique sort elements.

Thus the only thing that the weird and hard to define/explain behavior of trailing spaces and how they affect comparisons in SQL Server provide is a source of confusion and bugs, rather than a defined way for a collation function to perhaps behave.

As weird as the behavior I mentioned here, if not significantly weirder. And probably behind a lot of issues like the one Brian Gonsalves mentioned in this comment.

Does that mean that SQL Server needs a new set of text data types whose maximums involve sort elements?

Such a data type seems to really not be necessary or even useful -- the behavior of VARCHAR and NVARCHAR with the length based on the underlying storage seems more meaningful here, and turning off the whole trailing spaces behavior seems more sensible anyway.

To me at least.

In the end, people are going to have an uphill battle trying to describe the requirements behind anything other than padding strings with spaces themselves prior to calling a function like CompareString. Particularly needing new flags for special/different behavior for "trailing spaces". Since such behavior make no sense whatsoever from a collation perspective anyway.

It should be an interesting conversation, in any case.... :-)

 

This post brought to you by  (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI)


# Philihp Busby on 15 Jan 2007 7:21 AM:

Although a white-box tester may be able to prove that those flags are necessary, I'd be interested to see if anyone has ever needed them. Seems like a difficult fix for an edge-case bug that  practically will never empirically happen.

# Dean Harding on 15 Jan 2007 7:25 AM:

To me, CHAR (and to a lesser extent, NCHAR) is really a just hangover from the COBOL days when all strings were fixed length anyway.

In almost all cases, [N]VARCHAR is going to be more efficient than [N]CHAR (except perhaps for strings < 5 characters or so).

And lets not get Norman Diamond started on the fact that CHAR(5) isn't actually storing 5 "characters" at all (whatever definition of "character" you choose to use...) ;-)

# Michael S. Kaplan on 15 Jan 2007 8:11 AM:

Well it keeps PSS busy! (e.g. here, here, and here).

# ReallyEvilCanine on 16 Jan 2007 10:42 AM:

Where outside of English and French is AE=Æ?

Dean: Perhaps CHAR should be changed to GLYPH to prevent the confusion. That goes hand-in-hand with Michael's explanation about U+00e5 (å) vs. U+0061 U+030a. The former is two character long, the latter two, but both are one (and the same) GLYPH.

# Michael S. Kaplan on 16 Jan 2007 11:14 AM:

Actually everywhere, REC, outside of a few languages (Danish Greeenlandic, Norwegian, and Icelandic).

# Michael S. Kaplan on 16 Jan 2007 11:19 AM:

As to the idea of adding the GLYPH notion....

Unicode has a term coined by Mark Davis -- "Grapheme Cluster". But first the question I posed must be answered -- is it useful as a DATATYPE rather than just maybe a T-SQL function or two?

# kevinowen on 16 Jan 2007 11:31 AM:

Isn't the issue with U+00e5 vs. U+0061 U+030a at least somewhat obviated by normalization? The SQL Standard states that:

"An SQL-implementation may assume that all UCS strings are normalized in Normalization Form C (NFC), as

specified by [Unicode15]. With the exception of <normalize function> and <normalized predicate>, the result

of any operation on an unnormalized UCS string is implementation-defined." (ISO/IEC 9075-2:2003 §4.2.8)

Is it fair to assume that SQL Server provides support beyond this, such as implicit normalization during collation operations? I'm guessing it probably is, which would explain why that particular case would still be an issue...

# Michael S. Kaplan on 16 Jan 2007 11:34 AM:

Well, if you normalize then it is, but SQL Server does not do that by default. SQLS for the most part just makes the two strings "the same" for most operations though not so much in T-SQL string handling. Thus the problem persists. :-(


Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2008/10/06 UCS-2 to UTF-16, Part 4: Talking about the ask

2008/07/25 Let's save some time and call them all IRregular expression engines

2007/12/11 In SQL Server, there is the rest of Unicode (aka the SiaO Incompleteness Theorem)

go to newer or older post, or back to index or month or day