It is not just the good SQL queries that have trouble, sometimes

by Michael S. Kaplan, published on 2006/01/23 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/23/515987.aspx


It was way back in March of 2005 that I talked about a class of problems in a post entitled When good SQL queries have trouble.

In that post I contrasted the behavior of SQL Server and the behavior of .NET and I made the claim that SQL Server would never see the individual characters where .NET always would.

Allow me to correct myself a little bit. :-)

I will suggest some different strings:

เก์    U+0e40 U+0e01 U+0e4c     (THAI CHARACTER SARA E + THAI CHARACTER KO KAI + THAI CHARACTER THANTHAKHAT)

ก์     U+0e01 U+0e4c             (THAI CHARACTER KO KAI + THAI CHARACTER THANTHAKHAT)

There are two important differences with these two strings:

  1. Both of these combinations exist in the compression tables for Thai, and
  2. one is actually a subset of the other.

So, when looking at the following code:

using System;
using System.Globalization;

CompareInfo ci = CompareInfo.GetCompareInfo("th-TH");

Console.WriteLine(ci.IndexOf("0123\u0e40\u0e01\u0e4c456", "\u0e40"));
Console.WriteLine(ci.IndexOf("0123\u0e40\u0e01\u0e4c456", "\u0e01"));
Console.WriteLine(ci.IndexOf("0123\u0e40\u0e01\u0e4c456", "\u0e4c"));

What will be seen in the console will not be what you get for the non-Thai and the binary cultures:

4
5
6

For Thai, it will instead be:

-1
-1
6

The set up for the post probably gives away the reason -- if a compression in seen that starts off with the character for which the search is happening, then it will not see the individual character. But anytime the character being seached for is not the start of a compression, then the search attempt succeeds.

This is clearly an implementation-specific issue, just like the original SQL Server one. Though the .NET Framework issue makes me a little more uncomfortable since if you are looking at it from a black box testing point of view, there is a real inconsistency that is hard to get a good handle on. Though the advice of the original post does help reassure me somewhat:

People who use these languages understand why certain groupings of characters behave a particular way.

Since those people would not expect to find these cases ever, they won't usually be surprised when they are missed (and when they are hit, some sort of assumption about an over-eager program will probably be assumed).

Which is not to say that it isn't a bug, or that some thought into fixing it should be considered. It just explains why it has never been reported before even though it has existed in Windows for as long as compression have! :-)

(Thanks once again to Ryan Cavalcante for managing to ferret out a hard to find issue that led to finding another not-entirely-related one!)

 

This post brought to you by "เ" (U+0e40, a.k.a. THAI CHARACTER SARA E)

 


no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day