by Michael S. Kaplan, published on 2005/03/11 22:33 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/11/394359.aspx
SQL Server uses the same linguistic data as that which ships in Windows, and has done so for many versions.
They do not actually call our APIs, they instead took a snapshot of them and they use that to give each SQL Server install consistent behavior no matter what version of Windows they are running on.
It also means they run a little behind us in terms of language support, but there are always downsides to snapshots (on the upside, they sometimes have data that is newer than the operating system upon which it is running, assuming you are on an old enough OS).
Trivia: Versions of the Microsoft Jet Engine prior to 4.0 (the one used by Access 2.0, 95, and 97, for example) depended on Windows for their sorting data for East Asian locales -- which could lead to interesting results if databases crossed versions of Windows. This is why both Jet and SQL Server wanted a better solution going foward! :-)
I thought I'd preface this column with that information. :-)
Anyway, there was a customer who was running queries similar to the following in SQL Server:
if N'เหม' like N'%ม%' select 'true' else select 'false'
if N'เม' like N'%ม%' select 'true' else select 'false'
(I made the text bigger so that even if you cannot read Thai you can see the visual differences between the Thai characters)
The strings in question:
เหม U+0e40 U+0e2b U+0e21 (THAI CHARACTER SARA E + THAI CHARACTER HO HIP + THAI CHARACTER MO MA)
เม U+0e40 U+0e21 (THAI CHARACTER SARA E + THAI CHARACTER MO MA)
ม U+0e21 (THAI CHARACTER MO MA)
The SQL Server 2000 database, being one that contains lots of Thai data, uses the Thai_CS_AS collation. Looking at the queries, they are basically an attempt to see if the third string can be found within the first two. Should be very straightforward since the character in question is there, right?
Well, it turns out that the first query returns true and the second one does not.
But everyone can see the character right there, so "why can't SQL Server?" is a question that comes to mind pretty quickly.
For a hint to the answer, do you recall the post I made entitled How do sort keys work?
SQL Server builds indexes using sort keys just like the ones you can create yourself using the LCMapString API in Windows and the SortKey class in the .NET Framework. That binary representation removes all of the information in the string other than what is needed to get identical results to a string comparison. In the case of Thai that sometimes includes combinations of letters that are treated as if they are a single unique sort element -- in this case the combination of the SARA E and the MO MA.
Suddenly, database developers are scrambling to their applications, since they might have just such querying functionality in applications.
Also, the developers who use the .NET Framework (who ran out in a panic at the same time as those database developers!) were thinking of all of their calls to the IsPrefix, IsSuffix, IndexOf, and LastIndexOf methods off of the CompareInfo object. Do they have the same problem? They frantically write the test code to check:
string string1 = "เม";
string string2 = "ม";
CompareInfo ci = CompareInfo.GetCompareInfo("th-TH");
Console.WriteLine(ci.IndexOf(string1, string2));
After running back, ready to post their flame at me, since the code worked fine and found string1 inside of string2 (returning 1), they get to the words I am about to type. :-)
CompareInfo's IsPrefix, IsSuffix, IndexOf, and LastIndexOf methods in the.NET Framework do not have the same problem, because they are using the original strings, not the sort keys. So the information is still there for them to use. Slightly slower than using sort keys, but in this case more functional.
Now hopefully the database developers are still reading, so I can let them know they do not have to worry. People who use these languages understand why certain groupings of characters behave a particular way. This sort of thing mainly is a problem with developers and testers who do not fully understand the area, which is Why international test is an art (and why there are few fine artists).
Think of it another way -- if you build indexes that are case insensitive, how do you then distinguish between upper and lower case in queries? You can use a binary collation for the comparison, or some collation other than Thai so that you will have not have these linguistically appropriate combinations of characters.
Just don't feel the need to do this for your customers who use these languages -- they will always be happier with linguistically appropriate results. Use this to keep that persistent tester happy. :-)
This post brought to you by who else but "ม" (U+0e21, a.k.a. THAI CHARACTER MO MA)
referenced by
2007/09/18 A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)
2006/10/04 Wild[card] thing, You make my CHAR sing
2006/01/23 It is not just the good SQL queries that have trouble, sometimes
2005/05/11 Achieving case insensitivity
2005/03/16 Reversing sort keys?