Handling multilingual data in SQL Server

by Michael S. Kaplan, published on 2006/01/29 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/29/518777.aspx


Seth Siegal asked in the Suggestion Box:

What is the best practice for searching large table in mixed languages. Context is SQL Server 2000 or SQL Server 2005. The problem is storing in a single column character data from mixed languages and then providing a search capability to find the best match given a search string in some arbitrary language. The column, of course, is Unicode data type with some collation. Is there an optimal default collation? What is recommendation for table design for efficiency -- collation, indexes?

As an example, consider an international directory of business names and a stored procedure to search for a name "like" <some string> where <some string> is user input in any language. Since collation determines comparison rules it seems the appropriate collation is one that best matches the language of the search string. Further, to facilitate matching it seems appropriate to relax restrictions such as case and accent sensitivity. That is a strict binary comparison is not "user friendly." However there are serious performance implications when the collation of the search string does not match the collation of the database column and dynamic casting is used to make them match.

Example (approximate SQL):

MyBigTable (ID int not null, SearchString nvarchar(1000) not null collate <Optimal Collation For Mixed Languages>)
create procedure UserSearch @SearchString nvarchar(1000), @SearchCollation nvarchar(128) as
select * from MyBigTable
where SearchString collate @SearchCollation like @SearchString

The answer to this question is buried in some important implementation details surrounding the way collations work....

The first issue is that every "Windows" (which is to say, not SQL compatibility) collation is a view of everything in the table, according to a particular language or set of languages.

Now this helps the first parr if Seth's question -- any collation can be used if you are sticking with Unicode data (though how you want non-Unicode clients to behave when querying the data may have an influence your final choice).

However, because of this design, it is easy to find situations where weights in one collation will not distinguish as well as weights in another collation, depending on how important the differences are for a given language. As I point out in You can't ignore diacritics when a language does not give them diacritic weight, one person's primary distinction can easily be another person's secondary distinction. Especially in a case where one is planning to ignore case or accent differences, this means that one person's equality is another's inequality.

In order to properly support the querying of multilingual data well, you really do have to do both seek and ordering operations using the collation that will match the queryer's expectations.

But as Seth points out this can lead to serious performance issues.....

Luckily, I haved posted about how to work around that issue. :-)

In the post Making SQL Server index usage a bit more deterministic I explain how to make sure that performance will not suffer with such operations do to non-indexed queries being run. And I highly recoomend this technique be used any time you do have to commonly deal with multiple languages....

 

This post brought to you by "ޝ" (U+079d, a.k.a. THANNA LETTER SHEENU)


# Sabish on 1 Feb 2006 11:51 PM:

I am a Programmer

referenced by

2007/04/11 Microsoft is not uncaron^H^Hing about the issue!

2006/06/07 Performance issues with language specific sorts?

go to newer or older post, or back to index or month or day