Trying to ignore the small stuff is harder, if you're Arabic

by Michael S. Kaplan, published on 2008/11/11 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/11/11/9056745.aspx


Via the Contact link, Alain asked:

Hello Michael,

I ask you about a problem I searched on the net all morning and get no response.

We work à UNESCO (Paris/France) on a multi-lingual database (SQL Server 2005). We actually add Arabic to a English/French/Spanish/Russian thesaurus.

We have Arab people at Alexandria test our application and they complained about not getting response when searching in Arabic with letters having not the same diacritics (e.g. Alif with and without Hamzah).

We use SQL_Latin1_General_CP1_CI_AI, but I tried with Latin1_General_CI_AI and Arabic_CI_AI and got the same result.

My questions : is there a way to add my own collation to a SLQ 2005 server. Or is there a collation just ignoring *all* diacritics for every UNICODE character ? And why does Arabic_CI_AI  not ignore Hamzah on Alif ?

I wonder if I am the only guy around the world searching Arabic text on a SQL Server database. I am not an Arabic reader not speaker, but it seems the the requirment is very basic for Arabic...

Thank you and, please, forgive my poor English

Very good bunch of issues in there that all deserve some coverage! :-)

Starting with the easiest part: SQL collations are terrible and essentially useless in most cases. My words in SQL Server: compatibility collations vs. Window collations are probably the best answer here to explain why not to use them. It is just that given how awful the SQL compatibility collations are for text outside of English, they are pretty much only the default in the US (otherwise SQL Compatibility collations are a bit too retro because Unicode and SQL Collations have nothing to do with each other).

So less than ideal results there are kind of par for the course....

Then there is the fact that Latin1_General_CI_AI and Arabic_CI_AI return the same results. This is actually also expected since both collations use the default table and the only difference between them in SQL Server is how they have different code pages attached to them for non-Unicode columns (1252 for the one, 1256 for the other).

Therefore, this too is expected.

Ok, enough stalling -- let's get too the actual issue -- the incorrect results!

This is a longstanding bug that I have previously described in Is it punctuation, symbol, or diacritic?, which explains the nature of the problem and describes how in some cases NORM_IGNORESYMBOLS will help here when one is dealing with Windows 2000, XP, or Server 2003.

Unfortunately there is no way to set this flag in SQL Server, so in the end there is no collation setting to work around the bug in SQL Server 7.0, 2000, or 2005.

However, Is it punctuation, symbol, or diacritic? explains how Vista and Server 2008 actually fix this longstanding issue. and the cost of fixing eight separate problems with Arabic script collations was just one bug, in Persian (ref: Hello Madda, Hello Father (Iranian style)).

and how does SQL Server get this fix?

Ah, for that you can find the answer in On changing the world, or at least the way people order things in it, which explains that SQL Server 2008 has the absolute latest version of the tables to date when SQL Server shipped, and thus has the fix for this bug in it.

There is, however, no downlevel fix for this problem that has really been around in Windows for as long as Arabic support has been in the product and in SQL Server for as long as Arabic support has been in that product.

Custom collations or any way to modify collations? That is a feature that does not exist in either windows or SQL Server....


This blog brought to you by ب (U+0628, aka ARABIC LETTER BEH)


no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day