The SQL app that works fine until you have to support Chinese....

by Michael S. Kaplan, published on 2011/08/25 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/08/25/10198837.aspx

I coming to you based on a question I have post on SQL forum relative to how can I handle Chinese characters in SQL server tables.

I have been pointed to your blog by a guy, hoping you could help me cause I could not find any answer for now.

The scenario is as follow :
I have an application which is getting different text from an SQL server database. Those database can contains different text language inside and based on the system settings I am fetching from my tables proper language ID identify by its short string ( Ex: US-en would be en)

We have then now a customer which request to get Chinese characters. For that he has send us its Chinese text in CSV format and then through how application we import those text inside SQL. After importing we have notice that all imported Chinese characters are shown in SQL server tables as "?".

For now our SQL server version is US running under Windows server 2008 US. We use default SQL server settings for Collation.

As I have never work with Chinese characters yet in SQL I have no idea how to set SQL server properly.

By long-standing convention/decision, the default server collation for the US product is a SQL compatibility collation.

But rather than run through collation choices that would still lead to limitations (both code page 936 and 950 are missing characters), the better strategy in the end will be to use Unicode columns.

A stupid trick that results in smaller databases on average is to use UTF-8 declared as Windows-1252.

Consider your plans very carefully before using it. You don't get certain advantages of storing data in its native character set (case insensitive, correct local sorting, etc.).

It also will give worse results for Chinese (since Han uses 3-4 bytes per characer in UTF-8)....

Actually I've seen one application that's created by the time of one of the first SQL version mandates their user to set locale to US English SQL collation too. (Changing collation is extremely painful if your database contains large number of stored procedures using temp tables.)

Since the application has to deal with Chinese characters, they implemented interesting method to store Chinese characters as Big5 code points in the Chinese character fields. (For example, "，" would be stored as #A141) In this way, common SQL functions such as "like" comparison would actually work too.)

Losing so much of the actual supported functiionality in SQL Server and risking data corruption from valid operations is seldom worth the risk....

But what about Surrogate Pairs (four bytes characters)?

Let's say we have to support Multilingual Data Processing so we cannot switch collation to Chinese. How to limit results then? I mean when you say "SELECT * FROM something WHERE key='chinese'?

OK, let's face it there is probably no way (at least not in MS SQL 2005). I am not sure if it is working in newer versions...

Of course there is a way -- SQL Server collation support in all versions, 2000 and later! Surrogate pairs require SQL 2008 or later collations (or binary ones)....

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.