by Michael S. Kaplan, published on 2011/08/25 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/08/25/10198837.aspx
The other day, Serge asked me via the Contact link:
I coming to you based on a question I have post on SQL forum relative to how can I handle Chinese characters in SQL server tables.
I have been pointed to your blog by a guy, hoping you could help me cause I could not find any answer for now.
The scenario is as follow :
I have an application which is getting different text from an SQL server database. Those database can contains different text language inside and based on the system settings I am fetching from my tables proper language ID identify by its short string ( Ex: US-en would be en)
We have then now a customer which request to get Chinese characters. For that he has send us its Chinese text in CSV format and then through how application we import those text inside SQL. After importing we have notice that all imported Chinese characters are shown in SQL server tables as "?".
For now our SQL server version is US running under Windows server 2008 US. We use default SQL server settings for Collation.
As I have never work with Chinese characters yet in SQL I have no idea how to set SQL server properly.
Hope you could give me a lift
Regards
Serge
By long-standing convention/decision, the default server collation for the US product is a SQL compatibility collation.
One that cannot handle Chinese in its non-Unicode columns.
But rather than run through collation choices that would still lead to limitations (both code page 936 and 950 are missing characters), the better strategy in the end will be to use Unicode columns.
And of course to use Unicode throughout the process, so you never lose data....
Joshua on 25 Aug 2011 9:18 AM:
A stupid trick that results in smaller databases on average is to use UTF-8 declared as Windows-1252.
Consider your plans very carefully before using it. You don't get certain advantages of storing data in its native character set (case insensitive, correct local sorting, etc.).
Michael S. Kaplan on 25 Aug 2011 12:51 PM:
It also will give worse results for Chinese (since Han uses 3-4 bytes per characer in UTF-8)....
cheong00 on 25 Aug 2011 6:34 PM:
Actually I've seen one application that's created by the time of one of the first SQL version mandates their user to set locale to US English SQL collation too. (Changing collation is extremely painful if your database contains large number of stored procedures using temp tables.)
Since the application has to deal with Chinese characters, they implemented interesting method to store Chinese characters as Big5 code points in the Chinese character fields. (For example, "," would be stored as #A141) In this way, common SQL functions such as "like" comparison would actually work too.)
Joshua on 26 Aug 2011 11:49 AM:
The usage model that I encountered was Chinese names in otherwise English text.
Michael S. Kaplan on 26 Aug 2011 4:22 PM:
Losing so much of the actual supported functiionality in SQL Server and risking data corruption from valid operations is seldom worth the risk....
Pawel Dyda on 2 Sep 2011 11:20 AM:
But what about Surrogate Pairs (four bytes characters)?
Let's say we have to support Multilingual Data Processing so we cannot switch collation to Chinese. How to limit results then? I mean when you say "SELECT * FROM something WHERE key='chinese'?
OK, let's face it there is probably no way (at least not in MS SQL 2005). I am not sure if it is working in newer versions...
Michael S. Kaplan on 2 Sep 2011 12:18 PM:
Of course there is a way -- SQL Server collation support in all versions, 2000 and later! Surrogate pairs require SQL 2008 or later collations (or binary ones)....