Searching encoded strings in SQL Server 2000?
by Michael S. Kaplan, published on 2005/10/13 14:45 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/13/480718.aspx
A guy named Michael (who is not me) asked the following question in the Suggestion Box:
Is it possible to search an NVARCHAR column containing string data encoded in a number of different encoding types?
I'm dealing with iso-8859-15, big5, utf-8, iso-8859-1, us-ascii encoding types and need to search the strings for keywords using either full-text indexing or the LIKE clause.
The quick answer is no.
The slightly less quick answer is that there is no way to really store text in a way that is telling a lie about what it is encoded with and then search it in those muliple forms.
Now I am not advocating this issue for moral reasons (though it is strangely compelling to me to be able to claim that my code is honest!).
SQL Server does have the ability in its non-Unicode data types to store data in different encodings and have SQL Server properly search across them. As a rule you do not ever want to lie about the encoding because SQL Server uses that as a hint on how it can use the data and what can be done with it. It is definitely one way to approach the desire to stire data in different encodings (the better way would be to store the data in Unicode and then convert any time you have to communicate with legacy systems that do not understand Unicode....
This post brought to you by "ᡁ" (U+1841, a.k.a. MONGOLIAN LETTER ZHI)
# Jerry Pisk on 13 Oct 2005 3:28 PM:
Isn't NVARCHAR Unicode string data? Regardless of how it's encoded internally you should only put valid Unicode strings to it (mapping from whatever your source character set is to Unicode). How it's encoded inside of MSSQL should not matter, should it?
# Michael S. Kaplan on 13 Oct 2005 3:35 PM:
Hi Jerry -- you are right. Only Unicode data should be there. Internals matter since nothing can contain all that Unicode does.
# Maurits [MSFT] on 13 Oct 2005 6:04 PM:
nvarchar is always UTF-16, right?
And varchar encoding is determined by the collation options for the server, database, table, or column (level of granularity differing by SQL Server version)?
But if you want to store /different encodings/ in a /single column/, you'd have to use varbinary...
CREATE TABLE InternationalTextExample
The UnicodeText would be LIKEable and full-text-indexable, sure...
But is there a SQL expression I could use to convert the LocalEncodedText to nvarchar using the LocalEncodingScheme? I've been digging through CAST() and CONVERT() and COLLATE documentation, but I can't quite see it.
I suppose another method would be to use
CREATE TABLE InternationalTextExample2
KoreanText varchar(200) COLLATE Korean_Wansung_CI_AS,
Latin1Text varchar(200) COLLATE SQL_Latin1_General_Cp1_CI_AI
and store equivalent forms on a "best effort" basis.
# Jerry Pisk on 14 Oct 2005 2:03 PM:
You do not have to use varbinary, you can still use nvarchar and lie about what characters are in there. For example, you put the yen character in as U+005C (as it is in 932) instead of the correct u+00a5 at the same time another row would have a character U+005C that actually means the won sign (in 949). SQL Server will think that you're putting in the "whack" in both cases. I think the original question was about this kind of shenanigans, how to search the string data when it's not actually Unicode.
# Michael S. Kaplan on 14 Oct 2005 3:37 PM:
Jerry, you really need to avoid that, as there are countless ways that this type of methodology can destroy data. I highly recommend NEVER LYING ABOUT AN ENCODING.
If you ignore this and later on are hurt by your decision, be sure to rememer who tried to warn you!
# Maurits [MSFT] on 16 Oct 2005 9:22 AM:
(Sung to the tune of The Bells of Notre Dame)
You can lie to yourself and your server...
You can claim that you haven't a qualm.
But you never can run from,
Nor hide what you've done from
The very Eyes of Unicode!
# Jerry Pisk on 17 Oct 2005 12:15 PM:
Michael, I would personally never do anything like that, I was just trying to guess what the original question was about. I go through a lot of trouble to find out the actual encoding, recently I was dealing with mainframe produced data that was being sent in what was described as ASCII. It took me a long time (weeks) to convince the mainframe folks that ASCII only defines the 7-bit characters and that the rest is in a code page and about a week after that to find out what that code page was. Of course I had management on my back the whole time because to them I was just delaying our project becuase of something that was not important…
Please consider a donation
to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.
go to newer or older post, or back to index or month or day