by Michael S. Kaplan, published on 2005/05/14 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/14/417384.aspx
Uwe Keim asked (in the suggestion box) several questions about encoding support and strings in the .NET Framework:
Michael, I need your knowledge about how all this encoding stuff on strings works in .NET (I searched for hours without finding useful things).
Maybe you can enlighten me a little bit. Maybe I can motivate you with some Italian Limonata? :-)
Well, no Limonata needed (I have a big stack of cases of it already <grin>). I thought I would answer the questions here even without the generous bribe :-)
- how do I detect the encoding of the content of a System.String?
Ah, that part is easy -- System.String is always assumed to be UTF-16 LE (little endian). Always. If you convert it to anything else, you get a byte array in the encoding you convert to.
Now this is a change from the old world of VB and VBA and VBScript, where people would often use the String type to store non-Unicode string things, occasionally getting into trouble when conversions happened. But it is how things are now.
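To make the string-versus-bytes distinction concrete, here is a minimal sketch in Python rather than .NET. It is only an analogy: Python's str is not stored as UTF-16 the way System.String is, but the idea that "text" has no byte encoding until you convert it to a byte array is the same. The variable names are made up for illustration.

```python
# Analogy only: Python's str is a sequence of code points, not UTF-16 LE
# like System.String, but encoding it to bytes works the same way in spirit.
text = "\u0c39"  # TELUGU LETTER HA

# Converting the string to bytes is where an encoding gets chosen.
utf16le_bytes = text.encode("utf-16-le")
utf8_bytes = text.encode("utf-8")

print(utf16le_bytes.hex())  # 390c   (low byte first: little endian)
print(utf8_bytes.hex())     # e0b0b9 (three bytes in UTF-8)

# Decoding with the matching encoding gives back the same string.
assert utf16le_bytes.decode("utf-16-le") == text
assert utf8_bytes.decode("utf-8") == text
```

The same text yields different byte arrays depending on the target encoding, which is why asking for "the encoding of a string" only makes sense once the string has been turned into bytes.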
Now the detection of encodings in byte arrays is an interesting question, one that does not have a good managed answer at this time.
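A small sketch (again in Python, as an illustration and not a .NET API) of why detection on byte arrays is hard: the same byte is perfectly valid in more than one code page, and nothing in the data says which one was meant. The sample byte and the two code pages are chosen arbitrarily for the example.

```python
# One byte, two equally "valid" interpretations. Nothing in the byte array
# itself records which code page produced it.
raw = b"\xe9"

as_western = raw.decode("cp1252")   # Windows Western European
as_cyrillic = raw.decode("cp1251")  # Windows Cyrillic

print(hex(ord(as_western)))   # 0xe9  -> LATIN SMALL LETTER E WITH ACUTE
print(hex(ord(as_cyrillic)))  # 0x439 -> CYRILLIC SMALL LETTER SHORT I
```

Any detection scheme has to guess between readings like these from context, which is why it is a heuristic at best.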
- how can we at all convert a string from one encoding to another encoding (like in the example at http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemtextencodingclasstopic.asp), since I read that .NET internally stores all strings as UTF-16 in memory? (or do I confuse codepage with encoding?)
Well, encodings are the classes that make use of code pages -- so they are in essence the same thing. The code pages are essentially mappings between Unicode and various non-Unicode encodings -- the "String" end is UTF-16 LE and the "byte" end is the other encoding, whatever it is.
So you can never convert directly from one encoding to another -- but you can use Unicode as a pivot, going from one encoding to Unicode and then to another encoding, if you want to.
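The pivot idea can be sketched in a few lines. This is a Python illustration, not the framework's implementation; the helper name convert_via_unicode and the sample data are hypothetical. In .NET the corresponding pieces would be an Encoding's GetString and GetBytes methods, or the static Encoding.Convert, which takes a source encoding, a destination encoding, and a byte array.

```python
def convert_via_unicode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another, pivoting through Unicode."""
    text = data.decode(source)   # step 1: source encoding -> Unicode string
    return text.encode(target)   # step 2: Unicode string -> target encoding


# "café" as stored in Windows code page 1252 (one byte for the accented e).
cp1252_bytes = b"caf\xe9"
utf8_bytes = convert_via_unicode(cp1252_bytes, "cp1252", "utf-8")

print(utf8_bytes.hex())  # 636166c3a9 (the accented e is now two bytes)
```

There is no direct byte-to-byte table in this sketch: every conversion is two mappings, each with Unicode on one end.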
- What happens behind the scenes when loading a database value from NVARCHAR into System.String? Is there any encoding/codepage-conversion done?
In SQL Server 7.0, 2000, or 2005, the NTEXT, NVARCHAR, and NCHAR types also represent UTF-16 LE, so no conversion needs to be done.
If you come from a TEXT, VARCHAR, or CHAR column, then a conversion must be done. Both SQL Server and the .NET Framework have support for that conversion (SQL Server uses the MultiByteToWideChar function, and the .NET Framework uses an Encoding object based on the appropriate code page value).
To be honest, I am not sure which product does the actual conversion in this case, but it is pretty much the same operation, either way. :-)
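As a rough illustration of the difference between the two column families, the Python sketch below simulates the raw bytes of an NVARCHAR value and a VARCHAR value. It does not talk to SQL Server, and the byte values, variable names, and the choice of code page 1252 for the VARCHAR column are assumptions made for the example.

```python
# Simulated raw column bytes (assumption: no real database is involved).
nvarchar_raw = "Gr\u00fc\u00dfe".encode("utf-16-le")  # N-types: UTF-16 LE
varchar_raw = "Gr\u00fc\u00dfe".encode("cp1252")      # non-N types: code page

# NVARCHAR: the bytes are already UTF-16 LE code units, so reading them
# needs no code page at all.
from_nvarchar = nvarchar_raw.decode("utf-16-le")

# VARCHAR: the bytes only make sense together with the column's code page.
from_varchar = varchar_raw.decode("cp1252")

# The wrong code page decodes without error here but gives different text.
wrong_guess = varchar_raw.decode("cp1251")

print(from_nvarchar == from_varchar)  # True
print(wrong_guess == from_varchar)    # False
```

Whichever side performs the VARCHAR conversion, it has to know the code page; the NVARCHAR path has nothing to look up.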
If you have any other questions about encodings/code pages, you can ask them here. You can also take a look at Shawn Steele's blog (he is the dev. owner of the encoding stuff in Windows and the .NET Framework).
This post brought to you by "హ" (U+0c39, a.k.a. TELUGU LETTER HA)
A letter that can only be found on one code page on Windows -- ISCII 57005 -- unless you count UTF-8 and GB-18030, which support all of Unicode....