Encoding questions from the Suggestion Box

by Michael S. Kaplan, published on 2005/05/14 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/14/417384.aspx


Uwe Keim asked (in the suggestion box) several questions about encoding support and strings in the .NET Framework:

Michael, I need your knowledge about how all this encoding stuff on strings works in .NET (I searched for hours without finding useful things).

Maybe you can enlighten me a little bit. Maybe I can motivate you with some Italian Limonata? :-)

Well, no Limonata needed (I already have a big stack of cases of it <grin>). I thought I would answer the questions here even without the generous bribe :-)

- how do I detect the encoding of the content of a System.String?

Ah, that part is easy -- System.String is always assumed to be UTF-16 LE (little endian). Always. If you convert it to anything else, you get a byte array in the encoding you convert to.
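As a rough illustration of that point, here is a minimal sketch that encodes the same string with different Encoding objects and shows the resulting bytes as hex. The class name StringBytesSketch and the helper HexOf are made up for this example; Encoding.GetBytes and BitConverter.ToString are the real framework calls.

```csharp
using System;
using System.Text;

static class StringBytesSketch
{
    // Hypothetical helper: encode a string with the given Encoding
    // and render the resulting byte array as hyphen-separated hex.
    public static string HexOf(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text); // String (UTF-16) -> bytes in the target encoding
        return BitConverter.ToString(bytes);
    }
}
```

For U+00E9 (LATIN SMALL LETTER E WITH ACUTE), HexOf("\u00E9", Encoding.Unicode) gives "E9-00" (UTF-16 LE) and HexOf("\u00E9", Encoding.UTF8) gives "C3-A9". The string is the same in both cases; only the byte array differs.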

Now this is a change from the old world of VB and VBA and VBScript, where people would often use the String type to store non-Unicode string things, occasionally getting into trouble when conversions happened. But it is how things are now.

Now the detection of encodings in byte arrays is an interesting question, one that does not have a good managed answer at this time.
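The one narrow case that is easy to handle is a byte array that starts with a byte order mark. The sketch below checks for the BOMs of a few Unicode encodings using Encoding.GetPreamble; the class name BomSniffSketch and the method DetectByBom are hypothetical, and this is not a general detector. It says nothing about data without a BOM or about legacy code pages, and it would misread UTF-16 LE text whose first character after the BOM is U+0000 as UTF-32.

```csharp
using System.Text;

static class BomSniffSketch
{
    // Hypothetical helper: returns the Unicode encoding whose BOM prefixes
    // the data, or null when no known BOM is present.
    public static Encoding DetectByBom(byte[] data)
    {
        // UTF-32 LE is checked before UTF-16 LE because its BOM (FF FE 00 00)
        // begins with the UTF-16 LE BOM (FF FE).
        Encoding[] candidates =
        {
            Encoding.UTF8, Encoding.UTF32, Encoding.Unicode, Encoding.BigEndianUnicode
        };

        foreach (Encoding candidate in candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (preamble.Length > 0 && StartsWith(data, preamble))
            {
                return candidate;
            }
        }
        return null; // no BOM found: the encoding is still unknown
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```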

- how can we at all convert a string from one encoding to another encoding (like in the example at http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemtextencodingclasstopic.asp), since I read that .NET internally stores all strings as UTF-16 in memory? (or do I confuse codepage with encoding?)

Well, encodings are the classes that make use of code pages -- so they are in essence the same thing. The code pages are essentially mappings between Unicode and various non-Unicode encodings -- the "String" end is UTF-16 LE and the "byte" end is the other encoding, whatever it is.

So you can never convert directly from one encoding to another -- but you can use Unicode as a pivot -- going from one encoding to Unicode and then from Unicode to the other encoding, if you want to.
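The pivot can be sketched in a few lines: decode the source bytes into a String, then encode that String with the target Encoding. The class name PivotSketch and the method Transcode are hypothetical names for this example. The framework's own Encoding.Convert(srcEncoding, dstEncoding, bytes) does the same job in one call, presumably by the same two-step route.

```csharp
using System.Text;

static class PivotSketch
{
    // Hypothetical helper: re-encode bytes from one encoding to another,
    // using a UTF-16 String as the pivot in between.
    public static byte[] Transcode(byte[] input, Encoding source, Encoding target)
    {
        string pivot = source.GetString(input); // source bytes -> UTF-16 String
        return target.GetBytes(pivot);          // UTF-16 String -> target bytes
    }
}
```

For example, passing the single ISO-8859-1 byte 0xE9 with Encoding.GetEncoding("iso-8859-1") as the source and Encoding.UTF8 as the target yields the two bytes C3 A9. Characters that the target encoding cannot represent are replaced according to that encoding's fallback behavior, so the round trip is not always lossless.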

- What happens behind the scenes when loading a database value from NVARCHAR into System.String? Is there any encoding/codepage-conversion done?

In SQL Server 7.0, 2000, or 2005, the NTEXT, NVARCHAR, and NCHAR types also represent UTF-16 LE, so no conversion needs to be done.

If you come from a TEXT, VARCHAR, or CHAR column, then a conversion must be done. Both SQL Server and the .NET Framework have support for that conversion (SQL Server uses the MultiByteToWideChar function, and the .NET Framework uses an Encoding object based on the appropriate code page value).

To be honest, I am not sure which product does the actual conversion in this case, but it is pretty much the same operation, either way. :-)

If you have any other questions about encodings/code pages, you can ask them here. You can also take a look at Shawn Steele's blog (he is the dev. owner of the encoding stuff in Windows and the .NET Framework).

 

This post brought to you by "హ" (U+0c39, a.k.a. TELUGU LETTER HA)
A letter that only one code page on Windows -- ISCII 57005 -- supports, unless you count UTF-8 and GB-18030, which support all of Unicode....


# bg on 14 May 2005 4:16 PM:

i had a strange run in with an Encoder (UTF8Encoding) yesterday.

i had some bytes in an array, and needed to convert them to a string so i did this:

char[] chars = Encoding.UTF8.GetChars(bytes);
string xml = new string(chars);

and to my surprise the string started with a BOM. this is really annoying because i can't find anything in the docs that would indicate this would happen. (mind u i haven't looked that hard :))

its easy to fix: ... xml = new string(chars,1,chars.Length-1);

but still annoying!

bg

# Michael S. Kaplan on 14 May 2005 5:07 PM:

If there is a BOM in the UTF-8, it should still be there after conversion, right?

# bg on 14 May 2005 5:40 PM:

encoding problem:

actually thinking about it, it must be me that's writing crap code. I thought the BOM was being introduced by the GetChars call, but it can't be, because it would push all my chars down in the string and truncate it - which it wasn't - the BOM is being introduced further up in my code. The actual "string" is being built up by an XmlWriter that's wrapped around a MemoryStream, and it must be adding a BOM to the byte array i'm getting out of it (via a call to MemoryStream.Read).

sorry my fault - me being stoopid.

i've just got back from the theatre (Kevin spacey in Philadelphia story - very good) And it's taking a while to get the mind back into gear - don't ever get old ;)

bg

# Michael S. Kaplan on 14 May 2005 6:49 PM:

No worries, bg. Glad you were able to track down the cause.

# Uwe Keim on 15 May 2005 1:52 AM:

Thank you very much for your answers!

# Dean Harding on 15 May 2005 2:46 AM:

I would assume that the conversion from TEXT, CHAR and VARCHAR to unicode actually happens at the database, since the db libraries are all COM and COM is Unicode as well.

But that's just a guess...

# Michael S. Kaplan on 15 May 2005 10:31 AM:

Hi Dean,

Ordinarily I would agree, but the one thing that made me wonder would be the scenario of a "native" managed provider -- would it be required to let the database do it? Or would it have the freedom to do it itself, later?

I do not know enough about the new data provider model's internals....
