by Michael S. Kaplan, published on 2006/04/17 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/17/577325.aspx
Reader J. Daniel Smith asked the following in the Suggestion Box:
I'm wondering about "invalid" strings in .NET; I searched through your archives and didn't find anything exactly on-point, maybe I didn't search long enough...
It's straightforward to create a semantically invalid string:
char high = '\ud801'; // high: d800-dbff
char low = '\udcff'; // low: dc00-dfff
char[] chars = new char[8];
chars[0] = high;
chars[1] = low;
chars[2] = high;
chars[3] = high; // invalid
chars[4] = low; // invalid
chars[5] = low;
chars[6] = low; // invalid
chars[7] = high; // invalid
byte[] bytes = Encoding.Unicode.GetBytes(chars);
string s1 = Encoding.Unicode.GetString(bytes);
string s2 = new String(chars);
Where and when can such a string create problems?
The reason this comes about is a bit more practical than the contrived sample; for example converting the value of RSAPKCS1SignatureFormatter.CreateSignature() to a string:
byte[] bytes = ...; // say RSAPKCS1SignatureFormatter.CreateSignature()
string s = Encoding.Unicode.GetString(bytes);
Eventually, the string will get converted to Base64 to make it easy to move around (via email, view in Notepad, etc.), but before that happens I want to combine it with other strings (but no sorting, searching, casing, etc.). Thus, I don't want to immediately convert the byte[] to Base64 and then add another Base64 conversion on top of that.
I've tried things like StringInfo.ParseCombiningCharacters() in an attempt to get the invalid string to "fail" somehow, but that seems to work fine. I'm sure strict Unicode semantics are enforced at some point, but where & when? If I just use a string for internal data (and say avoid displaying it to the user which would involve fonts, etc.) do I need to be concerned with Unicode semantics?
This is a topic about which I have not yet posted, so he wasn't missing anything. :-)
(By the way, you know the old saw about the guy who quit the patent office since there was nothing new that would be invented? It isn't really true, as the guy who said it was speaking against that point of view. There will always be some new internationalization issue to post about!)
If you look at the Encoding class, there is one way that may perhaps be prized above all others, especially starting in the 2.0 version of the .NET Framework -- the UnicodeEncoding class. The only two things that conversions in this "code page" will do are:
It has the additional advantage that you can use the new fallback support in 2.0 to find out the exact location and nature of problems, and to completely customize what happens in those cases...
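As a rough illustration of the point above (not the custom fallback sample mentioned in this post), here is a minimal sketch that uses only the built-in exception fallback. It assumes the three-argument UnicodeEncoding constructor with throwOnInvalidBytes set to true. The class and method names (InvalidUtf16Demo, IsWellFormed, RoundTrips) are made up for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper class; the names here are illustrative only.
public static class InvalidUtf16Demo
{
    // Little-endian UTF-16, no byte order mark, and throwOnInvalidBytes = true,
    // so invalid data raises an exception instead of being replaced.
    private static readonly UnicodeEncoding Strict =
        new UnicodeEncoding(false, false, true);

    // Returns true if every surrogate in s is properly paired.
    public static bool IsWellFormed(string s)
    {
        try
        {
            Strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // An unpaired surrogate was found. The exception's Index property
            // reports where in the input the problem was detected.
            return false;
        }
    }

    // Returns true if bytes -> string -> bytes through the default
    // Encoding.Unicode (which uses a replacement fallback) gives back
    // exactly the same bytes.
    public static bool RoundTrips(byte[] bytes)
    {
        string s = Encoding.Unicode.GetString(bytes);
        byte[] back = Encoding.Unicode.GetBytes(s);
        if (back.Length != bytes.Length)
        {
            return false;
        }
        for (int i = 0; i < bytes.Length; i++)
        {
            if (back[i] != bytes[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With these helpers, IsWellFormed("\ud801\udcff") accepts the valid pair, while IsWellFormed("\ud801\ud801") rejects the two consecutive high surrogates. RoundTrips(new byte[] { 0x01, 0xD8 }) reports that a lone high surrogate does not survive the default conversion, which is why arbitrary binary data such as a signature is not safe to carry around as a UTF-16 string.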
I probably ought to put together a sample of a custom encoder/decoder fallback, though I'll probably just try and convince Shawn to do one and then link to his when he does it, instead. He coded the feature, after all. :-)
This post brought to you by "ģ" (U+0123, a.k.a. LATIN SMALL LETTER G WITH CEDILLA)