Some strings need to feel validated

by Michael S. Kaplan, published on 2006/04/17 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/17/577325.aspx

Reader J. Daniel Smith asked the following in the Suggestion Box:

I'm wondering about "invalid" strings in .NET; I searched through your archives and didn't find anything exactly on-point, maybe I didn't search long enough...

It's straightforward to create a semantically invalid string:
char high = '\ud801'; // high: d800-dbff
char low = '\udcff'; // low: dc00-dfff
char[] chars = new char[8];
chars[0] = high;
chars[1] = low;

chars[2] = high;
chars[3] = high; // invalid

chars[4] = low; // invalid
chars[5] = low;

chars[6] = low; // invalid
chars[7] = high; // invalid

byte[] bytes = Encoding.Unicode.GetBytes(chars);
string s1 = Encoding.Unicode.GetString(bytes);
string s2 = new String(chars);
Where and when can such a string create problems?

The reason this comes about is a bit more practical than the contrived sample; for example converting the value of RSAPKCS1SignatureFormatter.CreateSignature() to a string:
byte[] bytes = ...; // say RSAPKCS1SignatureFormatter.CreateSignature()
string s = Encoding.Unicode.GetString(bytes);
Eventually, the string will get converted to Base64 to make it easy to move around (via email, view in Notepad, etc.), but before that happens I want to combine it with other strings (but no sorting, searching, casing, etc.). Thus, I don't want to immediately convert the byte[] to Base64 and then add another Base64 conversion on top of that.

I've tried things like StringInfo.ParseCombiningCharacters() in an attempt to get the invalid string to "fail" somehow, but that seems to work fine. I'm sure strict Unicode semantics are enforced at some point, but where & when? If I just use a string for internal data (and say avoid displaying it to the user which would involve fonts, etc.) do I need to be concerned with Unicode semantics?

This is a topic about which I have not yet posted, so he wasn't missing anything. :-)

(By the way, you know the old saw about the guy who quit the patent office since there was nothing new that would be invented? It isn't really true, as the guy who said it was speaking against that point of view. There will always be some new internationalization issue to post about!)

If you look at the Encoding class, there is one way to check for such strings that may perhaps be prized above all others, especially starting in the 2.0 version of the .NET Framework -- the UnicodeEncoding class. The only two things that conversions in this "code page" will do are:

Move text between a byte array and a string, and
Validate the data to make sure it does not contain one of these illegal sequences
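
As a rough sketch of that validation (an illustration, not code from this post): the class name SurrogateValidation and both method names are made up, and the behavior assumed is that of .NET Framework 2.0 or later. The strict encoding is built with throwOnInvalidBytes set to true, while the stock Encoding.Unicode quietly substitutes U+FFFD for any surrogate it cannot pair up.

```csharp
using System;
using System.Text;

static class SurrogateValidation
{
    // Hypothetical helper: uses a strict UnicodeEncoding purely as a validator.
    // Constructor arguments: bigEndian = false, byteOrderMark = false,
    // throwOnInvalidBytes = true.
    public static bool IsWellFormed(string text)
    {
        UnicodeEncoding strict = new UnicodeEncoding(false, false, true);
        try
        {
            strict.GetBytes(text);   // the bytes themselves are discarded
            return true;
        }
        catch (EncoderFallbackException)
        {
            // An unpaired surrogate was found somewhere in the text.
            return false;
        }
    }

    // Hypothetical helper: round trip through the default (lenient) encoding,
    // which replaces unpaired surrogates instead of throwing.
    public static string RoundTripLenient(string text)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(text);
        return Encoding.Unicode.GetString(bytes);
    }
}
```

With the reader's sample, IsWellFormed(new string(chars)) would be expected to return false, since chars[2] is a high surrogate with no low surrogate after it.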

UnicodeEncoding has the additional advantage that you can use the new fallback support in 2.0 to find out the exact location and nature of problems and to completely customize what happens in those cases....
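
Short of writing a custom fallback, one simple way to get at the location is to let the exception fallback throw and read the details off the exception. A minimal sketch, assuming .NET Framework 2.0 or later; SurrogateLocator and TryFindProblem are hypothetical names, and only the first problem is reported:

```csharp
using System;
using System.Text;

static class SurrogateLocator
{
    // Hypothetical helper: returns true and fills in the details when the
    // strict encoding rejects the input; returns false when the input is valid.
    public static bool TryFindProblem(char[] chars, out int index, out char unknown)
    {
        index = -1;
        unknown = '\0';

        // throwOnInvalidBytes = true selects the exception fallback.
        UnicodeEncoding strict = new UnicodeEncoding(false, false, true);
        try
        {
            strict.GetBytes(chars);
            return false;
        }
        catch (EncoderFallbackException e)
        {
            index = e.Index;           // position reported by the encoder
            unknown = e.CharUnknown;   // the char that could not be encoded
            return true;
        }
    }
}
```

Going the other direction (bytes to string), DecoderFallbackException carries the same kind of information through its Index and BytesUnknown properties.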

I probably ought to put together a sample of a custom encoder/decoder fallback, though I'll probably just try and convince Shawn to do one and then link to his when he does it, instead. He coded the feature, after all. :-)

This post brought to you by "ģ" (U+0123, a.k.a. LATIN SMALL LETTER G WITH CEDILLA)

# Marc C Brooks on 17 Apr 2006 12:14 PM:

While your comments about the Encoding class are (no doubt!) correct, I think you've missed the point. He should NOT be converting the byte array to a string AT ALL. Never. Ever. It's just wrong.

Rather, he should concatenate the various byte arrays that he has (down-converting any strings to byte arrays first, as needed). When he's ALL DONE building up this lovely sequence of bytes that have no relationship to a unicode string, he should then BASE64 encode the final array.

At the other end, he again converts from the BASE64 string to a byte array. This can then be decomposed appropriately into other byte array slices, and THOSE can be handed to Encoding to get real string values.

This ensures that he never gets bitten (rightly, justly, and fairly) by the encoding not liking his arbitrary sequence of bytes; which is not, never will be, but might seem to be coincident with a unicode string.
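
A minimal sketch of the approach described in this comment, for illustration only. PayloadPacker, Pack and Unpack are hypothetical names, and the framing (a 4-byte length written before each part) is an assumption added so the receiving end knows where to slice; the comment itself does not say how the parts are delimited.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class PayloadPacker
{
    // Concatenates the parts (each preceded by its length) and
    // Base64-encodes the result exactly once.
    public static string Pack(params byte[][] parts)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            foreach (byte[] part in parts)
            {
                writer.Write(part.Length);   // assumed framing: 4-byte length
                writer.Write(part);
            }
            writer.Flush();
            return Convert.ToBase64String(stream.ToArray());
        }
    }

    // Reverses Pack: decodes the Base64 text and slices it back into parts.
    public static List<byte[]> Unpack(string base64)
    {
        List<byte[]> parts = new List<byte[]>();
        using (BinaryReader reader = new BinaryReader(new MemoryStream(Convert.FromBase64String(base64))))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                int length = reader.ReadInt32();
                parts.Add(reader.ReadBytes(length));
            }
        }
        return parts;
    }
}
```

Real text goes in as Encoding.UTF8.GetBytes(text) and comes back out with Encoding.UTF8.GetString(part); the signature bytes never pass through an Encoding at all.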

# Mihai on 17 Apr 2006 12:40 PM:

100% agree with IDisposable. When one has a random sequence of bytes, which do not represent a string, then it is not correct to convert it to string.

# Michael S. Kaplan on 17 Apr 2006 12:59 PM:

I got the point, I just morphed the question into one that I thought was more important to answer....

I made the other point about not using strings to hold non-string data in a different post (http://blogs.msdn.com/michkap/archive/2005/09/27/474568.aspx) already. :-)

But most of this question was purposely constructing invalid surrogate sequences, which is not binary data but intentionally invalid strings.....

# Maurits [MSFT] on 17 Apr 2006 2:32 PM:

> concatenate the various byte arrays

Is there an overload of "+" for byte arrays?

# J. Daniel Smith on 18 Apr 2006 2:52 PM:

IDisposable: I was trying to figure out a way to not base64 encode data that is already base64 encoded; but I think I determined it wasn't possible in my situation.

Using invalid surrogates was just to have code to show a known invalid string. Something like random bytes wouldn't necessarily always result in an invalid string.

http://blogs.msdn.com/michkap/archive/2005/09/27/474568.aspx might be the blog entry I didn't find.

# Michael S. Kaplan on 18 Apr 2006 4:08 PM:

You should not ever have to Base64 encode data that is already Base64 encoded --- but OTOH you should *never* be sending Base64 encoded strings through the encoding classes?
