by Michael S. Kaplan, published on 2005/09/27 16:50 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/27/474568.aspx
A few years ago, someone inside Microsoft asked about storing binary data in a string. Since they were using VB, I pointed out the many parts of VB (the intrinsic controls, the built-in functions, the Win32 and other API calls, and more) that could corrupt the binary data.
In fact, VB's easy conversions between byte arrays and strings are in many cases somewhat evil, given the chance of data corruption not too long after the conversion.
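To make that kind of corruption concrete, here is a minimal sketch in Python rather than VB. It assumes the conversion passes through a single-byte code page, and uses Python's `cp1252` codec as a stand-in; the helper name `round_trip_via_code_page` is made up for the illustration, and the exact bytes that get damaged on a real Windows/VB system will differ in detail.

```python
# Illustrative only: Python's cp1252 codec stands in for "whatever code page
# the bytes <-> string conversion goes through". This is an assumption, not
# a reproduction of what VB does internally.

def round_trip_via_code_page(data: bytes, code_page: str = "cp1252") -> bytes:
    """Bytes -> string -> bytes, substituting anything the code page cannot map."""
    text = data.decode(code_page, errors="replace")
    return text.encode(code_page, errors="replace")

ascii_only = b"plain text"
binary = bytes([0x00, 0x41, 0x81, 0xFF])  # 0x81 has no mapping in Python's cp1252

# Text-like data happens to survive, which is what makes the trick look safe.
ascii_survived = round_trip_via_code_page(ascii_only) == ascii_only

# Arbitrary binary data does not: the unmappable byte comes back as '?'.
binary_survived = round_trip_via_code_page(binary) == binary

print("ASCII survived: ", ascii_survived)
print("binary survived:", binary_survived)
```

The uncomfortable part is that nothing fails loudly here. The data simply comes back different.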
They were insistent that none of that would happen.
So I finally gave in and said maybe it will be fine, only to have someone else point out a fairly elemental issue -- that a UTF-16 string always has an even number of bytes, while binary data may not. Plus a bunch of other reasons why this particular misuse of strings could really be a problem.
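The even-byte-count issue is easy to see in a small sketch. Again this is Python, not VB, and `try_as_utf16` is a hypothetical helper. One caveat: Python's strict UTF-16 decoder rejects bad input up front, whereas a string type that does not validate its contents would store the same values without complaint and leave the problem for a later conversion to find.

```python
def try_as_utf16(data: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-16LE."""
    try:
        return data.decode("utf-16-le")
    except UnicodeDecodeError:
        return None

odd_length = b"\x01\x02\x03"           # three bytes: one and a half code units
lone_surrogate = b"\x00\xd8\x41\x00"   # 0xD800 (unpaired surrogate), then 'A'

# Neither of these is a valid UTF-16 string.
odd_result = try_as_utf16(odd_length)
surrogate_result = try_as_utf16(lone_surrogate)

# Padding the odd byte "fixes" the decode but changes the data:
# the round trip hands back four bytes where there were three.
padded = (odd_length + b"\x00").decode("utf-16-le").encode("utf-16-le")

print("odd length decodes:    ", odd_result is not None)
print("lone surrogate decodes:", surrogate_result is not None)
print("padded round trip equals original:", padded == odd_length)
```

So even before any conversion out of Unicode happens, some binary payloads have no faithful representation as a string, and the original length has to be tracked somewhere else.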
You know what? That person was right.
Then, a few years later, the question came up again, this time in VBScript. Of course everything is a Variant there, but people had problems where they wanted to try to interact with the bytes. So could they put it into a String and then use functions like AscB, LenB, and so on to work with it?
I honestly forgot about the earlier conversation.
So I gave the same warnings, and they insisted that no operations involving conversion out of Unicode would be happening. And I said cautiously that if they followed those rules, then it might not hurt the data too much.
Luckily, Eric Lippert saved me from myself and once again pointed out why it was a bad idea.
I spent a little time trying to understand why I had forgotten the earlier conversation, and why the lessons I learn myself are so much easier to remember than the ones someone else points out. It was not embarrassment at being wrong or anything like that -- being wrong is how one learns!
Maybe it would be easier to remember if I did get embarrassed when I was wrong -- I never have trouble remembering times that I am embarrassed, after all. It is almost like my memory requires some kind of emotional tag -- good or bad -- to be effective.
Anyway, yesterday Shawn pointed out that you do not want to treat binary data like a string, and I did not have any problem with the advice; it all makes sense. But if you look at his post, the people he is talking about are specifically trying to treat the binary data as a string and convert it out of Unicode, which is something I have warned against any time it came up. So I realized that I may not have learned anything yet, at least not enough to have a deeply internalized answer for the people who insist they will avoid converting the data.
So I finally did internalize a particular notion to cover both the practical and the theoretical aspects of the problem: it is not that strings are sacred or anything; they never were. It is that data is sacred.
Or at the very least that corrupting data (of any sort) is profane. :-)
We'll see if thinking that way will help me remember the next time this comes up (of course the act of posting this to my blog may have contaminated the process; posting to the blog might make it memorable, too!).