Bytes and Characters and bugs and W's

by Michael S. Kaplan, published on 2010/06/20 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/06/20/10027607.aspx


There are times that I am very happy for some of the eccentricities of the way I look at code.

And the way they keep me from certain kinds of bugs.

Like a few hours ago, when I happened to spot Matthew Wilson's blog post "memset() Considered Harmful - especially to those who (think they) know what they're doing!"

Just the title at first.

And two things immediately came to mind.

The first thing?

I know the problem he ran into.

Did you see it too? Before you clicked on the link, I mean. :-)

The second thing that came to mind?

Too bad he didn't use wmemset; it would have saved some time here!

The truth is that one of the biggest sources of bugs, if one moves a lot between Unicode and non-Unicode programming, is byte/character count problems. In fact, that is one of the great things about wmemset: it (and its ilk) takes a count of wide characters rather than bytes, which takes that particular variable out of the equation even if one doesn't happen to have std::fill_n() at one's disposal. :-)
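To make the byte/character mismatch concrete, here is a minimal sketch. It is not the code from Matthew's post; the helper names (fill_pattern, clear_buggy, and so on) are hypothetical and exist only for illustration. It contrasts a memset() call that is handed a character count with the variants that get the count right.

```cpp
#include <algorithm>  // std::fill_n
#include <cstddef>    // std::size_t
#include <cstring>    // std::memset
#include <cwchar>     // std::wmemset

// Hypothetical helper: fill the buffer with a non-zero marker character.
void fill_pattern(wchar_t* buf, std::size_t count) {
    std::wmemset(buf, L'x', count);
}

// Hypothetical helper: how many leading elements of the buffer are zero.
std::size_t zeroed_prefix(const wchar_t* buf, std::size_t count) {
    std::size_t n = 0;
    while (n < count && buf[n] == L'\0') ++n;
    return n;
}

// The bug: memset() wants a BYTE count but is given a CHARACTER count,
// so only count / sizeof(wchar_t) characters are actually cleared.
void clear_buggy(wchar_t* buf, std::size_t count) {
    std::memset(buf, 0, count);
}

// Correct, but only because the caller remembered to convert to bytes.
void clear_bytes(wchar_t* buf, std::size_t count) {
    std::memset(buf, 0, count * sizeof(wchar_t));
}

// wmemset() takes a count of wide characters, so no conversion is needed.
void clear_chars(wchar_t* buf, std::size_t count) {
    std::wmemset(buf, L'\0', count);
}

// std::fill_n() counts elements too, whatever the element type is.
void clear_fill_n(wchar_t* buf, std::size_t count) {
    std::fill_n(buf, count, L'\0');
}
```

Calling fill_pattern() on an 8-character buffer and then clear_buggy() leaves only the first 8 / sizeof(wchar_t) characters zeroed (4 where wchar_t is two bytes, 2 where it is four), while the other three variants clear all 8.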

My trial by fire for all of this was MSLU (the Microsoft Layer for Unicode); the constant need, on a per-function basis, to be thinking about both the Unicode and the non-Unicode side kept me on my toes and really drilled into me the need to think very carefully about buffer sizes. It also let me feel smugly superior to all the Win9x code that occasionally popped up with "bugs" related to non-Unicode buffers that were twice the size they needed to be, except on the CJK versions, where those were the expected buffer sizes (and no, they were not thinking ahead brilliantly; they were just messing up byte/character counts in WideCharToMultiByte calls!).

Now there are some flaws in the docs for wmemset and its ilk that I just noticed, like the security warning, which really covers two different problems for memset() and wmemset() and therefore deserves a bit of wordsmithing beyond a generic pointer to warnings about avoiding buffer overruns.

And the suggestion that the .NET Framework equivalent for memset() and wmemset() is System::Buffer::SetByte?

That's a keeper, for sure. I mean, what better way to introduce byte/character mismatches into the .NET world so elegantly than that method and a cast or two? :-)

Now of course the viewpoint doesn't make me invulnerable to all bugs; it just makes a certain class of bug a lot less likely....


Seth on 20 Jun 2010 1:11 PM:

Even using wchar_t based functions isn't a general solution, since it only works with characters that are in the BMP.

Michael S. Kaplan on 20 Jun 2010 1:23 PM:

Well, if you're gonna drag user characters into it then it's not just supplementary characters, it's variation selectors and nonspacing characters and other grapheme cluster causers as well! :-)

But for the low level, avoiding security bugs on whole strings one knows to be valid, wmemset does okay....

