Bytes and Characters and bugs and W's

by Michael S. Kaplan, published on 2010/06/20 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/06/20/10027607.aspx

There are times that I am very happy for some of the eccentricities of the way I look at code.

The truth is that one of the biggest sources of bugs, if one moves a lot between Unicode and non-Unicode programming, is byte/character count problems. In fact, that is one of the great things about wmemset: it (and its ilk) takes its count in characters rather than bytes, which takes that particular variable out of the equation even if one doesn't happen to have std::fill_n() at one's disposal. :-)
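To make the count difference concrete, here is a minimal sketch (mine, not from any documentation). The buffer size kCount and the helper names are hypothetical, and the last function is deliberately buggy; it assumes sizeof(wchar_t) is greater than 1, which holds on the usual platforms.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical buffer size, counted in characters (elements), not bytes.
constexpr std::size_t kCount = 8;

// Returns true if every one of the count elements in buf equals ch.
bool all_equal(const wchar_t* buf, std::size_t count, wchar_t ch) {
    assert(buf != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        if (buf[i] != ch) return false;
    }
    return true;
}

// wmemset takes its count in wchar_t units, so the character count just works.
bool fill_with_wmemset() {
    wchar_t buf[kCount];
    std::wmemset(buf, L'x', kCount);
    return all_equal(buf, kCount, L'x');
}

// std::fill_n also counts elements, whatever the element type is.
bool fill_with_fill_n() {
    wchar_t buf[kCount];
    std::fill_n(buf, kCount, L'x');
    return all_equal(buf, kCount, L'x');
}

// memset counts BYTES. Handing it the character count (the deliberate bug
// here) clears only part of a wchar_t buffer when sizeof(wchar_t) > 1.
bool memset_with_char_count_covers_buffer() {
    wchar_t buf[kCount];
    std::wmemset(buf, L'x', kCount);
    std::memset(buf, 0, kCount);  // should have been kCount * sizeof(wchar_t)
    return all_equal(buf, kCount, L'\0');
}
```

The first two functions return true; the third returns false under the stated assumption, because the tail of the buffer still holds L'x'.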

My trial by fire for all of this was MSLU; the constant need, on a per-function basis, to be thinking about both the Unicode and the non-Unicode side kept me on my toes and really drilled into me the need to think very, very carefully about buffer sizes. It also let me feel smugly superior to all the Win9x code that tended to pop up with occasional "bugs" involving non-Unicode buffers that were twice the size they needed to be, except on the CJK versions, where those were the expected buffers (and no, they were not thinking ahead brilliantly; they were just messing up byte/character counts in WideCharToMultiByte calls!).
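One habit that helps with this kind of mismatch is keeping the two counts as visibly separate things. The sketch below is only an illustration; element_count and byte_count are hypothetical helper names, not anything from MSLU or the Win32 headers.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements (characters) in a fixed-size array.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) {
    return N;
}

// Number of bytes occupied by a fixed-size array.
template <typename T, std::size_t N>
constexpr std::size_t byte_count(const T (&)[N]) {
    return N * sizeof(T);
}

// For char buffers the two counts coincide, which is exactly why mixing
// them up can go unnoticed in non-Unicode code.
static_assert(sizeof(char) == 1, "char is one byte by definition");
```

For a `char narrow[16]` both helpers return 16, while for a `wchar_t wide[16]` byte_count returns 16 * sizeof(wchar_t). Passing one where an API expects the other is how a buffer ends up twice the size it needed to be, or worse, half.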

Now there are some flaws in the docs for wmemset and its ilk that I just noticed, like the security warning, which really covers two different problems for memset() and wmemset() and therefore deserves a bit of wordsmithing beyond a generic pointer to warnings about avoiding buffer overruns.

And the suggestion that the .Net Framework equivalent for memset() and wmemset() is System::Buffer::SetByte?

That's a keeper, for sure. I mean what better way to introduce byte/character mismatches into the .Net world so elegantly than that method and a cast or two? :-)

Now of course the viewpoint doesn't make me invulnerable to all bugs; it just makes a certain class of bug a lot less likely....

Even using wchar_t based functions isn't a general solution, since it only works with characters that are in the BMP.

Well, if you're gonna drag user characters into it then it's not just supplementary characters, it's variation selectors and nonspacing characters and other grapheme cluster causers as well! :-)
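A small sketch of the distinction, using char16_t so that the code unit size is fixed at 16 bits regardless of platform. The function name count_code_points is hypothetical, and it only pairs up surrogates; it makes no attempt at grapheme clusters, which is rather the point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string by treating each high surrogate
// followed by a low surrogate as a single code point. An unpaired
// surrogate is counted as one code point on its own.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = (u >= 0xD800 && u <= 0xDBFF);
        const bool next_is_low = (i + 1 < s.size()) &&
                                 (s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF);
        if (high && next_is_low) {
            ++i;  // skip the low surrogate that completes this pair
        }
        ++n;
    }
    return n;
}
```

For U+1D11E (a supplementary character) the string has two code units and one code point. For "e" followed by U+0301 COMBINING ACUTE ACCENT it has two code units and two code points, yet a user still sees a single character, so counting code points does not settle the matter either.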

But for the low level, avoiding security bugs on whole strings one knows to be valid, wmemset does okay....
