Some people feel really insecure about the size of their [string] members

by Michael S. Kaplan, published on 2006/11/10 06:05 -05:00, original URI:

Developer Andrew Arnott asks:


David Kline recommended I forward my question on to you.  If you have time to point me at the appropriate Windows API I’d surely appreciate it.

Michael Howard (Writing Secure Code) and this blog both suggest that UTF-16 has some 32-bit characters.  In the .NET Compact Framework code that I work with, I see lots of code like this:

int cchLengthInCharacters = cbLengthInBytes / sizeof(WCHAR);
// or
int cbLengthInBytes = cchLengthInCharacters * sizeof(WCHAR);

Yet my understanding of Writing Secure Code and of this blog leads me to believe that if you have even one 32-bit Unicode character in there, you could be asking for a buffer overflow with code like this.  I think the book suggests a Windows API call that should be made to evaluate the true character length of a Unicode string safely, but I can’t remember what it is.

Can you comment on this, and if you know the API Windows makes available can you let me know which it is?



Hmmm, kind of a trick question, that. One that I have sort of covered a few times in the past....

There are really three possible ways to answer the question "how big is that string?". There is

  1. what a Unicode Win32 API thinks of as a character -- a single WCHAR -- the sort of thing you get back from the [unmanaged] lstrlenW or the [managed] System.String.Length;
  2. what the Unicode standard thinks of as a character -- a single text element -- the sort of thing you get back from the [unmanaged] CharNextW (when it works) or the [managed] System.Globalization.StringInfo class (discussed here) and that Mark Davis (president of the Unicode Consortium) thinks of as a grapheme cluster;
  3. what a typical user who reads and writes in a language thinks of as a character -- a single sort element -- the sort of thing that no Win32 function returns directly but indirectly the unmanaged and managed collation functions/methods use and which I have also talked about previously.

The simple truth is that I find the blog post by Tim Bray that concerned Andrew so much (the one entitled Characters vs. Bytes) to be very confusing, mainly because it takes advantage of the fact that these three definitions, plus a fourth one (the count of bytes), might be thought of as the same but truly are not. And it is written in a way that can easily freak out a person who is worried about security.

However, in the context of SECURITY, only the first and fourth definitions are relevant to the "count of characters/count of bytes" issue, and it is a bug to call anything else a secure answer. Surrogate pairs are ONE SMALL PART of this issue (and the most popular one for people who are trying to goose people on their assumptions in this space), but Ḝ (LATIN CAPITAL LETTER E + COMBINING CEDILLA + COMBINING BREVE) is three WCHARs, yet only a moron would not see that to most of the world this is one character, even though to developers who have to deal with buffer allocation and WCHAR counts and such it is really secretly three.
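To make the counts above concrete, here is a minimal sketch in portable C. It is an illustration only: the type name utf16_unit stands in for WCHAR, and the function and array names are hypothetical, not Win32 APIs. The code point counter is included purely for contrast; it is not one of the sizes that matters for buffer allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a 16-bit unsigned type standing in for the Windows WCHAR. */
typedef uint16_t utf16_unit;

/* Definition #1: code units before the terminating zero, the kind of
   count lstrlenW reports. */
size_t count_code_units(const utf16_unit *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Definition #4: bytes occupied by the text, terminator excluded. */
size_t count_bytes(const utf16_unit *s)
{
    return count_code_units(s) * sizeof(utf16_unit);
}

/* For contrast only: code points, treating a well-formed surrogate
   pair as one. This is NOT the number to size a buffer with. */
size_t count_code_points(const utf16_unit *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; s[i] != 0; i++) {
        int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        /* s[i + 1] is safe to read: at worst it is the terminator. */
        int next_is_low = s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (is_high && next_is_low)
            i++; /* skip the low half of the pair */
        n++;
    }
    return n;
}

/* E + COMBINING CEDILLA + COMBINING BREVE: one character to a reader. */
const utf16_unit e_cedilla_breve[] = { 0x0045, 0x0327, 0x0306, 0 };

/* U+1D11E MUSICAL SYMBOL G CLEF, encoded as a surrogate pair. */
const utf16_unit g_clef[] = { 0xD834, 0xDD1E, 0 };
```

For the combining sequence, the code unit count is 3 and the byte count is 6; for the surrogate pair they are 2 and 4. In both cases the byte count is exactly the code unit count times the size of one code unit, which is why the cch/cb arithmetic holds no matter how few "characters" a reader would see.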

You'd think, given

that these developers would readily embrace the two answers to the "how big is it?" question that provide the largest number. But it still comes up quite often as a security concern or as "proof" that UTF-16 is incomplete because of surrogate pairs needing more than one code unit, and the trend does not seem to indicate that the problem is getting any better. There are way too many developers who feel quite insecure about the size of their string members, and they really need to be reassured on this point!

I will be doing my part next week at the 30th Internationalization and Unicode conference, in my afternoon presentation. If you read here then I know that you do not have this sort of problem, but perhaps if you can be there you might learn some tricks to help others. :-)


This post brought to you by Ḝ (U+1e1c, a.k.a. LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE)

Wilhelm Svenselius on 10 Nov 2006 7:05 AM:

So what was the answer to the question? How _do_ you reliably find the number of bytes in a UTF-16 string?

My best guess would be System.Text.Encoding.Unicode.GetByteCount( string s ). "Unicode" actually means UTF-16 here according to MSDN.

Michael S. Kaplan on 10 Nov 2006 7:13 AM:

The number of bytes is actually the number of "characters" (by definition #1) * sizeof(WCHAR)....

If you are in managed code, it is string.Length * 2 -- since you know the length without walking the string, it is faster to use that information (just like in COM).
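The difference between using a stored length and walking the string can be sketched in C as follows. This is an assumption-laden illustration: struct counted_string is a hypothetical type that merely resembles, in spirit, a string that carries its own length; it is not the actual layout of a COM or managed string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counted string: the length travels with the data. */
struct counted_string {
    size_t length;          /* code units, terminator excluded */
    const uint16_t *units;  /* UTF-16 code units */
};

/* Constant time: the length is already known, so just multiply. */
size_t counted_byte_count(const struct counted_string *s)
{
    assert(s != NULL);
    return s->length * sizeof(uint16_t);
}

/* Linear time: a zero-terminated string has to be walked first. */
size_t walked_byte_count(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n * sizeof(uint16_t);
}

/* Sample data: E + COMBINING CEDILLA + COMBINING BREVE. */
const uint16_t sample_units[] = { 0x0045, 0x0327, 0x0306, 0 };
const struct counted_string sample = { 3, sample_units };
```

Both functions give 6 bytes for the sample; the counted version simply gets there without touching the string data.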

Mihai on 10 Nov 2006 12:28 PM:

Speaking of security articles (well, I had to link this to the current topic somehow :-)

Read here:

The article is "8 Simple Rules For Developing More Secure Code"

And rule number 6 is "Don't Write Insecure Code"

Duh! This alone is enough, you don't need any other rule!


