Don't sneak a BOM in on someone who promises to ignore free space

by Michael S. Kaplan, published on 2008/07/26 17:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/07/26/8776572.aspx


This is not a blog advising terrorists on how to circumvent the efforts of TSA inspectors!

Developer Sean mentioned:

Not sure who to address this to, but we just noticed that the wide string conversion functions don’t handle the whitespace Unicode markers (0xfeff).

He first noticed the behavior in wcstoul, a function whose documentation clearly describes its whitespace behavior:

expects nptr to point to a string of the following form:

[whitespace] [{+ | -}] [0 [{ x | X }]] [digits]

A whitespace may consist of space and tab characters, which are ignored...

Okay, so it is easy to see why he was expecting it to be ignored, but that leads us to wonder how wcstoul decides what "whitespace" is -- is it doing a simple check for tab and space?

The great thing about the C Runtime is that the source is right there so anyone can take a look. Let's do that now. The function can be found in VC\crt\src\wcstol.c, and the relevant bit of the function is:

    while ( _iswspace_l(c, _loc_update.GetLocaleT()) )
        c = *p++;       /* skip whitespace */

Ok, so the function skips the initial whitespace, like it claims to. But U+feff, the famous ZERO WIDTH NO-BREAK SPACE, obviously fails this particular test.
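
A quick way to see this for yourself is the minimal sketch below. It uses only the standard iswspace and wcstoul; the helper names bom_is_wspace and parse_with_leading_bom are made up for illustration and are not part of the CRT. It assumes the default "C" locale, and the test string is built element by element so that the 0xFEFF value cannot run into the digits that follow it.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: nonzero if the runtime classifies U+FEFF as whitespace. */
int bom_is_wspace(void)
{
    return iswspace((wint_t)0xFEFF) != 0;
}

/* Hypothetical helper: parses a wide string made of U+FEFF followed by "42".
   Stores the number of characters wcstoul consumed in *consumed. */
unsigned long parse_with_leading_bom(size_t *consumed)
{
    /* An array initializer keeps the BOM separate from the digits. */
    static const wchar_t text[] = { 0xFEFF, L'4', L'2', L'\0' };
    wchar_t *end = NULL;
    unsigned long value;

    assert(consumed != NULL);

    /* If U+FEFF is not skipped as whitespace, no digits are found and
       wcstoul leaves end pointing at the start of the string. */
    value = wcstoul(text, &end, 10);
    *consumed = (size_t)(end - text);
    return value;
}
```

If the BOM is not treated as whitespace, bom_is_wspace reports 0 and parse_with_leading_bom returns 0 with nothing consumed, which is the behavior Sean ran into.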

It turns out that iswspace and its cousins like _iswspace_l are using the character property information that comes out of the NLS GetStringTypeW function, which I have talked about before.

So where does GetStringTypeW decide what is a C1_SPACE or C1_BLANK?

This is something I mentioned last year in The difference between C1_SPACE-ing out and drawing a C1_BLANK, and clearly from that list you can see that although some space characters are covered there, ZERO WIDTH NO-BREAK SPACE is not -- because Unicode calls it a formatting character (general category Cf), not a space -- and NLS goes along with that.

It turns out that the code in question was grabbing its source string from a file that started with a BOM -- which kind of points to the best way to resolve the problem: strip the BOM out since it is a part of the file "envelope" and not a part of its content....
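
One way that stripping might look is sketched below, assuming the file contents have already been decoded into a wchar_t string. The names skip_bom and parse_after_bom are hypothetical helpers invented for this example, not anything from the code in question.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: returns a pointer just past a leading U+FEFF,
   or s unchanged when the string does not start with one. */
const wchar_t *skip_bom(const wchar_t *s)
{
    assert(s != NULL);
    return (s[0] == 0xFEFF) ? s + 1 : s;
}

/* Hypothetical helper: drops the BOM first, then lets wcstoul handle
   any ordinary leading whitespace and the digits as usual. */
unsigned long parse_after_bom(const wchar_t *s)
{
    return wcstoul(skip_bom(s), NULL, 10);
}
```

Only a BOM at the very start of the text is dropped here; a U+FEFF that shows up later in the string is left alone, since at that point it is content rather than envelope.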

 

This blog brought to you by U+feff (aka ZERO WIDTH NO-BREAK SPACE)



referenced by

2009/01/07 Someone please detect if there's a BOM before the plane takes off!
