The torrents of U+fffd (aka When security and conformance trump compatibility and reality)

by Michael S. Kaplan, published on 2007/09/17 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2007/09/17/4950277.aspx


(The purpose for the characters below should be apparent presently!)

U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd U+fffd 

This post is my personal opinion and I am not speaking officially on behalf of my current team, my former team, the executive staff of Microsoft, or anyone in between. They may agree with me, they may not, and if you want the quote from any of them, you'll have to ask them. You have been warned!

Remember when I posted "Offense in depth" in response to Shawn's "Security patch MS07-040 for .Net 2.0 breaks some culture names for .Net 2.0 on Windows XP/2003/2000"?

Well, as he later pointed out (in "UTF-16, UTF-8 & UTF-32 update to conform with Unicode 5.0's security concerns"), this change also includes the update that was done for the "Change to Unicode Encoding for Unicode 5.0 conformance", which is currently being reported as the cause of problems like "Application.ExecutablePath cannot handle all characters" and others.

You can see some of this described in MSKB 940521.

All of those problems have a single underlying cause.

And no, I am not referring to the cause being the patch. :-)

That is why I said underlying. I am looking for conceptual explanations, the kind that come up in RCA (Root Cause Analysis) conversations. Claiming "the cause is the patch" in that context would just make you look foolish!

The underlying cause is that whether you support the Unicode standard or not, whether you support the latest version or not, and whether you get the updates to managed/unmanaged code or not, not every piece of the underlying operating system supports this standard.

And that means that not everything that the OS can do is accessible to any method or function that changes Unicode encoding or normalization form.

In the one example feedback bug, for whatever crazy-ass reason, .NET does file/path stuff in UTF-8 in some part, perhaps shared code with some of the URI stuff? Whatever. But since they do that, there are pieces of the file system that are unavailable to some of their methods, in the name of security and conformance.
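A minimal sketch of that mismatch, written in Python rather than .NET, so it illustrates the concept and not the Framework's actual code path. NTFS names are sequences of 16-bit code units, so a name can contain an unpaired surrogate such as U+D800, but a strict UTF-8 encoder has no conformant byte sequence for it. The file name and the helper `to_utf8_strict` are hypothetical.

```python
# Hypothetical NTFS-legal name containing an unpaired high surrogate (U+D800).
name = "Hello\ud800World.txt"

def to_utf8_strict(text: str):
    """Return the UTF-8 bytes, or None if the text cannot be encoded."""
    try:
        return text.encode("utf-8")  # strict mode rejects surrogate code points
    except UnicodeEncodeError:
        return None

print(to_utf8_strict("HelloWorld.txt"))  # well-formed name: encodes fine
print(to_utf8_strict(name))              # None: no conformant UTF-8 form exists
```

Any layer that insists on conformant UTF-8 has to reject, drop, or replace something here, and whichever it picks, the original name does not survive the trip.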

And so on, multiplied by potentially any managed (and in the future perhaps even some unmanaged?) entry points that developers use to get their work done.

Personally, I think this is a bad idea.

I mean, there has to be an option to do these things. Obviously. The standard is there for a reason.

But do you tell a virus-checking program vendor that some files are not available to be scanned because they are not conformant to the Unicode standard, when they are legal in NTFS?

Due to a change in either a service pack or monthly security hotfix to the .NET Framework?

Ick.

In my opinion, there has to be a way to allow some of these things that were previously allowed, to make sure such cases are not divorced from the realities that (a) the OS allows stuff and (b) the .NET Framework is a popular way to do stuff. These realities need to be cooperative, not competitive -- they must have a way to support each other, without thwarting each other.

Maybe people will decide this change had to happen. But I doubt everyone who was broken by it had the nature of the break communicated to them in time to do anything about it.

In the meantime, if you are a developer, please meet U+fffd. You may well be seeing a lot more of him....

 

This post brought to you by (U+fffd, a.k.a. REPLACEMENT CHARACTER)


# William Overington on Monday, September 17, 2007 5:33 AM:

Readers who are interested in trying to use U+FFFD with various packages, such as WordPad and Word and so on, might be interested to know that my Quest text font has a glyph for U+FFFD.  The glyph is a large pixelated "white" question mark upon a "black" background.

I am using Windows XP.  I tried Alt 65533 in WordPad and no glyph is displayed, though there is forward movement, which may be that of a space character.  65533 is the decimal equivalent of the hexadecimal value FFFD.  I used Insert Symbol with Word 97 and the glyph is displayed.  I copied the character onto the clipboard and pasted back into Word 97.  That works.  I tried pasting into WordPad and that did not work.

In Word 97 I tried reformatting the U+FFFD character using various well-known fonts and all of those which I tried resulted in black rectangles.

The Quest text font is available as a free download from the following link.

http://www.users.globalnet.co.uk/~ngo/QUESTTXT.TTF

The link and some notes about the font are available at the following web page.

http://www.users.globalnet.co.uk/~ngo/fonts.htm

There is a thread about the font at the following web page.

http://forum.high-logic.com/viewtopic.php?t=682

I mentioned the U+FFFD character in the fourth post in the following thread.

http://forum.high-logic.com/viewtopic.php?t=1102

William Overington

17 September 2007

# Dean Harding on Monday, September 17, 2007 7:32 PM:

But is not the difference between "before the update" and "after the update" that instead of dropping invalid data, it replaces it with U+FFFD? In that case, wouldn't it mean that those "valid in Win32 but not .NET" paths would have still existed before the update?

What I mean is that before the update, the code would do the transform like so:

"C:\\Hello\uD800World"  -->  "C:\\HelloWorld"

Whereas after the update, it would be:

"C:\\Hello\uD800World" --> "C:\\Hello\uFFFDWorld"

So there was always the problem of not being able to access valid Win32 paths from .NET -- the problem is just a little different.

At least, that's how I read it....
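The two transforms described above can be sketched in Python (an illustration only; `scrub` is a hypothetical helper, not anything in the .NET Framework):

```python
def scrub(path: str, replacement: str) -> str:
    """Swap every surrogate code point in `path` for `replacement`."""
    # A Python str holds code points, so any surrogate in it is unpaired.
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in path
    )

original = "C:\\Hello\ud800World"
before_update = scrub(original, "")        # invalid data dropped
after_update = scrub(original, "\ufffd")   # invalid data replaced with U+FFFD

# Either way the result no longer names the original file.
print(before_update != original, after_update != original)
```

In both cases the transformed string differs from the path the file system actually holds, which is the point being made: the failure changes shape, but the path was unreachable either way.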

# Michael S. Kaplan on Monday, September 17, 2007 9:43 PM:

Well, that is part of it (and a not entirely good part, IMHO) but really any invalid sequences get dropped as well (imagine non-shortest form UTF-8 and more from UTF-8 URIs, other invalid sequences in almost any encoding, etc.) -- and suddenly there is a disconnect between managed and unmanaged that once worked and now doesn't, in a hotfix or an SP, etc.
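For the non-shortest form case, here is a small sketch using Python's UTF-8 decoder (which is not the .NET one, so treat it as an illustration of the category of input, not of the patch's exact behavior). The bytes 0xC0 0xAF are an overlong encoding of "/", and `decode_strict` is a hypothetical helper.

```python
# 0xC0 0xAF is a non-shortest-form (overlong) encoding of "/" (U+002F).
overlong = b"a\xc0\xafb"

def decode_strict(data: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# Lenient decoding substitutes U+FFFD for the bytes it cannot accept.
lenient = overlong.decode("utf-8", errors="replace")

print(decode_strict(overlong))   # None: strict decoding rejects the sequence
print("\ufffd" in lenient)       # True: the bad bytes became U+FFFD
print("/" in lenient)            # False: the hidden slash never appears
```

That is the security argument in miniature: the disguised slash cannot sneak through. The cost is that a caller who used to get something usable back now gets U+FFFD or nothing.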


referenced by

2011/06/24 An irresistible force walks into an immovable object (aka the Thai that binds us)

2010/11/01 The consequences of being unintuitive and nonconformant

2008/11/13 No need to throw out the baby with the streamwriter; they probably could have just put in a replacement

2008/05/11 The vector of this spam is [apparently] indeterminate

2008/04/09 Fight the Future? (#11 of ??), aka Microsoft is giving this character nada weight but lotsa importance

2007/10/23 If working above U+FFFF is a problem in your program, then so is the basic stuff, too
