No need to throw out the baby with the streamwriter; they probably could have just put in a replacement

by Michael S. Kaplan, published on 2008/11/13 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/11/13/9065138.aspx


So anyway, Kim's other recent blog, entitled Making a StreamWriter usable even after given garbage characters, highlights an interesting difference in methodology between the way that Windows and .Net handle encoding and codepages.

In Windows (in contrast to the behavior of most NLS API functions, as I have mentioned previously), the WideCharToMultiByte and MultiByteToWideChar functions will use the target buffer up until the point of failure, so that in the case of failure you may be able to do something with the partial results.

Now, without a length indication the options for what can be done are more limited, but if nothing else, subsequent calls will not be affected by their predecessors.

.Net, on the other hand, has a default behavior here that leaves the StreamWriter useless once a write to the stream fails.

The description in Kim's blog did not fully explain the problem, so I'll fill in the blank. :-)

She said:

For example, on an attempt to write U+DFC9, which is only half of a Unicode character (not a complete surrogate pair) an EncoderFallbackException was thrown

Now we have a stream here, so why is the story over? Isn't the point of the stream thing that you can do it in chunks? Why would this be unrecoverable?

Well, the problem is that U+dfc9 is a low surrogate.

See The basics of supplementary for a glossary update here!

As I mention in Why do the high surrogates have the low numbers? and other places, a surrogate pair is a high surrogate followed by a low surrogate.
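
To make the definition concrete, here is a minimal Python sketch of the two surrogate ranges and of how a pair combines into one supplementary code point. The ranges and the arithmetic come from the UTF-16 definition; the function names are hypothetical and are not taken from Windows or .Net.

```python
# UTF-16 surrogate ranges: high surrogates have the lower numbers.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the UTF-16 code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the UTF-16 code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def combine(high: int, low: int) -> int:
    """Combine a high surrogate followed by a low surrogate into a code point."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)
```

With these helpers, is_low_surrogate(0xDFC9) is true and is_high_surrogate(0xDFC9) is false, which is why U+dfc9 on its own can only ever be the second half of a pair.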

A lone high surrogate is recoverable because it is incomplete: the low surrogate that completes it may still show up in the next chunk.

But a lone low surrogate with no preceding high surrogate has no place to go, nothing to do -- it is toast unless you have a fallback plan in place, as Kim mentioned.
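
The difference between the two lone-surrogate cases can be sketched as a small state machine. This is a hypothetical Python illustration, not the actual .Net encoder: ChunkedScanner, throw_fallback, replace_fallback and LoneSurrogateError are all made-up names, and the "fallback" here is just a function that is handed the offending code unit.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


class LoneSurrogateError(ValueError):
    """Hypothetical stand-in for an exception-throwing fallback."""


def throw_fallback(unit: int) -> int:
    raise LoneSurrogateError(f"lone surrogate U+{unit:04X}")


def replace_fallback(unit: int) -> int:
    return REPLACEMENT


class ChunkedScanner:
    """Turns chunks of UTF-16 code units into code points (a sketch only)."""

    def __init__(self, fallback):
        self.fallback = fallback
        self.pending_high = None  # high surrogate still waiting for its partner

    def feed(self, units):
        out = []
        for unit in units:
            if self.pending_high is not None:
                high, self.pending_high = self.pending_high, None
                if 0xDC00 <= unit <= 0xDFFF:
                    # The pair is complete, possibly across a chunk boundary.
                    out.append(0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00))
                    continue
                out.append(self.fallback(high))  # the partner never arrived
            if 0xD800 <= unit <= 0xDBFF:
                self.pending_high = unit  # incomplete: wait for more input
            elif 0xDC00 <= unit <= 0xDFFF:
                out.append(self.fallback(unit))  # nothing preceded it; only a fallback helps
            else:
                out.append(unit)
        return out
```

Fed [0x41, 0xD83D] and then [0xDE00, 0x42], a scanner built with replace_fallback holds the high surrogate across the chunk boundary and completes the pair. Fed a lone 0xDFC9, it either substitutes U+FFFD or raises, depending on the fallback chosen. The sketch leaves out an end-of-stream flush for a high surrogate that is still pending.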

Though to be perfectly honest, after situations like that described in The torrents of U+fffd, I would much rather have had the default fallback plan be the U+fffd insertion.

I'm not a fan of the whole U+fffd thing, as I pointed out many times before. But given the huge push to change behavior from "drop illegal sequences" to "replace illegal sequences with the replacement character", I think behavior that did not throw in this case would have made for a better default....

And yes, I know there is a backcompat question here for the behavior, but since behavior was being changed anyway in this "in a service pack" change, there was a good opportunity to take a hard look at changing that default (since even already compiled applications were going to change their behavior!).

 

This post brought to you by (U+fffd, a.k.a. REPLACEMENT CHARACTER)


MSDNArchive on 15 Nov 2008 5:11 AM:

Excellent! But you've set a dangerous precedent...I may start adding "todo: Michael" notes when I'd rather hand off Unicode explanations to the official Unicode bulldog!

MSDNArchive on 15 Nov 2008 5:12 AM:

btw, I wish I'd thought of that title. Right after posting I was actually lamenting my lack of "title flair"

Michael S. Kaplan on 15 Nov 2008 7:27 PM:

Well, you and I could occasionally collaborate on titles. :-)



