Hidden in plain site: a purloined letter kind of a bug report

by Michael S. Kaplan, published on 2011/03/09 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/03/09/10138478.aspx

Over in the Suggestion Box, Andrew Dunbar asked:

The Windows API WriteFile is documented to put the number of *bytes* written in the variable pointed to by lpNumberOfBytesWritten, but when writing to the console set to codepage 65001 (Unicode UTF-8) it actually returns the number of characters or Unicode codepoints written. Is this a bug or a feature or a misreading of the docs? There is a year old bug report but has anything been looked into over that year?


I'm glad that finding this bug led me to your blog by the way, I've been reading it regularly since.

Indeed, that Connect report is interesting:

I believe there is an issue with WriteFile [1]. Either in it's implementation, or its documentation is lacking.

WriteFile is supposed to return the number of _bytes_ written. Taking the same input, the returned value for number of bytes written can be different for different codepages. And when writing UTF-8 data with multibyte characters to stdout it seems to return the number of _characters_ written.

For example when writing a char* buffer with value { 0xC3, 0xA4, 0 } (character ä encoded as UTF-8) WriteFile returns the following values:

With console output codepage 850:
- Output: +ñ
- Bytes written: 2 (correct, and output is obviously broken)

With console output codepage 65001:
- Output: ä
- Bytes written: 1 (incorrect since "ä" in UTF-8 takes 2 bytes, but output is correct)

This behavior results in issues in Microsoft's C Runtime implementation. Several functions (like fflush) are required to verify that the number of bytes written is equal to the number of bytes in the input. The behavior of WriteFile triggers those checks and output streams are flagged with _IOERR, thus breaking programs that verify streams with ferror().

All tests were done on a German Windows 7 Professional (64-bit) using 32- and 64-bit test programs.

MSVCR90.DLL: 9.0.30729.4926
KERNEL32.DLL: 6.1.7600.16385
cmd.exe: 6.1.7600.16385

Also please refer to this thread [2] in the Visual C++ General forum where the CRT behavior was confirmed for VS 2010 RC.

[1]: http://msdn.microsoft.com/en-us/library/aa365747%28VS.85%29.aspx
[2]: http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/e4b91f49-6f60-4ffe-887a-e18e39250905


Posted by Microsoft on 3/22/2010 at 11:21 PM
Thanks for your feedback.

We are rerouting this issue to the appropriate group within the Visual Studio Product Team for triage and resolution. These specialized experts will follow-up with your issue.

Thank you
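
To make the two numbers in the report concrete, here is a minimal sketch in portable C. The function name utf8_code_points is hypothetical (it is not part of Win32 or the CRT), and it assumes the buffer holds well-formed UTF-8. It counts every byte that is not a continuation byte, which gives the number of code points; the number of bytes is simply the buffer length.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Counts Unicode code points in a UTF-8 buffer by counting every byte
   that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
   Assumes the input is well-formed UTF-8. */
size_t utf8_code_points(const unsigned char *buf, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the buffer { 0xC3, 0xA4 } from the report, the length is 2 bytes while utf8_code_points returns 1, which matches the two different values the report got back under code pages 850 and 65001.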

I can confirm that the bug appears to be just sitting there with the title Unicode issues with WriteFile an in CRT in the Dev10 PS database, with a "Release" field marked Dev11.

It is being treated as a CRT bug, even though it seems more like a Win32 issue and not a CRT issue.

So it is in the wrong database, for the wrong product, being looked at by no one since basically March of last year.

This may as well be considered a lost bug at this point. Kind of a lost in plain site bug, now that I think about it....

The truth is, there is a very good reason for the whole WriteFile/WriteConsoleW split, where calling the wrong function when dealing with consoles and/or redirected consoles gets you screwed up results.

It is because the relationship between these functions and their behavior in the various supported and "unsupported" situations is really screwy.

Perhaps the bug getting lost ain't no thing, since it is largely wrapped up in that screwiness. Improving the documentation is probably the only fix that one might expect here: make it clear that if you are writing to the console you should always use WriteConsoleW and not WriteFile, in order to avoid less than predictable results.

Though the current WriteConsole docs are pretty clear on the situation:

WriteConsole fails if it is used with a standard handle that is redirected to a file. If an application processes multilingual output that can be redirected, determine whether the output handle is a console handle (one method is to call the GetConsoleMode function and check whether it succeeds). If the handle is a console handle, call WriteConsole. If the handle is not a console handle, the output is redirected and you should call WriteFile to perform the I/O.

Perhaps being just as clear in the WriteFile docs will clean this situation up sufficiently....
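
The approach in the WriteConsole documentation quoted above can be sketched in C roughly like this. The helper name write_utf8_to_stdout is hypothetical, the input is assumed to be UTF-8 and shorter than INT_MAX bytes, and the non-Windows branch is only there so the sketch compiles elsewhere. It is one way to do it under those assumptions, not the official fix.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not a Win32 or CRT function): writes len bytes of
   UTF-8 text to standard output. Returns the number of input BYTES
   consumed, or -1 on failure. */
long write_utf8_to_stdout(const char *utf8, size_t len)
{
    if (len == 0)
        return 0;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written = 0;

    if (out == NULL || out == INVALID_HANDLE_VALUE)
        return -1;

    if (GetConsoleMode(out, &mode)) {
        /* GetConsoleMode succeeded, so this is a real console: convert to
           UTF-16 and call WriteConsoleW, which counts characters
           (UTF-16 code units), not bytes. */
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len, NULL, 0);
        if (wlen <= 0)
            return -1;
        wchar_t *wbuf = malloc((size_t)wlen * sizeof *wbuf);
        if (wbuf == NULL)
            return -1;
        MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len, wbuf, wlen);
        BOOL ok = WriteConsoleW(out, wbuf, (DWORD)wlen, &written, NULL);
        free(wbuf);
        /* Report success in terms of the caller's bytes, not characters. */
        return (ok && written == (DWORD)wlen) ? (long)len : -1;
    }

    /* Not a console, so output is redirected: write raw bytes with WriteFile. */
    if (!WriteFile(out, utf8, (DWORD)len, &written, NULL))
        return -1;
    return (long)written;
#else
    /* Assumption: non-Windows fallback so the sketch builds elsewhere. */
    return (long)fwrite(utf8, 1, len, stdout);
#endif
}
```

The point of returning the caller's byte count on the console path is that the caller never has to compare a byte length against a character count, which is exactly the comparison that trips up the CRT in the report.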

Anyway, I'll give the report a few weeks to see if someone cleans up the Connect bug and the Product Studio issue and gets it made into a doc bug. If not, then I'll follow up on it. This might be perceived as me being lazy, but in a lot of cases the right people really do happen to be reading here. And when they are, everything is much easier than tracking the right people down....

Random832 on 21 Mar 2011 6:15 AM:

The reason it's [at least also] a CRT bug is probably because 'fixing' the documentation to tell you not to call WriteFile on the console won't actually cause the CRT to not call WriteFile on the console, and (unlike for WriteFile) it would be completely unreasonable to say you shouldn't call fputs, or cout << "whatever", if stdout is a console. The current behavior is a problem for whatever version of the ISO C/C++ standards VS attempts to conform to, and if Win32 won't change, then the CRT has to.

(There's also the fact that if a multibyte character is split across a _write() boundary it will be lost if written to the console, which probably _can't_ be solved at the WriteFile level)
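
The boundary problem Random832 describes can be illustrated with a small sketch in portable C. The function name utf8_complete_prefix is hypothetical, and the code assumes the data is otherwise well-formed UTF-8. It reports how many leading bytes of a chunk form complete sequences, so that a caller could hold back an unfinished trailing sequence and prepend it to the next chunk instead of writing half a character.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns the number of leading bytes in buf that form complete UTF-8
   sequences. Any bytes after that belong to a sequence that was cut off
   at the end of the chunk. Malformed input is passed through unchanged. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t n)
{
    size_t i = n;
    size_t back = 0;

    /* Step back over at most three trailing continuation bytes. */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return n;               /* empty, or no lead byte in this chunk */

    unsigned char lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)
        need = 1;               /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return n;               /* not a valid lead byte: pass through */

    size_t have = n - (i - 1);  /* bytes present for the last sequence */
    return have < need ? i - 1 : n;
}
```

With the two-byte chunk 'a', 0xC3 this returns 1, meaning the 0xC3 should wait for the 0xA4 that arrives in the next write.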

Roman on 4 Dec 2012 5:28 AM:

Thank you for quoting the report, since the original link is no longer operational (like tens of thousands other links leading to MS Connect...)
