by Michael S. Kaplan, published on 2006/09/12 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/12/750218.aspx
So a few weeks ago Doug Cook asked:
I've been banging my head against this for a while now. Maybe you'll have a clue.
I have some text processing utilities that need to be able to read, parse, manipulate, and write text. Right now, they treat everything as ASCII. Strangely, that actually works ok for a large number of cases. We only run into trouble when the second byte of a multi-byte character is a byte that the parser recognizes, such as 0x0D, 0x0A, or (this one seems to be the most troublesome) 0x22.
In any case, I need to fix things. Most of this is (more or less) straightforward, but there is one thing I can't figure out.
Is there any Windows API that allows me to safely convert from one encoding to another without processing the entire file in one shot?
Some of these files can be fairly big. It would be great to be able to read them in small chunks. I want to read 4k of the file, convert that much input into UTF-16, process it, then repeat, preserving leftover state (lead bytes, combining characters) between calls.
In .NET, this just works. You don't have to worry about it at all. You just say "open a file in encoding X, then give me a line of UTF-16 text", and you get it.
In UNIX, you use iconv. It gets a bit more complicated because you have to manage your own buffers, but it works. iconv stops when either the source or the target buffer fills up, tells you how much of both source and target buffers it used or filled, and saves state between calls.
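For reference, that iconv loop can be sketched in a few lines (POSIX iconv as shipped with glibc; the chunk and buffer sizes here are arbitrary, and handling for a full output buffer or invalid input is omitted for brevity):

```cpp
#include <iconv.h>
#include <algorithm>
#include <cerrno>
#include <string>

// Convert UTF-8 input to UTF-16LE in fixed-size chunks, carrying any
// incomplete trailing byte sequence over to the next call -- the state
// preservation that iconv's inbuf/inbytesleft interface makes possible.
std::u16string convert_in_chunks(const std::string& src, size_t chunk)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    std::u16string out;
    std::string carry;                          // leftover lead bytes
    for (size_t pos = 0; pos < src.size(); )
    {
        size_t take = std::min(chunk, src.size() - pos);
        std::string in = carry + src.substr(pos, take);
        pos += take;
        carry.clear();

        char* inp = &in[0];
        size_t inleft = in.size();
        alignas(char16_t) char buf[512];
        char* outp = buf;
        size_t outleft = sizeof(buf);

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
            && errno == EINVAL)
            carry.assign(inp, inleft);          // incomplete sequence: save it
        out.append(reinterpret_cast<char16_t*>(buf),
                   (sizeof(buf) - outleft) / sizeof(char16_t));
    }
    iconv_close(cd);
    return out;
}
```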
In Win32, I can't figure out any simple way to do this. The only applicable tools seem to be IsDBCSLeadByteEx, MultiByteToWideChar, and WideCharToMultiByte. Using these plus some hand-rolled UTF-8 support, I guess I could come close (make sure I only pass in complete characters to MultiByteToWideChar, and make sure I don't end on an incomplete surrogate pair in WideCharToMultiByte), but I would still potentially be screwing up things like combining characters and other random stuff that I probably don't understand very well.
Is there any way to use the Windows APIs to do this? (Preferably in a way that works on Win2k.)
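The "only pass in complete characters" half of that plan is at least mechanical for UTF-8, since the encoding is self-describing enough that an incomplete trailing sequence can be found by inspecting at most the last few bytes. A minimal sketch of that idea (a hypothetical helper, not any Windows API):

```cpp
#include <cstddef>

// Given a UTF-8 buffer, return how many bytes at the end belong to an
// incomplete trailing character (0 if the buffer ends on a character
// boundary). Only these bytes need to be held back before handing the
// rest to MultiByteToWideChar or any other whole-character converter.
size_t utf8_incomplete_tail(const unsigned char* buf, size_t len)
{
    if (len == 0) return 0;
    size_t i = len, cont = 0;
    // Walk back over up to three continuation bytes (10xxxxxx).
    while (i > 0 && cont < 3 && (buf[i - 1] & 0xC0) == 0x80) { --i; ++cont; }
    if (i == 0) return 0;                     // malformed; let the converter decide
    unsigned char lead = buf[i - 1];
    size_t need;
    if      ((lead & 0x80) == 0x00) need = 1; // ASCII
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;                            // stray continuation byte
    return (cont + 1 < need) ? len - i + 1 : 0;
}
```

(Combining characters are a different story -- they are complete characters in their own right, so a byte-level converter handles them fine; it is only line- or grapheme-oriented processing that has to care.)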
The truth is that Doug is right -- MultiByteToWideChar is particularly ill-suited to this type of scenario where the input might be streaming or somesuch.
Kind of a shame that the encoding functions are willing to use the target buffer even on error, yet they will not report how much was copied to the destination buffer or how much was consumed from the source buffer -- arguably two important things to know about when one does a partial conversion.
So it is smart enough to do the job, but not smart enough to tell you how far it got if it fails. :-(
Luckily, all hope is not lost. You can look to MLang, the MultiLanguage object, and in particular its IMLangConvertCharset::DoConversionToUnicode method, which basically just calls the IMLangConvertCharset::DoConversion method. On return of the function, its pcSrcSize will contain the number of bytes that were converted from the source string, and its pcDstSize will contain the number of UTF-16 code units written. Everything that one needs to get this job done! :-)
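In rough terms the calling pattern looks like this (a Windows-only sketch, not compiled here; error handling is omitted and the source code page 1252 and buffer size are just examples):

```cpp
// Sketch: chunked code-page-to-UTF-16 conversion via MLang.
#include <windows.h>
#include <mlang.h>

void SketchMLangConversion(const char* src, UINT cbSrc)
{
    CoInitialize(NULL);

    IMLangConvertCharset* pConv = NULL;
    CoCreateInstance(CLSID_CMLangConvertCharset, NULL, CLSCTX_INPROC_SERVER,
                     IID_IMLangConvertCharset, (void**)&pConv);
    pConv->Initialize(1252, 1200, 0);    // source code page -> UTF-16

    WCHAR dst[4096];
    while (cbSrc > 0)
    {
        UINT cSrc = cbSrc;               // in: bytes available
        UINT cDst = ARRAYSIZE(dst);      // in: room in the output buffer
        pConv->DoConversionToUnicode((CHAR*)src, &cSrc, dst, &cDst);
        // On return, cSrc is how many source bytes were consumed and
        // cDst is how many WCHARs were written -- exactly the two
        // numbers MultiByteToWideChar refuses to give you.
        src += cSrc;
        cbSrc -= cSrc;
        // ... process cDst WCHARs from dst here ...
    }

    pConv->Release();
    CoUninitialize();
}
```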
If you don't want to create COM objects, you can even use the exports the DLL provides (ConvertINetMultiByteToUnicode and ConvertINetUnicodeToMultiByte) to get the same sort of thing done....
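The flat exports follow the same shape (again a Windows-only sketch with error handling omitted; 65001/UTF-8 is just an example encoding). The dwMode parameter is where the conversion state lives between calls -- start it at zero and pass the same variable back in for each chunk:

```cpp
// Sketch: chunked conversion via the flat mlang.dll export instead of COM.
#include <windows.h>
#include <mlang.h>

void SketchFlatConversion(const char* src, INT cbSrc)
{
    DWORD dwMode = 0;    // conversion state, carried across calls
    WCHAR dst[4096];
    while (cbSrc > 0)
    {
        INT cSrc = cbSrc;
        INT cDst = ARRAYSIZE(dst);
        ConvertINetMultiByteToUnicode(&dwMode, 65001, src, &cSrc, dst, &cDst);
        // cSrc and cDst now hold the consumed/produced counts, and
        // dwMode preserves any partial state for the next chunk.
        src += cSrc;
        cbSrc -= cSrc;
        // ... process cDst WCHARs from dst ...
    }
}
```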
This post brought to you by ዬ (U+12EC, a.k.a. ETHIOPIC SYLLABLE YEE)
Richard Smith on 12 Sep 2006 7:12 AM:
Nick Lamb on 12 Sep 2006 10:24 AM:
James Holderness on 13 Sep 2006 2:57 AM:
nik on 14 Nov 2006 12:15 PM:
0x0D, 0x0A as the second byte in a code page?
Is it not so that none of the C0 characters occur as a trail byte?