Converting, but not all at once?

by Michael S. Kaplan, published on 2006/09/12 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/12/750218.aspx


So a few weeks ago Doug Cook asked:

I've been banging my head against this for a while now. Maybe you'll have a clue.

I have some text processing utilities that need to be able to read, parse, manipulate, and write text. Right now, they treat everything as ASCII.  Strangely, that actually works ok for a large number of cases. We only run into trouble when the second byte of a multi-byte character is a byte that the parser recognizes, such as 0x0D, 0x0A, or (this one seems to be the most troublesome) 0x22.

In any case, I need to fix things. Most of this is (more or less) straightforward, but there is one thing I can't figure out.

Is there any Windows API that allows me to safely convert from one encoding to another without processing the entire file in one shot?

Some of these files can be fairly big.  It would be great to be able to read them in small chunks. I want to read 4k of the file, convert that much input into UTF-16, process it, then repeat, preserving leftover state (lead bytes, combining characters) between calls.

In .NET, this just works.  You don't have to worry about it at all.  You just say "open a file in encoding X, then give me a line of UTF-16 text", and you get it.

In UNIX, you use iconv.  It gets a bit more complicated because you have to manage your own buffers, but it works.  iconv stops when either the source or the target buffer fills up, tells you how much of both source and target buffers it used or filled, and saves state between calls.

In Win32, I can't figure out any simple way to do this.  The only applicable tools seem to be IsDBCSLeadByteEx, MultiByteToWideChar, and WideCharToMultiByte. Using these plus some hand-rolled UTF-8 support, I guess I could come close (make sure I only pass in complete characters to MultiByteToWideChar, and make sure I don't end on an incomplete surrogate pair in WideCharToMultiByte), but I would still potentially be screwing up things like combining characters and other random stuff that I probably don't understand very well.

Is there any way to use the Windows APIs to do this? (Preferably in a way that works on Win2k.)

The truth is that Doug is right -- MultiByteToWideChar is particularly ill-suited to this type of scenario where the input might be streaming or somesuch.
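The workaround Doug describes (only ever hand the converter complete characters) can be sketched for the two easy cases, UTF-8 input and UTF-16 input. This is a minimal illustration, not a full solution: the helper names utf8_complete_prefix and utf16_complete_prefix are made up for this example, nothing here validates the data, and it does nothing about combining characters, exactly as Doug feared. For a DBCS code page the same idea would need IsDBCSLeadByteEx and a scan from the start of the buffer, because trail bytes can look like lead bytes when you look backwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the length of the longest prefix of
   buf[0..len) that does not stop in the middle of a UTF-8 sequence.
   Malformed input is passed through untouched so the real converter
   can decide what to do with it. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* A UTF-8 sequence is at most 4 bytes, so look back at most 4. */
    for (size_t back = 0; back < len && back < 4; back++) {
        unsigned char b = buf[len - 1 - back];
        size_t need;

        if ((b & 0xC0) == 0x80)
            continue;                  /* continuation byte, keep looking */

        if (b < 0x80)                need = 1;   /* ASCII */
        else if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return len;               /* invalid lead byte: pass through */

        /* back + 1 bytes of this sequence are present in the buffer. */
        return (back + 1 < need) ? len - (back + 1) : len;
    }
    return len;  /* empty, or only continuation bytes in the tail */
}

/* Hypothetical helper: same idea for UTF-16 code units, so that a
   buffer never ends on a high surrogate whose partner has not
   arrived yet. */
size_t utf16_complete_prefix(const uint16_t *buf, size_t len)
{
    if (len > 0 && buf[len - 1] >= 0xD800 && buf[len - 1] <= 0xDBFF)
        return len - 1;
    return len;
}
```

The caller converts only the returned prefix and moves the remaining few bytes (or the one code unit) to the front of the next chunk before reading more.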

Kind of a shame that the encoding functions are willing to write to the target buffer even on error, yet will not report how much was written to the destination buffer or how much was consumed from the source buffer, arguably the two most important things to know when one does a partial conversion.

So it is smart enough to do the job, but not smart enough to tell you how far it got if it fails. :-(
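For contrast, here is roughly what the iconv behavior Doug describes looks like in practice. This is a sketch under assumptions, not a recommendation: the chunk_converter type and its two functions are invented for this example, the 4096 and 16 byte limits are arbitrary, and a full output buffer (E2BIG) is simply treated as failure to keep things short. The point is only that iconv advances the input pointer, so an incomplete sequence at the end of a chunk (EINVAL) can be carried over to the next call.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_MAX   4096   /* largest chunk the caller may feed (assumption) */
#define PENDING_MAX 16     /* room for one incomplete trailing sequence */

/* Hypothetical wrapper around an iconv descriptor plus the bytes
   left over from the previous chunk. */
typedef struct {
    iconv_t cd;
    char    work[PENDING_MAX + CHUNK_MAX];
    size_t  pending_len;   /* leftover bytes stored at the start of work[] */
} chunk_converter;

/* Returns 0 on success, -1 if the encoding pair is not supported. */
int chunk_converter_open(chunk_converter *cv, const char *from, const char *to)
{
    cv->cd = iconv_open(to, from);   /* note the argument order: to, from */
    cv->pending_len = 0;
    return cv->cd == (iconv_t)-1 ? -1 : 0;
}

/* Converts one chunk. Returns the number of bytes written to out, or
   (size_t)-1 on invalid input (EILSEQ) or a too-small output buffer
   (E2BIG). An incomplete sequence at the end of the chunk is not an
   error; it is kept for the next call. */
size_t chunk_converter_feed(chunk_converter *cv,
                            const char *chunk, size_t chunk_len,
                            char *out, size_t out_cap)
{
    assert(chunk_len <= CHUNK_MAX);

    /* Append the new chunk after whatever was left over last time. */
    memcpy(cv->work + cv->pending_len, chunk, chunk_len);

    char  *in       = cv->work;
    size_t in_left  = cv->pending_len + chunk_len;
    char  *out_ptr  = out;
    size_t out_left = out_cap;

    size_t rc = iconv(cv->cd, &in, &in_left, &out_ptr, &out_left);
    if (rc == (size_t)-1 && errno != EINVAL)
        return (size_t)-1;

    /* Success or EINVAL: `in` now points at the first unconverted byte. */
    if (in_left > PENDING_MAX)
        return (size_t)-1;
    memmove(cv->work, in, in_left);
    cv->pending_len = in_left;

    return out_cap - out_left;
}
```

A caller would open the converter once, feed it each 4k read from the file, process the UTF-16 that comes back, and call iconv_close on cv.cd at the end. On platforms where iconv lives in a separate library this needs to be linked with -liconv.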

Luckily, all hope is not lost. You can look to MLang, the MultiLanguage object, and in particular its IMLangConvertCharset::DoConversionToUnicode method, which basically just calls the IMLangConvertCharset::DoConversion method. Its pcSrcSize will, on return from the function, contain the number of bytes that were converted in the source string, and its pcDstSize will contain the number of UTF-16 code points. Everything that one needs to get this job done! :-)

If you don't want to create COM objects, you can even use the exports it provides (ConvertINetMultiByteToUnicode and ConvertINetUnicodeToMultiByte) to get the same sort of thing done....

 

This post brought to you by ዬ (U+12ec, a.k.a. ETHIOPIC SYLLABLE YEE)


Richard Smith on 12 Sep 2006 7:12 AM:

... or you could use iconv. It works under Win32...

Nick Lamb on 12 Sep 2006 10:24 AM:

Judging from the documentation, or lack of it, for these MLang functions I wouldn't trust them as far as I could throw them.

Converting between encodings is hard; anyone who has thought about it enough to get it right would have a lot more to say than this†. If the input or output buffers are too short to make progress, what happens? Is that considered an error? What happens to invalid characters or non-roundtrip transformations? Are such characters replaced (with what)? Do they report an error? What is in the size variables when an error occurs? Can we rescue our conversion in this scenario? If the encoding I want to use isn't included, how do I add it?

The GNU iconv documentation answers all these questions; the documentation for ConvertINetMultiByteToUnicode was written by someone who hasn't considered those questions, and maybe that means the people who implemented these functions didn't either, in which case your data is not in safe hands. Run in the opposite direction as fast as you can.

Also this function is part of Internet Explorer, and we know from previous articles that Internet Explorer gives incorrect results when transcoding, up until at least IE 7. Unless bug compatibility with old versions of IE is important you don't want that.

† On the other hand MultiByteToWideChar is proof that you can say a lot on this subject and still not be getting it right.

Oh, and those would be UTF-16 code _units_ Michael, not code points.

James Holderness on 13 Sep 2006 2:57 AM:

I've had the pleasure of using the ConvertINetXxxToXxx functions in a streaming situation and I can confirm that they don't work well at all. If the destination buffer isn't big enough, they will return an error and set the destination count to zero (which isn't at all useful). The only solution I could see was to keep repeating the function call with a smaller source length (my destination buffer was a fixed size) until it returned a success.

And that wasn't the only problem - there were other weird quirks too. Unless you're writing code that has to deal with dozens of different charsets you're probably better off avoiding the OS provided functions and writing your own.

nik on 14 Nov 2006 12:15 PM:

0x0D, 0x0A as the second byte in a code page?

Is it not so that all the C0 characters do not occur in any trail byte?


referenced by

2006/12/10 Don't want to convert all at once? Well, maybe you could just nibble?
