Don't want to convert all at once? Well, maybe you could just nibble?

by Michael S. Kaplan, published on 2006/12/10 12:30 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/10/1252776.aspx


So the other day, Snaury asked me:

Hi Michael,

I found your blog while searching for information on converting between code pages, and thought, maybe you could answer the question I have for a very long time?

Today was my first attempt at programming with .NET, and the first application I did was a small character set conversion application. And while reading the documentation I suddenly found that the .NET Encoder/Decoder classes can encode/decode incomplete streams (i.e. those that have incomplete characters, like in the example showing a torn-apart UTF-8 character). I thought: wow!!! The question that has worried me for a very long time is, however, whether it is possible to do the same without using .NET. For example, imagine I have a 1GB file in some encoding (that has heavy MBCS), and I want to reencode it into some other encoding. With MultiByteToWideChar/WideCharToMultiByte I would need to read the whole file (1GB!) into memory, then allocate a buffer for the Unicode version (2GB!), convert it to Unicode, then convert it to another code page, and finally save. What I basically want is a way to load a portion of a file (like 1MB for instance), try to convert all complete characters, then load another 1MB, and continue the conversion from the incomplete character I had in the previous call. With MultiByteToWideChar/WideCharToMultiByte calls, if I don't load the 1MB chunk carefully I might either end up with a default character at the end, or have the whole conversion fail, without even a chance to know where exactly the conversion failed or to get what was converted successfully. So, I wonder if it is possible to do this without .NET, using the Win32 API only? Are there any functions that could tell me how many characters were converted, and where conversion stopped in the source string?

Thanks in advance,
Snaury.

Kind of similar to Doug Cook's question I covered in Converting, but not all at once?. In that post I pointed out the MLang solution to the problem, but several people in the comments pointed out that they have had problems with this approach.

So I thought maybe I'd suggest an easier way, fraught with fewer dependencies and less peril. :-)

We'll start with the information in Getting exactly ONE Unicode code point out of UTF-8. Not the actual problem that was solved there, but the table that had the distribution of bits within the bytes of legal, valid UTF-8:

1st Byte    2nd Byte    3rd Byte    4th Byte
0xxxxxxx    -           -           -
110yyyyy    10xxxxxx    -           -
1110zzzz    10yyyyyy    10xxxxxx    -
11110uuu    10uuzzzz    10yyyyyy    10xxxxxx
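
To make the table concrete, here is a minimal sketch that maps a single byte onto its row. The name utf8_sequence_length is made up for illustration and is not part of any Windows API.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with
// `b`, or 0 if `b` is a trail byte (10xxxxxx) or cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110yyyyy
    if ((b & 0xF0) == 0xE0) return 3;  // 1110zzzz
    if ((b & 0xF8) == 0xF0) return 4;  // 11110uuu
    return 0;                          // 10xxxxxx (trail byte) or 11111xxx
}
```

The sketch only looks at the bit patterns shown in the table; it makes no attempt to reject overlong or otherwise illegal sequences.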

So it seems pretty easy to just plunk into the byte stream at whatever point you want, look at some of the surrounding bytes, add or subtract anywhere from 0 to 3 bytes based on what you see, and then call MultiByteToWideChar with that adjusted length.

Anyone want to take a stab at the nibble function one would write here to call before the call to MultiByteToWideChar?
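
For what it's worth, here is one rough sketch of such a nibble function, using the "subtract" approach. It assumes the buffer starts on a character boundary and holds well-formed UTF-8 that was merely cut off at the end. The name utf8_nibble is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper (not a Win32 API): given `len` bytes of UTF-8 that start
// on a character boundary, return how many of them can be converted right now.
// Assumes well-formed UTF-8 that was merely cut off at the end.
std::size_t utf8_nibble(const unsigned char* buf, std::size_t len)
{
    // The lead byte of a cut-off sequence is at most 3 bytes from the end.
    for (std::size_t back = 1; back <= 3 && back <= len; ++back) {
        unsigned char b = buf[len - back];
        if ((b & 0xC0) == 0x80) continue;          // trail byte, keep looking
        std::size_t need = (b & 0x80) == 0x00 ? 1  // 0xxxxxxx
                         : (b & 0xE0) == 0xC0 ? 2  // 110yyyyy
                         : (b & 0xF0) == 0xE0 ? 3  // 1110zzzz
                         : 4;                      // 11110uuu
        // Sequence complete: keep everything. Otherwise stop before its lead byte.
        return back >= need ? len : len - back;
    }
    return len;  // no lead byte in the last 3 bytes: leave the buffer alone
}
```

The idea would be to pass the returned count as the cbMultiByte argument of MultiByteToWideChar with CP_UTF8, then move the few leftover bytes to the front of the buffer before reading the next chunk.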

Now obviously this gets a little bit more complicated if one is also dealing with DBCS code pages, but there is a function that will actually return the lead byte ranges (GetCPInfo). Anyone want to take a stab at that one, too? :-)
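
As a starting point, here is a sketch of a lead byte test over ranges laid out the way GetCPInfo reports them in the LeadByte member of CPINFO: inclusive low/high pairs, ended by a pair of zeros. The array below is a hard-coded stand-in holding the Shift-JIS (code page 932) ranges so the sketch compiles without windows.h; both names are made up for illustration.

```cpp
#include <cstddef>

// Stand-in for the CPINFO::LeadByte array that GetCPInfo fills in: inclusive
// [low, high] pairs, terminated by a 0,0 pair. These are the Shift-JIS
// (code page 932) ranges, hard-coded here as an assumption.
const unsigned char kLeadByteRanges932[12] = { 0x81, 0x9F, 0xE0, 0xFC, 0, 0 };

// Hypothetical helper: true if `b` falls inside one of the lead byte ranges.
bool is_lead_byte(const unsigned char* ranges, unsigned char b)
{
    for (std::size_t i = 0; i + 1 < 12 && ranges[i] != 0; i += 2) {
        if (b >= ranges[i] && b <= ranges[i + 1]) return true;
    }
    return false;
}
```

On Windows, IsDBCSLeadByteEx answers the same question for a given code page without the caller walking the ranges by hand.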

 

This post brought to you by 𐌐 (U+10310, a.k.a. OLD ITALIC LETTER PE)


Igor Tandetnik on 10 Dec 2006 1:47 PM:

> Now obviously this gets a little bit more complicated if one is also dealing with DBCS code pages, but there is a function that will actually return the lead byte ranges.

Is it true for any DBCS encoding (as it is for UTF-8) that a trailing byte of a pair cannot be in the leading byte range? In other words, does there exist a DBCS encoding that represents at least one character as a pair "XX YY" where both XX and YY are leading bytes?

I was under the impression that such encodings do exist (e.g. Shift-JIS), and this property makes it rather challenging to walk back in the stream to find a character boundary. In the worst case, one has to walk all the way to the beginning of the stream to find out which byte is which.

Michael S. Kaplan on 10 Dec 2006 2:30 PM:

Actually, it is not quite as bad as that, I believe (statistically speaking). The theoretical string made up entirely of lead bytes and trail bytes that are also able to act as lead bytes is quite unlikely in practice....

Michael S. Kaplan on 10 Dec 2006 2:35 PM:

And of course the question is then -- what is the best/easiest way to detect where one is? :-)
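
One possible way to detect it, sketched under the assumption of a pure DBCS code page (one or two bytes per character) and a buffer that starts on a character boundary: a byte outside the lead byte range always ends a character, whether it is a single byte or a trail byte, so only the run of lead-range bytes at the very end of the chunk matters. If that run has odd length, the last byte is a lead byte still waiting for its trail byte. The Shift-JIS ranges are hard-coded and the function names are hypothetical.

```cpp
#include <cstddef>

// Assumption: Shift-JIS (code page 932) lead byte ranges, hard-coded for the sketch.
bool is_lead_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: how many of the `len` bytes form complete characters.
// Counts the run of lead-range bytes at the end of the buffer; those bytes
// pair up from the front of the run, so an odd run leaves a dangling lead byte.
std::size_t dbcs_nibble(const unsigned char* buf, std::size_t len)
{
    std::size_t run = 0;
    while (run < len && is_lead_932(buf[len - 1 - run])) ++run;
    return (run % 2 == 1) ? len - 1 : len;
}
```

The walk back stops at the first byte outside the lead byte range, so it only reaches the start of the buffer in the all-lead-bytes case discussed above.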

Erzengel on 10 Dec 2006 6:28 PM:

Well, a function for UTF8:

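[code]

#include <cstddef>
#include <stdexcept>

//Returns the number of bytes at the start of the buffer that form complete UTF-8 sequences
//Warning: This is for partial loading where only the end is cut off. Do not use this for any sort of UTF8 validation or on inconsistent data.

std::size_t UTF8_CompleteLength(const char* UTF8_String, std::size_t IncompleteLength)
{
  const unsigned char UTF8_MultiByte_Mask = 0x3 << 6;     //top two bits
  const unsigned char UTF8_MultiByte_Control = 0x2 << 6;  //10xxxxxx: trail byte
  const unsigned char UTF8_SingleByte_Mask = 0x1 << 7;
  const unsigned char UTF8_SingleByte_Control = 0;        //0xxxxxxx: single byte

  if(IncompleteLength == 0)
     return 0;

  if((UTF8_String[IncompleteLength - 1] & UTF8_SingleByte_Mask) == UTF8_SingleByte_Control)
     return IncompleteLength;

  //Walk back over trail bytes to the lead byte of the last sequence.
  //The index is unsigned, so test it before decrementing rather than with pos >= 0.
  for(std::size_t pos = IncompleteLength; pos-- > 0; )
  {
     const unsigned char b = static_cast<unsigned char>(UTF8_String[pos]);
     if((b & UTF8_MultiByte_Mask) != UTF8_MultiByte_Control)
     {
        if((b & UTF8_SingleByte_Mask) == UTF8_SingleByte_Control)
           throw std::runtime_error("Reached a single-byte character without finding a lead byte");

        //Lead byte: 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes
        const std::size_t expected = (b >= 0xF0) ? 4 : (b >= 0xE0) ? 3 : 2;
        const std::size_t present = IncompleteLength - pos;

        //Keep the last sequence only if all of its bytes are present.
        //TODO: Additional error checking for more trail bytes than the lead byte allows.
        return (present >= expected) ? IncompleteLength : pos;
     }
  }

  throw std::runtime_error("No complete unicode characters were found");
}[/code]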

But they want a standard function that would work with all code pages. Which is why we like .net, yes? :)

Mihai on 11 Dec 2006 1:14 AM:

<<Actually, it is not quite as bad as that, I believe (statistically speaking). The theoretical string made up entirely of lead bytes and trail bytes that are also able to act as lead bytes is quite unlikely in practice....>>

Although that is unlikely, a function cannot really work based on statistics (unless it is called IsStringUnicode :-)

So one will have to iterate over the whole buffer (not so nice, but it is the only way).

Then you should make sure you don't deal with GB-18030, where GetCPInfo is not enough :-)

Michael S. Kaplan on 11 Dec 2006 1:48 AM:

Well, call that the worst case. What would the function look like?
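
A minimal sketch of that worst-case function, assuming a pure DBCS code page and a buffer that starts on a character boundary: scan forward from the start, stepping two bytes at a lead byte and one byte otherwise. The Shift-JIS ranges are hard-coded and both names are hypothetical. It does not handle GB-18030, where characters can be four bytes long.

```cpp
#include <cstddef>

// Assumption: Shift-JIS (code page 932) lead byte ranges, hard-coded for the sketch.
bool is_lead_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: scan the whole buffer from the front and return the
// number of bytes that form complete one-byte or two-byte characters.
std::size_t dbcs_complete_length(const unsigned char* buf, std::size_t len)
{
    std::size_t pos = 0;
    while (pos < len) {
        std::size_t step = is_lead_932(buf[pos]) ? 2 : 1;
        if (pos + step > len) break;  // lead byte whose trail byte is not here yet
        pos += step;
    }
    return pos;
}
```

This touches every byte of the chunk, but each chunk is scanned only once, so the cost stays linear in the size of the file.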


