Anyone out there switching modes in JIS?

by Michael S. Kaplan, published on 2006/12/25 13:59 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/25/1362639.aspx


The report was that a specific sequence of bytes was failing conversion via code page 50220 to Unicode when using MultiByteToWideChar but succeeding when using MLang. The bug only repros on XP; in Server 2003 the conversion was working in both technologies.

Now as I pointed out in All code page architectures are created equal, some really are more equal than others. So let's take a look at this "MLang being more equal than Win32 on XP" case, shall we?

An excerpt of the byte sequence that shows the problem is:

1B 24 42 2D 21 2D 22 2D 23 2D 24 2D 25 2D 26 2D 27 2D 28 2D 29 2D 2A 1B 28 4A 1B 24 42 2D 2B 2D 2C

Breaking it down a bit:

1B 24 42
2D 21 2D 22 2D 23 2D 24 2D 25 2D 26 2D 27 2D 28 2D 29 2D 2A
1B 28 4A 1B 24 42
2D 2B 2D 2C

Now you will notice that the three-byte sequence 1B 24 42 that opens the excerpt shows up again later, right after 1B 28 4A. Yung-Shin did an analysis of what was going on:

The 3-byte escape sequence at the start (1b, 24, 42) switches the mode to JIS X 0208-1983, and the two additional 3-byte sequences (1b, 28, 4a and 1b, 24, 42) switch the mode to JIS-Roman and then back to JIS X 0208-1983 again.  This is actually unnecessary because it’s already in JIS X 0208-1983 mode.  However, in XP, these bogus escape sequences cause it to exit the loop and return from the MB2WC call.  In short, if the input contains bogus escape sequences like these, the output is truncated at the bogus escape sequence.  If there are no bogus escape sequences (i.e. the two extra sequences are removed), XP will convert the string just fine.
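To make that analysis easier to follow, here is a minimal Python sketch that walks the excerpt and prints each escape sequence and each two-byte character. It is only an illustration, not how MultiByteToWideChar or MLang is implemented. The names `ESCAPES`, `DOUBLE_BYTE` and `tokenize` are made up for this example, and it assumes that only the four escape sequences listed in the table can appear.

```python
ESC = 0x1B

# Assumption: only these four 3-byte escape sequences occur in the input.
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS-Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}
DOUBLE_BYTE = {"JIS X 0208-1978", "JIS X 0208-1983"}


def tokenize(data: bytes):
    """Yield ("escape", mode_name) and ("char", raw_bytes) tokens."""
    mode = "ASCII"  # assumed starting mode
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError(f"unrecognized escape at offset {i}")
            mode = ESCAPES[seq]
            yield ("escape", mode)
            i += 3
        else:
            # Two bytes per character in the JIS X 0208 modes, one otherwise.
            width = 2 if mode in DOUBLE_BYTE else 1
            yield ("char", data[i:i + width])
            i += width


excerpt = bytes.fromhex(
    "1B 24 42 2D 21 2D 22 2D 23 2D 24 2D 25 2D 26 2D 27 2D 28 2D 29 2D 2A"
    " 1B 28 4A 1B 24 42 2D 2B 2D 2C"
)
tokens = list(tokenize(excerpt))
for kind, value in tokens:
    print(kind, value.hex() if kind == "char" else value)
```

The switch to JIS-Roman and the switch back to JIS X 0208-1983 come out as adjacent tokens with no character between them, which is exactly the part that does no useful work.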

This bug was fixed in Server 2003 by Shawn, by continuing the loop so that the mode is switched correctly.

Now I write blog posts in tools that have no problem inserting HTML like

<font face=Tahoma>Hello </font><font face=Tahoma>Dolly!</font>

(note the completely bogus bit in the middle, </font><font face=Tahoma>, converting out of and right back into the exact same font)

So I can believe there might indeed be editing tools that might do the same thing with encodings that use escape sequences to switch in and out of different modes.

The file containing the errant sequences may well have just been set up to test the specific case, so it may or may not be proof of a real need to do something better here in XP.

But I was wondering if anyone out there had run into this specific bug before, and whether it was blocking them (because writing code to prefilter the bytes would actually be quite a pain to do for obvious reasons).
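For anyone who does need a workaround, here is a rough Python sketch of what such a prefilter might look like. It is a simplification under loudly stated assumptions: it only knows the four escape sequences in its table, it assumes the stream starts in ASCII mode, and it raises on anything else. A real filter for the ISO 2022 code pages would have more cases to handle. The function name `strip_redundant_escapes` is hypothetical.

```python
ESC = 0x1B

# Assumption: only these four 3-byte escape sequences occur in the input.
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS-Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}
DOUBLE_BYTE = {"JIS X 0208-1978", "JIS X 0208-1983"}


def strip_redundant_escapes(data: bytes) -> bytes:
    """Hypothetical prefilter: drop escapes that do not change the mode
    in effect for the next character."""
    out = bytearray()
    mode = "ASCII"   # mode the output is in so far (assumed starting mode)
    pending = None   # most recent escape seen, not yet written out
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError(f"unrecognized escape at offset {i}")
            pending = seq  # a later escape overrides an earlier unused one
            i += 3
            continue
        # A character: write the pending escape only if it changes the mode.
        if pending is not None:
            if ESCAPES[pending] != mode:
                out += pending
                mode = ESCAPES[pending]
            pending = None
        width = 2 if mode in DOUBLE_BYTE else 1
        out += data[i:i + width]
        i += width
    # Keep a trailing escape (such as a final switch back to ASCII)
    # if it changes the mode.
    if pending is not None and ESCAPES[pending] != mode:
        out += pending
    return bytes(out)


excerpt = bytes.fromhex(
    "1B 24 42 2D 21 2D 22 2D 23 2D 24 2D 25 2D 26 2D 27 2D 28 2D 29 2D 2A"
    " 1B 28 4A 1B 24 42 2D 2B 2D 2C"
)
cleaned = strip_redundant_escapes(excerpt)
print(cleaned.hex())
```

On the excerpt above this drops the six bytes 1B 28 4A 1B 24 42 and leaves everything else alone. Even this toy version has to track the current mode to know how wide each character is, which gives a taste of why doing it properly is a pain.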

So, is there anyone using ISO 2022 code pages running into this problem?

 

This post brought to you by U+000e and U+000f (a.k.a. C0 control characters representing SHIFT OUT and SHIFT IN)


# NikWeber on 26 Dec 2006 10:16 PM:

Yes. I've run into massive problems.

When converting from unicode to ISO2022

MLANG will issue a final shift sequence back to JIS roman.

WIN32 (WideCharToMultiByte) will not.

Also there are discrepancies in the conversion tables between MLANG and WIN32.

U+2172 (small roman numeral three) has differing encodings; in fact they use different shift sequences.

There are more discrepancies that cause severe issues when SEARCHING ISO2022 text for expressions.

Because it depends on which API the program used to serialize the data.

Dominik

# Michael S. Kaplan on 27 Dec 2006 1:22 AM:

Well, of course that is a whole different set of issues, rather than the specific one here, which is about one MultiByteToWideChar problem that is much more serious than just differing tables or implementations (it is a bug leading to input truncation with specific, admittedly bogus, input)....

