100% roundtrip ASCII? 100% roundtrip ANSI?

by Michael S. Kaplan, published on 2005/11/23 04:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/23/495193.aspx

Back in January I was talking about the new compiler error C4819 and how the compiler detected invalid characters.

And anyone who has been reading here knows that the reverse solidus is always the path separator, even when it looks like a yen or a won.

So among the so-called 'ANSI' code pages, ASCII (0x00 - 0x7f) will roundtrip 100% of the time.

How many "invalid" slots are there in the 'ANSI' code pages in the 0x80 - 0xff range, exactly?

Let's take a look at the Windows code pages:

874 (Thai) - 31 invalid
932 (Japanese Shift-JIS) - 5 invalid, 45 lead bytes, 15 reserved lead bytes = up to 65 invalid
936 (Simplified Chinese GBK) - 1 invalid, 126 lead bytes, 0 reserved lead bytes = up to 127 invalid
949 (Korean) - 2 invalid, 124 lead bytes, 2 reserved lead bytes = up to 128 invalid
950 (Traditional Chinese Big5) - 2 invalid, 87 lead bytes, 39 reserved lead bytes = up to 128 invalid
1250 (Central Europe) - 5 invalid
1251 (Cyrillic) - 1 invalid
1252 (Latin I) - 5 invalid
1253 (Greek) - 17 invalid
1254 (Turkish) - 7 invalid
1255 (Hebrew) - 23 invalid
1256 (Arabic) - 0 invalid
1257 (Baltic) - 12 invalid
1258 (Vietnam) - 9 invalid
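
For the single-byte code pages, a count like the ones above can be approximated with a few lines of Python. This is only a sketch: it assumes Python's bundled `cp125x` codecs, which follow the published mapping tables and may not match what the Windows conversion APIs actually do. The helper name `count_invalid_high_bytes` is made up for illustration. It is not meant for the double-byte code pages, where a lone lead byte fails to decode by itself (hence the "up to" figures).

```python
def count_invalid_high_bytes(codec: str) -> int:
    """Count bytes in 0x80-0xFF that the codec refuses to decode on their own.

    Assumption: Python's codec tables stand in for the Windows code page.
    """
    invalid = 0
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            invalid += 1
    return invalid


if __name__ == "__main__":
    # A few of the single-byte code pages from the list above.
    for name in ("cp1251", "cp1252", "cp1256"):
        print(name, count_invalid_high_bytes(name))
```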

There you have it. Code page 1256 is the only one that is guaranteed to be able to roundtrip every single code point without losing any of the bytes....
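
To make "roundtrip" concrete, here is a minimal sketch that pushes all 256 byte values through a code page and back and checks whether the same bytes come out. It again assumes Python's built-in codecs rather than the Windows APIs, and `survives_roundtrip` is a hypothetical helper name. Undecodable bytes are replaced rather than raising, which is roughly what happens when data is quietly lost.

```python
ALL_BYTES = bytes(range(256))


def survives_roundtrip(data: bytes, codec: str) -> bool:
    """Decode bytes to text and encode back; True if nothing was lost.

    errors="replace" swaps anything the codec cannot handle for a
    placeholder instead of raising, so losses show up as a mismatch.
    """
    text = data.decode(codec, errors="replace")
    return text.encode(codec, errors="replace") == data


if __name__ == "__main__":
    print("cp1256:", survives_roundtrip(ALL_BYTES, "cp1256"))
    print("cp1252:", survives_roundtrip(ALL_BYTES, "cp1252"))
```

With Python's tables, code page 1252 fails this check because its five invalid slots come back as a replacement character, while pure ASCII input survives in either code page.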

This post brought to you by "¿" (U+00bf, INVERTED QUESTION MARK)

# Ben Bryant on 23 Nov 2005 10:54 AM:

100% roundtrip ANSI is a novel concept -- I am not sure of the purpose, but anyway I noticed that by saying "up to" you glossed over the fact that the double-byte sets are hard to quantify the same way as the single-byte ones. That skips how the lead bytes depend on the trail bytes, and how a byte like 0x30 that looks like ASCII can fail the round-trip if it is a trail byte. (geeks!)

# Michael S. Kaplan on 23 Nov 2005 11:08 AM:

The 100% roundtrip question comes up in ways to encode bytes without having to worry about what someone may or may not consider invalid (UTF-8 has rules as does UTF-16, and they get stricter all the time). Nice that the Arabic code page stands before us as a safe haven from all that meddling. :-)

I suppose I could have written a program to go through each of the trail bytes for those lead bytes and gotten the % of invalids per lead byte and maybe averaged them, but it seemed like a strange exercise that may not yield much additional info.

In the end, I guess I was mostly being lazy. Vacation does that....

# Maurits [MSFT] on 23 Nov 2005 11:27 AM:

You say that 1252 has 1 invalid character. I count 5.

# Michael S. Kaplan on 23 Nov 2005 11:40 AM:

Whoops, you are correct -- sorry, small transcription error!

# Mihai on 23 Nov 2005 12:26 PM:

I am not sure I understand.
Roundtrip to what?

# Nick Lamb on 24 Nov 2005 5:27 AM:

So, to use UTF-8 with the current/next Microsoft compiler you need to tell the OS that your locale uses Arabic codepage 1256 ?

And that's a "cool feature" ?

# Michael S. Kaplan on 24 Nov 2005 9:25 AM:

Nick, huh? That is not what I said.

What I am saying is that there are times when a person may not be sure of the code page. If you are not, then assuming it is UTF-8 and converting is guaranteed to cause problems -- because illegal sequences cannot be emitted. But cp1256 is a code page you can safely roundtrip any byte sequence through without losing any bytes because all bytes are legal there.

Obviously if you know the code page you do not need this. So the times that this is needed will hopefully be rare. But most people roundtrip data through code page 1252 when they have to do this sort of thing, which is incredibly dangerous since you can actually lose information!

# Nick Lamb on 24 Nov 2005 1:18 PM:

Maybe I'm jumping ahead here...

1. C4819 is generated for input which contains "invalid characters"

2. Windows isn't really UTF-8 capable because, well basically because Windows wasn't very well designed twenty years ago.

3. So to avoid C4819 you need a locale where all your 8-bit data, which Windows can't conceive of as Unicode, is "valid" even if meaningless.

4. In this post we find out that the locale needed is cp1256, Arabic.

# Medinoc on 6 Sep 2011 8:04 AM:

The short internal links are dead...

# Michael S. Kaplan on 6 Sep 2011 10:26 AM:

Yes, six years later, they moved everything. You have to go to the new site to find them now....


