ASCII? no questions; I tell UNICODE lies

by Michael S. Kaplan, published on 2007/03/12 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/03/12/1862122.aspx


There have been a few questions coming my way lately about ASCII, such as this managed one from Michael Liu:

Is there a WideCharToMultiByte equivalent in .NET 1.1/2.0 for the ASCII code page? I'd like to perform a best-fit mapping of non-ASCII characters (for a legacy system) but both Encoding.ASCII and Encoding.GetEncoding(20127) simply turn non-ASCII characters into ???????. The following code does what I want, but it uses reflection to access the internal CodePageEncoding type:

String s = "...";
Type t = Type.GetType("System.Text.CodePageEncoding", true);
Encoding e = (Encoding) Activator.CreateInstance(t, new Object[] { 20127 });
s = Encoding.Default.GetString(e.GetBytes(s));

And this unmanaged one from DavRis:

MultiByteToWideChar with US-ASCII specified as the codepage ignores the 8th bit of characters in the input.

So the bytes 0xC1 0xC2 0xC3 are turned into 'ABC' in the output.  This happens even if I pass the flag 'MB_ERR_INVALID_CHARS'.

This seems very odd to me.  I would expect MBtWC to fail on 0xC1 rather than interpret it as if it were 0x41.  Is there a way that this makes sense or a good reason for this?

As code pages go, ASCII is, to be honest, not one that we have ever gone out of our way to do such a hot job with.

All of the implementations, from the pre 2.0 managed to the >= 2.0 managed to the unmanaged across various versions of Windows, have always done okay with actual ASCII data but not so well with data that one is trying to truncate down into ASCII (unless you do your own fallback work in .NET >= 2.0, I mean).
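To make "do your own fallback work" a bit more concrete, here is a minimal sketch of the idea in Python rather than .NET (the language swap is mine, purely so the example is short and runnable). It registers a custom encoding error handler that decomposes each non-ASCII character and keeps whatever ASCII letters are left, falling back to '?' otherwise. The names best_effort_ascii, _fold and to_ascii are hypothetical, and the result is lossy by design, so treat it as an illustration of the technique and not a recommendation.

```python
import codecs
import unicodedata


def _fold(ch):
    """Return the ASCII part of one character's compatibility
    decomposition, or '?' if nothing ASCII survives."""
    decomposed = unicodedata.normalize("NFKD", ch)
    kept = "".join(c for c in decomposed if ord(c) < 0x80)
    return kept or "?"


def best_effort_ascii(exc):
    """Hypothetical error handler: called by the codec machinery with the
    run of characters that could not be encoded."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Fold character by character so each one gets its own replacement.
    replacement = "".join(_fold(ch) for ch in bad)
    # Resume encoding right after the run that was just handled.
    return (replacement, exc.end)


codecs.register_error("best_effort_ascii", best_effort_ascii)


def to_ascii(text):
    """Encode to ASCII bytes using the custom fallback above."""
    return text.encode("ascii", errors="best_effort_ascii")
```

With this handler, "Â" (U+00C2) decomposes to "A" plus a combining circumflex, so only the "A" is kept; a character with no ASCII decomposition at all becomes "?".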

Though if you look at the two questions above, both are using encodings/code pages in a specific way: as a means of filtering or validating text.

My personal advice would be to not try and use an encoding as a text validation scheme -- do the conversion using the actual encoding that the text is in, and then if you need to validate it being in a specific subrange then do that separately. Even in Michael Liu's case where he is dealing with legacy data, it clearly isn't really ASCII. It would be a shame to lose data that way -- to have the data converted to Unicode by lying about where it came from. :-(
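Here is a small sketch of that advice, again in Python for brevity. It assumes, purely for illustration, that the legacy bytes are really Windows-1252 (nothing in the questions says which code page the data is in), and decode_then_check is a hypothetical helper name. It contrasts dropping the 8th bit, which is the behavior DavRis describes, with decoding honestly and then checking the range as a separate step.

```python
def decode_then_check(raw, actual_encoding):
    """Decode with the encoding the data is really in, then report
    separately whether the resulting text happens to be pure ASCII."""
    text = raw.decode(actual_encoding)  # raises UnicodeDecodeError on bad bytes
    is_ascii = all(ord(ch) < 0x80 for ch in text)
    return text, is_ascii


# The three bytes from the unmanaged question.
raw = bytes([0xC1, 0xC2, 0xC3])

# Dropping the 8th bit: 0xC1 & 0x7F == 0x41, and so on.
masked = bytes(b & 0x7F for b in raw).decode("ascii")

# Decoding with the assumed real code page keeps the original letters;
# the ASCII check is then a separate, explicit decision.
text, ok = decode_then_check(raw, "cp1252")
```

Here masked comes out as "ABC", while text is "ÁÂÃ" and ok is False, so the caller can decide what to do with non-ASCII data without the conversion having quietly rewritten it.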

I'd also suggest being careful with reflection on internal classes that are subject to change in future versions -- or at the very least, if such a class does go away due to a change in the internal implementation, not complaining too loudly....

And of course never limit yourself to just ASCII, since that cuts out proper representation of over 99% of the world's languages!

(Shawn might have some additional thoughts here; this post might inspire a riff from him!) 

 

This post brought to you by Â (U+00c2, a.k.a. LATIN CAPITAL LETTER A WITH CIRCUMFLEX)


bg on 12 Mar 2007 9:53 AM:

ASCII? no questions; I tell UNICODE lies > oi.... that has to be the worst one yet!

Shawn Steele - MSFT on 12 Mar 2007 4:47 PM:

http://msdn2.microsoft.com/en-us/library/tt6z1500(VS.80).aspx has a "Fallback Encoding Application Sample" that includes a fallback that uses normalization to get extended best-fit behavior.

Of course you might wanna see http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx "Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided"

Arun Philip on 13 Mar 2007 5:55 AM:

Ow, that's the cheesiest title I've seen in a long while!

~ Phylyp

Michael S. Kaplan on 13 Mar 2007 6:26 AM:

You and bg both thought so, Phylyp -- yet you have to admit it worked out as being quite accurate! :-)

John Cowan on 8 May 2008 2:21 PM:

Nah, it's "ASCII no questions, EBCDIC no lies."  Much better.


referenced by

2012/02/20 Where short file names can fail

2008/05/08 In hindsight, they may have BEST FIT these files where the sun never shines
