Documentation does not always imply existence

by Michael S. Kaplan, published on 2007/08/12 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/11/4340636.aspx


You ever hear those fun stories about people the fast food drive-thru had to serve because someone left the outside sign's lights on?

Well, software doesn't really work that way.

So documentation does not always imply existence; sometimes, documentation just implies doc bugs! 

Like the other day over in the microsoft.public.dotnet.internationalization newsgroup, when Bob Bins asked:

If you look at the MSDN sample for the Encoding.GetEncoding Method (Int32), it shows the sample code below...

      // Get a UTF-32 encoding by codepage.
      Encoding e1 = Encoding.GetEncoding( 65005 );

      // Get a UTF-32 encoding by name.
      Encoding e2 = Encoding.GetEncoding( "utf-32" );

      // Check their equality.
      Console.WriteLine( "e1 equals e2? {0}", e1.Equals( e2 ) );


The problem is that the first line throws an exception:
System.NotSupportedException Additional information: No data is available for encoding 65005.

If I look at the code page property when using the "utf-32" string it says 12000.

Which one is correct?  Why does the sample show 65005 as utf-32 when the function thinks 12000 is the utf32 codepage?

That is indeed the sample I found for the Encoding.GetEncoding Method (Int32), as Bob indicated.

Of course, if you look at the UTF32Encoding class documentation, it pretty clearly says what the code page values are:

UTF32Encoding corresponds to the Windows code pages 12000 (little-endian byte order) and 12001 (big-endian byte order).
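So a version of that MSDN sample that actually runs would look something along these lines (just a quick sketch, swapping in the 12000 value that the UTF32Encoding topic describes):

      using System;
      using System.Text;

      class Utf32ByCodePage
      {
          static void Main()
          {
              // Get a UTF-32 encoding by code page; 12000 is the value the
              // Encoding class actually registers for little-endian UTF-32
              // (12001 is the big-endian flavor).
              Encoding e1 = Encoding.GetEncoding(12000);

              // Get a UTF-32 encoding by name; "utf-32" resolves to the
              // same little-endian encoding.
              Encoding e2 = Encoding.GetEncoding("utf-32");

              // Check their equality, and confirm the code page Bob saw.
              Console.WriteLine("e1 equals e2? {0}", e1.Equals(e2));
              Console.WriteLine("e2.CodePage = {0}", e2.CodePage);
          }
      }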

Calling them Windows code pages is a bit of a misnomer, though, since no version of Windows that has ever shipped recognizes those code page values as valid. I actually mentioned them myself a couple of years ago in Not every code page value is supported, and the title alone makes it clear that not every code page value one might see is one the operating system is going to recognize.
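And if you want to ask Windows itself, IsValidCodePage is the quickest way to do it -- a little P/Invoke sketch, with the results one would expect given the point above:

      using System;
      using System.Runtime.InteropServices;

      class CheckCodePages
      {
          [DllImport("kernel32.dll")]
          static extern bool IsValidCodePage(uint codePage);

          static void Main()
          {
              // 65001 (UTF-8) is a real Windows code page; 12000, 12001
              // and 65005 are not, so Windows should say no to those.
              foreach (uint cp in new uint[] { 65001, 12000, 12001, 65005 })
              {
                  Console.WriteLine("{0}: {1}", cp, IsValidCodePage(cp));
              }
          }
      }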

Now this problem seems pretty widespread in the docs (just search MSDN for that 65005 code page value to see what I mean -- there are enough instances of it to make the problem seem systemic!), but the most interesting one is in the Encoding.WindowsCodePage Property topic, which has a huge list, disguised as a code comment, containing several of these incorrect values. Understanding where that list came from might be a worthwhile exercise.
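A quick way to see which code page values the Encoding class will actually accept (and that 65005 is not among them) is simply to ask it -- another small sketch:

      using System;
      using System.Text;

      class ListEncodings
      {
          static void Main()
          {
              // Dump every code page the runtime actually registers;
              // 12000 and 12001 show up in this list, 65005 does not.
              foreach (EncodingInfo info in Encoding.GetEncodings())
              {
                  Console.WriteLine("{0,-6} {1}", info.CodePage, info.Name);
              }
          }
      }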

In any case, as MVP Mihai Nita pointed out, using the UTF32Encoding class directly is a much better idea for getting UTF-32, since that way both the byte order (endianness) and the BOM behavior can be more readily tailored...
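To make that concrete, the UTF32Encoding constructors take the endianness and BOM choices directly, plus an optional flag to throw on invalid data rather than quietly substituting U+FFFD -- a small sketch:

      using System;
      using System.Text;

      class Utf32Flavors
      {
          static void Main()
          {
              // Little-endian, no BOM in the preamble.
              UTF32Encoding leNoBom = new UTF32Encoding(false, false);

              // Big-endian, BOM in the preamble, throw on invalid data.
              UTF32Encoding beStrict = new UTF32Encoding(true, true, true);

              Console.WriteLine(BitConverter.ToString(leNoBom.GetBytes("A")));   // 41-00-00-00
              Console.WriteLine(BitConverter.ToString(beStrict.GetBytes("A")));  // 00-00-00-41
              Console.WriteLine(BitConverter.ToString(beStrict.GetPreamble()));  // 00-00-FE-FF
          }
      }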

 

This post brought to you by ż (U+017c, a.k.a. LATIN SMALL LETTER Z WITH DOT ABOVE)


Mihai on 13 Aug 2007 2:23 PM:

Actually, having UTF-32LE and UTF-32BE as Windows code pages (usable with MultiByteToWideChar/WideCharToMultiByte) would be handy. Same for UTF-16BE, come to think of it :-)

I mean, UTF-8 is there, so why not other "UTFs"?


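For reference, the UTF-8 half of what Mihai describes really does go through MultiByteToWideChar with code page 65001 -- a tiny P/Invoke sketch, using this post's featured ż as the test character (the point being that there is no UTF-32 value one could pass in its place):

      using System;
      using System.Runtime.InteropServices;

      class Utf8ViaWin32
      {
          const uint CP_UTF8 = 65001;

          [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
          static extern int MultiByteToWideChar(uint codePage, uint flags,
              byte[] bytes, int byteCount, char[] chars, int charCount);

          static void Main()
          {
              byte[] utf8 = { 0xC5, 0xBC };   // ż encoded as UTF-8
              char[] chars = new char[2];
              MultiByteToWideChar(CP_UTF8, 0, utf8, utf8.Length, chars, chars.Length);
              Console.WriteLine("U+{0:X4}", (int)chars[0]);   // U+017C
          }
      }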
