In hindsight, they may have BEST FIT these files where the sun never shines

by Michael S. Kaplan, published on 2008/05/08 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/05/08/8466991.aspx


Recently while paying attention to The Unicode List I was once again reminded why I don't pay more attention to The Unicode List. :-)

Specifically it was a thread started by Andreas Prilop:

I refer to
 
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
 http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

In ISO-8859-1, code position 0x90 is mapped to U+0090.
In Windows-1252, code position 0x90 is listed as "undefined".

Why are they treated differently? International Standard ISO/IEC 8859-1 does *not* define code position 0x90. So it might also be listed as "undefined".

Or, for purely practical reasons, 0x90 in Windows-1252 might also be mapped to U+0090.

This different behaviour for undefined code positions may occasionally cause trouble - please see
 
http://lists.w3.org/Archives/Public/www-validator/2008Apr/
 http://lists.w3.org/Archives/Public/www-validator/2008May/
Thread "Fallback to UTF-8".

Richard Wordingham then responded in a message that probably could have resolved the issue (had it not been The Unicode List):

0x90 is defined in the IANA version of ISO-8859-1, which calls up the description in RFC1345.  In a web context, I believe the IANA definition should take precedence over ISO/IEC.

On the other hand, Windows-1252 might be extended again and assign a meaning to 0x90, so it is probably better not to map any Unicode codepoint to that value.

> Or, for purely practical reasons, 0x90 in Windows-1252 might
> also be mapped to U+0090.


Which is reported to be what Windows *currently* actually does.

And Unicode cool guy Ken Whistler put in some thoughts here as well:

> > Why are they treated differently?

Different theory by the maintainers of the two sets of files.

I am the most recent maintainer of record for the 8859-X mapping files posted on the Unicode website. For those I follow the consensus of the UTC that mappings for control code points in the 8859-X family of ASCII-derived encodings to/from Unicode is least problematical if 0x00 <--> U+0000, 0x01 <--> U+0001, etc. This is, in fact, the way that almost all commercial conversions handle the control code conversions for 8859-X character sets.

Since 8859-1.TXT and the other mapping tables posted on the Unicode website are intended to provide practical *mapping* guidelines for implementers, it would be pedantic in the extreme (and counterproductive) to post them up as documentation of the 8859-X standards *without* the control code mappings.

The Microsoft mapping tables are contributed by and maintained by Microsoft, and follow Microsoft standards practice for table definition. 0x00..0x1F are mapped through to U+0000..U+001F, but because most Microsoft code pages contain graphic characters in the 0x80..0x9F range, those characters are mapped, but unassigned code points are simply left #UNDEFINED, as is also the case for Microsoft double-byte code pages. This allows a distinction to be made between that status and #DBCS LEAD BYTE values.

In practice, of course, when actually implementing conversion tables from Microsoft code pages to/from Unicode, nearly all commercial implementations, including Microsoft's, map undefined values in the 0x80..0x9F range (for non-DBCS code pages) to the corresponding Unicode U+0080..U+009F control code character, rather than to U+FFFD.

> > International Standard ISO/IEC 8859-1 does *not* define
> > code position 0x90. So it might also be listed as "undefined".
>
> 0x90 is defined in the IANA version of ISO-8859-1, which calls up the
> description in RFC1345.  In a web context, I believe the IANA definition
> should take precedence over ISO/IEC.


While I agree with the conclusion that for web usage, mappings that map through control codes rather than treating them as undefined is the correct thing to do -- I do so for different reasons.

RFC 1345 is *extremely* dated. It is from 1992, and refers to prepublication versions of 10646. The first edition of 10646 wasn't even published until 1993, and at that point we are talking about a Unicode 1.1-level repertoire. The character mnemonic table in RFC 1345 is full of errors, and the mapping tables for various charsets at the end of RFC 1345 have not been updated to track the updates of the 8859 standards nor the updates in mapping practice for some charsets that resulted from extensions to 10646.

>
> On the other hand, Windows-1252 might be extended again and assign a meaning
> to 0x90, so it is probably better not to map any Unicode codepoint to that
> value.

I disagree. If you do not map U+0090 to 0x90 for Windows-1252, all you are doing is ensuring an interoperability bug both with Windows and with other commercial applications doing conversions.

After that, David Starner and Doug Ewell made contributions pointing out that if one was truly expecting C1 control characters like U+0090 to be in cp1252 then one probably had bigger problems with one's data, anyway.

Mark Davis pointed out that ICU does indeed map 0x90 to U+0090 and vice versa, since "in ICU we always go by what people do, and not what they say.... Windows itself maps 0x90 to U+0090."

Good to know! :-)
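For anyone who wants to see the two theories side by side, here is a minimal sketch in Python, whose built-in cp1252 codec happens to follow the strict CP1252.TXT view. The handler name "c1-passthrough" and the helper functions are made up for this illustration; they are not part of Python, Windows, or ICU, and the handler only imitates the de facto behavior described in the thread.

```python
import codecs


def c1_passthrough(err):
    """Hypothetical error handler: turn an undecodable byte 0xNN into U+00NN."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which decoding resumes.
    return "".join(chr(b) for b in bad), err.end


# Register the handler under an invented name so decode() can find it.
codecs.register_error("c1-passthrough", c1_passthrough)


def decodes_strictly(data: bytes, charset: str) -> bool:
    """True if the built-in codec accepts the bytes without any fallback."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


def decode_cp1252_de_facto(data: bytes) -> str:
    """Decode as cp1252, passing undefined bytes through as C1 controls."""
    return data.decode("cp1252", errors="c1-passthrough")


if __name__ == "__main__":
    print(decodes_strictly(b"\x90", "iso-8859-1"))  # the 8859-1.TXT theory
    print(decodes_strictly(b"\x90", "cp1252"))      # the CP1252.TXT theory
    print(repr(decode_cp1252_de_facto(b"\x80\x90")))
```

The strict codec is the one that gives up on 0x90, much like the validator complaint that started the thread; the handler is one way to get the "what people do" behavior on top of it.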

Andreas came back with one more contribution to explain the reasoning behind the original concern:

The problem was/is: What to do when a byte 0x90 is found in a file that has

(a) erroneously charset=ISO-8859-1
(b) erroneously charset=Windows-1252
(c) no encoding/charset at all specified

Surprisingly, the W3C validator gives up with Windows-1252 but does perform a check with ISO-8859-1.

See the test document  
http://www.unics.uni-hannover.de/nhtcapri/test.htm and follow the links "Validate as ISO-8859-1" and "Validate as Windows-1252".

The validation report with Windows-1252 would be more helpful, in my opinion, if 0x90 in cp1252 is mapped to something - to U+0090 or whatever.

And Richard made the final contribution to date in response to the "surprise" in that last message from Andreas:

It's not surprising at all.  These charset designations have the *IANA* definitions, which are not necessarily identical to international (e.g. ISO-8859 series) or national (e.g. TIS-620) standards.  Thus 0x90 is undefined for Windows-1252 but merely an illegal character for HTML in the IANA definition of ISO-8859-1.

It is funny how these things go, really.

Though this one was mercifully short, at least. :-)

I was surprised (though not surprised enough to extend the thread by volunteering the information!) that even better than the referenced:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

or the nicer, more graphical though functionally equivalent:

http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx

there is an even better reference to look at, one also hosted on the Unicode site:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

This file, along with the rest of the mappings at

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

come directly from Microsoft.

And while they were provided primarily to officially document in public the "best fit" mappings in these code pages, they have additional benefits, as they are the literal source files used to build the code page data files, with no lines removed.

In particular, you can see entries like the following in there for cp1252:

0x81 0x0081
0x8d 0x008d
0x8f 0x008f
0x90 0x0090
0x9d 0x009d

Now compare that with what the file they referenced had:

0x81       #UNDEFINED
0x8D       #UNDEFINED
0x8F       #UNDEFINED
0x90       #UNDEFINED
0x9D       #UNDEFINED

I think it seems much nicer, and it definitely matches the assertion that several people raised about the de facto behavior of the code page.
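As a rough cross-check, the five entries quoted above can be parsed and compared against the bytes that a strict cp1252 table refuses to decode. This is only a sketch: the sample string holds just the five lines shown here rather than the whole bestfit1252.txt file, the function names are invented for the example, and Python's built-in cp1252 codec stands in for the CP1252.TXT style of table.

```python
# Only the five entries quoted in this post, not the full file.
BESTFIT_SAMPLE = """\
0x81 0x0081
0x8d 0x008d
0x8f 0x008f
0x90 0x0090
0x9d 0x009d
"""


def parse_mbtable(text):
    """Parse 'byte codepoint' lines into a dict of byte value -> character."""
    table = {}
    for line in text.splitlines():
        # Drop any trailing ';comment' and split the two hex fields.
        fields = line.split(";", 1)[0].split()
        if len(fields) == 2:
            table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def strict_gaps(charset="cp1252"):
    """List the single-byte values the built-in codec treats as undefined."""
    gaps = []
    for value in range(256):
        try:
            bytes([value]).decode(charset)
        except UnicodeDecodeError:
            gaps.append(value)
    return gaps


FILLED = parse_mbtable(BESTFIT_SAMPLE)

if __name__ == "__main__":
    print([hex(b) for b in strict_gaps()])
    print(sorted(FILLED) == strict_gaps())
```

If the two lists agree, the best fit file fills exactly the holes that the other file leaves #UNDEFINED.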

These files also define all the "best fit" mappings, so that (for example) the cp1252 file lists all of the following:

0x0041 0x41 ;Latin Capital Letter A
0x0100 0x41 ;Latin Capital Letter A With Macron
0x0102 0x41 ;Latin Capital Letter A With Breve
0x0104 0x41 ;Latin Capital Letter A With Ogonek
0x01cd 0x41 ;Latin Capital Letter A With Caron
0x01de 0x41 ;Latin Capital Letter A With Diaeresis And Macron
0xff21 0x41 ;Fullwidth Latin Capital Letter A

which means that all of these characters, and not just U+0041 (the only one that round-trips), have somewhere to map to. It is why the WCTABLE has 698 entries even though the MBTABLE only has 256. :-)
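To make the many-to-one idea concrete, here is a small sketch that parses the WCTABLE lines quoted above and uses them to encode a string. It assumes only those seven lines (the real table has 698), and parse_wctable, best_fit_encode, and the "?" default are names and choices made up for this example rather than anything Windows exposes under those names.

```python
# Only the seven WCTABLE lines quoted in this post.
WCTABLE_SAMPLE = """\
0x0041 0x41 ;Latin Capital Letter A
0x0100 0x41 ;Latin Capital Letter A With Macron
0x0102 0x41 ;Latin Capital Letter A With Breve
0x0104 0x41 ;Latin Capital Letter A With Ogonek
0x01cd 0x41 ;Latin Capital Letter A With Caron
0x01de 0x41 ;Latin Capital Letter A With Diaeresis And Macron
0xff21 0x41 ;Fullwidth Latin Capital Letter A
"""


def parse_wctable(text):
    """Parse 'codepoint byte ;name' lines into a dict of code point -> byte."""
    table = {}
    for line in text.splitlines():
        fields = line.split(";", 1)[0].split()
        if len(fields) == 2:
            table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def best_fit_encode(text, table, default=b"?"):
    """Encode text one character at a time, using the default byte on a miss."""
    out = bytearray()
    for ch in text:
        byte = table.get(ord(ch))
        out += bytes([byte]) if byte is not None else default
    return bytes(out)


BEST_FIT = parse_wctable(WCTABLE_SAMPLE)

if __name__ == "__main__":
    # Seven different code points all land on the single byte 0x41.
    print(best_fit_encode("A\u0100\u0102\u0104\u01cd\u01de\uff21", BEST_FIT))
```

Note that the trip is one-way: decoding 0x41 gives back U+0041 and nothing else, which is why only that one entry counts as a round trip.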

Maybe I should have volunteered the information. :-)

But The Unicode List just plain scares me.

Even more than clowns do, and you know how creepy clowns are....

Now I have talked about best fit mappings in Windows code pages in all of the following blogs in this Blog:

and more -- I just got tired of putting in links. :-)

Anyway, if one uses these "hidden" yet publicly posted files, the guesswork and reverse engineering requirements are gone, as is the idle speculation. You can just look it up if you want the answer!

Though on The Unicode List I am sure that idle speculation will continue, on other topics if not this one....

 

This blog brought to you by U+0090, of course!


# Andrew West on 8 May 2008 5:34 PM:

I'm not sure about the Unicode list, but I'm with you 100% on the clown thing.

# Zooba on 8 May 2008 5:35 PM:

Unless you've snipped out part of the original post, the question is very unclear. Personally I prefer to clarify (or abuse, depending on how unclear) before I move on to a long answer.

Also, http://www.grods.com/post/2472/ can help you with your clown problem :-)

# Michael S. Kaplan on 8 May 2008 5:54 PM:

Actually, I have found clowns to be creepy since before I ever saw that Seinfeld episode (which was funny but the whole "crazy clown" thing is not exactly why I think they are creepy).

But they really are creepy, kind of the way I'd find sexual harassment panda to be if he were not a fictional creature...


referenced by

2012/02/20 Where short file names can fail
