What is wrong with that web page?

by Michael S. Kaplan, published on 2005/10/28 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/28/486034.aspx


The other day, my manager's manager's manager Delan was looking at various web sites and on one of them there was an unusual display issue on several pages.

Basically, each of the pages in question had text on them that looked like this:

"?/font>"

Very odd, and maybe a little frustrating too.

Of course, when you are the head of Globalization Infrastructure, Fonts, and Tools, what better place to start than with some of the font experts on your team? I mean, clearly something was messing with a font tag....

In actuality it wasn't a font issue. After a few hours, the net was widened a bit and I ended up on the mail thread. It kind of reminded me of my MSLU days, when having just a few characters wrong in a string meant it had been converted using the wrong code page. So based on that, I responded thusly:

I would suggest looking at the source on the page to see what might be next to those font tags, and check the IE detected encoding to see if it matches the page's encoding -- it may be a CJK font name that is being misunderstood and combined with the byte of the less than sign.

The page was sent on to me. So I set the encoding in IE (which for me was going through AutoDetect and deciding the page was in Windows 1252) to Chinese Traditional (Big5), and suddenly, in all of the news items that had bullets (0x95 or U+2022) wrapped in <font> tags, the bytes of the bullet and the less than sign turned into a question mark.

Now as it turns out she was actually having the page Auto Detected as being Chinese Simplified (GB2312), but the results were the same -- U+2022 U+003c (•<) which for me was being read as 0x95 0x3c was for her being converted to "?" (since 0x953c is undefined on both code pages 936 and 950: in the former it is a lead byte with an illegal trail byte, and in the latter an unused lead byte with no assigned trail bytes).
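To make the byte-level picture concrete, here is a minimal sketch in Python. It assumes Python's built-in `cp1252`, `cp936`, and `cp950` codecs are reasonable stand-ins for the Windows code pages discussed above; the helper name `try_decode` is made up for illustration. Python's codecs report an invalid sequence as an error, where the conversion IE used substituted "?", so this only shows which interpretations are valid. It does not reproduce the browser's output.

```python
# The two bytes from the page: a bullet (0x95 in Windows-1252) followed by "<" (0x3c).
RAW = b"\x95<"


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding)  # strict mode: raises on illegal sequences
    except UnicodeDecodeError:
        return None


# Assumption: Python's codec names stand in for the Windows code pages.
for name in ("cp1252", "cp936", "cp950"):
    print(name, repr(try_decode(RAW, name)))
```

Only the `cp1252` interpretation yields the bullet followed by the less than sign. The other two code pages reject the sequence outright.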

The page itself:

http://local.msn.com/t3/?zip=93301

had no charset meta tag and clearly the server was not communicating the charset. We both had the AutoDetect checkbox set (IE6 for me and IE7 for her), but clearly it was not detecting much to distinguish the page from the bias of our own individual locale settings.

Wouldn't the illegal sequence have been a good indication that the AutoDetect guess was wrong? And isn't the lack of any charset bad too? And that lack of other communication about the charset from the server?
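The first question can be sketched as a simple filter: try each candidate encoding strictly and discard any in which the bytes are illegal. This is only an illustration of the idea, not a description of how IE's AutoDetect works. The function name `plausible_encodings` and the sample markup are invented for the example, and Python's codecs are assumed to approximate the Windows code pages.

```python
def plausible_encodings(data: bytes, candidates):
    """Keep only the candidate encodings in which every byte sequence is valid.

    Hypothetical helper; a real detector would weigh much more evidence.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on illegal sequences
        except UnicodeDecodeError:
            continue  # an illegal sequence rules this guess out
        survivors.append(name)
    return survivors


# Made-up fragment resembling the page's markup: a 0x95 bullet inside a font tag.
SAMPLE = b"<font size=2>\x95</font> Local news headline"

print(plausible_encodings(SAMPLE, ["cp936", "cp950", "cp1252"]))
```

With this sample, the two CJK guesses are eliminated by the 0x95 0x3c sequence and only `cp1252` survives.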

Of course it was a page from MSN, so I figure we found at least three bugs in various Microsoft offerings from the exercise, which was actually a lot of fun, too! :-)

 

This post brought to you by "•" (U+2022, a.k.a. BULLET)


# CornedBee on 28 Oct 2005 5:39 AM:

Lesson learned: validate!
Because the W3C validator catches missing encoding declarations (which are misnamed charset declarations).

# Will on 28 Oct 2005 7:06 AM:

Of course, I feel bound to make some criticism of a layout which has list bullets wrapped in font tags.

<UL> is hardly a difficult concept.

# Ben Bryant on 28 Oct 2005 7:43 AM:

I always love your blog, but I found this confusing: "U+2022 U+003c (•<) which for me was being read as 0x95 0x3c was for her being converted to "?" " Once I realized that the U+2022 is 0x95 in Windows-1252 then it was clear that for *both* of you the bytes you were seeing were 0x95 0x3c. So you were both trying to interpret 0x95 0x3c in your respective double byte encodings (e.g. 0x95 is a lead byte and 0x3c is not a valid trail byte in either, therefore the question mark).
To say that you have U+2022 U+003c in the page is incorrect since it is really a matter of guessing. It is also misleading to use the Unicode way of expressing those characters because no Unicode encoding is involved AT ALL until the browser's internal representation of the page. Those are the intended characters, yes, but I'm just trying to explain why I found your explanation confusing.

# Maurits [MSFT] on 28 Oct 2005 2:20 PM:

http://www.w3.org/TR/REC-html40/sgml/entities.html
<!ENTITY bull CDATA "&#8226;" -- bullet = black small circle,
U+2022 ISOpub -->

# Jimmy on 31 Oct 2011 8:30 AM:

It could be a bug in IE or simply an error by the coder. I wonder why people still use the old way to style pages, though. CSS is a lot more efficient and it can eliminate almost all formatting errors.

Jimmy from learnhowtomakearesume.com

