by Michael S. Kaplan, published on 2007/08/28 15:25 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/28/4616941.aspx
No, this post is not to do with the phenomenon sometimes referred to as 'beer goggles' in any way, shape, or form!
(by the way, if you search for that term on Google, would that make it become 'beer googles'?)
The other day Scott asked:
I'm working on a really specialised text editor that is used for text from all around the world. To do this we are using Uniscribe to convert text to glyphs etc etc. Pretty normal stuff. We do weird stuff with the glyphs in a printer driver later on!
However today I'm looking at Bengali, in particular Bengali (Bangladesh), and I found a weirdness in IE that you might be interested in.
I have been cut-and-pasting text from webpages into my editor to validate that I'm working OK. I have found an issue that is in my editor and in notepad!
If you look at the webpage:
If I cut and paste the text into notepad it loses its character order and becomes junk, but what's more, if I save the web page locally and reopen it in IE it turns to junk!
I can fiddle with the character order manually to sort things out again, but that's not the point really!
Keep up the good blog work!
Interesting, it does indeed contain text that looks good:
until you try to put it somewhere else (at which point you get lots of dotted circles and such). Very odd!
I went down the hall to talk to Simon Daniels.
Like many people such as Raymond Chen and even myself sometimes, Simon is cursed with the burden of knowing stuff. And the problem with knowing stuff is that people will just randomly want to ask you stuff....
Anyway, he immediately realized what was probably going on. He viewed the source, got the link to the CSS file that was being used, and looked at it:
/* Embeded Font */
<!-- /* $WEFT -- Created on 7/16/2007 -- */
font-family: Bangsee Alpona;
<!-- /* $WEFT -- Created on 7/17/2007 -- */
And of course .EOT files created by WEFT (Web Embedding Fonts Tool) actually have the site that the .EOT was generated for embedded in them. Changing the link to remove the "www", so that the link didn't work, showed very different results:
(if you look very carefully you will see lots of dotted circles spread throughout)
In the end, proper font creation following the rules that have been established in OpenType (e.g. this one for Bengali) is crucial. If the fonts you use don't follow those rules then you have to encode the text to match the expectation of the fonts, and then you have strange behavior any time the font in question is not available to you.
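To make the dotted circles a little less mysterious, here is a minimal Python sketch of one rough heuristic: flag any combining mark that has no letter (or other mark) right before it. The function name and the two sample strings are my own illustrations, and this is only an approximation of the idea, not what Uniscribe or any real shaping engine actually does.

```python
import unicodedata

def orphaned_marks(text):
    """Return (index, Unicode name) for each combining mark with no base before it.

    Rough heuristic only: a mark counts as orphaned when it starts the text
    or follows something that is neither a letter nor another mark.
    """
    found = []
    prev = None
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        prev_is_base = prev is not None and unicodedata.category(prev)[0] in ("L", "M")
        if is_mark and not prev_is_base:
            found.append((i, unicodedata.name(ch, "UNKNOWN")))
        prev = ch
    return found

# Hypothetical samples: the same two characters in two different orders.
logical = "\u0995\u09C7"  # BENGALI LETTER KA + VOWEL SIGN E (standard Unicode order)
visual = "\u09C7\u0995"   # VOWEL SIGN E + KA (the order the glyphs appear on screen)

print(orphaned_marks(logical))
print(orphaned_marks(visual))
```

Text stored in the second order can look fine with a font built to expect it, but a font that follows the OpenType rules will see a vowel sign with nothing to attach to.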
Now in fairness to the Bangsee Alpona font, it may be a perfectly valid one at this point. Perhaps the version that was used to generate the .EOT files was from before the various changes within Unicode, and then later at Microsoft, to support the language properly -- and perhaps the editor for the content has some of the same problems -- so new content is created using this slightly different use of Unicode that is not the standard (thus creating text that will not always look right if you try to copy and paste it somewhere else that may not have the font):
১২ োসেੳটਹর োথেক রাজৈনিতক দেলর সেਔ অােলাচনা ੂরઔ
One of the reasons for the effort to provide a standard solution within Unicode is to keep under control the multiple contradictory methods of getting the rendering done, which is clearly what happens here....
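If you want to poke at a string like the one above yourself, a small Python sketch along these lines will do. It counts characters by the first word of their Unicode character name, which I am using here as a stand-in for the script (the standard library does not expose the real Script property, so treat that as an assumption). The function name is hypothetical.

```python
import unicodedata
from collections import Counter

def script_prefixes(text):
    """Count non-space characters by the first word of their Unicode name.

    The first word (BENGALI, GURMUKHI, ...) is only a proxy for the script.
    """
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    return counts

# The sample string from the post, pasted as-is.
sample = "১২ োসেੳটਹর োথেক রাজৈনিতক দেলর সেਔ অােলাচনা ੂরઔ"

for prefix, n in script_prefixes(sample).most_common():
    print(prefix, n)
```

Anything other than a single prefix in the output is a hint that the text is borrowing code points from more than one script.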
This post brought to you by অ (U+0985, a.k.a. BENGALI LETTER A)
# Deepak on 28 Aug 2007 11:40 PM:
Interesting. The particular string you chose towards the end isn't even completely Bangla. The last character is Gujarati and there are a few Gurmukhi characters in between! I see the same phenomenon on the screenshot of the webpage without the EOT file too. I think we have some encoding weirdness going on here.
# Michael S. Kaplan on 29 Aug 2007 12:19 AM:
Does the string mean anything? I kinda picked that one near the top at random....
# Deepak on 29 Aug 2007 6:37 AM:
Nope. Just a bunch of random characters. That too from more than just Bangla! The encoding seems to be borrowing characters generously from other Indic scripts.