Every character has a story #16: U+0084

by Michael S. Kaplan, published on 2005/12/16 17:15 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/16/504851.aspx


A few days ago, David Anson (a developer at Microsoft who I swear I saw here on an occasional email as long ago as late 1998? Maybe I am mis-remembering) sent me the following piece of mail:

I know you blog on funky Unicode characters, but MSN Search didn't find a blog on this guy yet. I've got a text analysis tool I maintain that I've just patched with the following comment:

// Per http://www.unicode.org/charts/PDF/U0080.pdf, character \u0084 is
// a C1 Control (IND/Index) and causes rendering of text it is in to
// end prematurely. It needs to be substituted with a harmless
// replacement in order to avoid partial line rendering. Testing with
// "All char values.txt" suggests this is the only such character.

What's up with that?!? :)
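The kind of patch David describes can be sketched in a few lines. This is only an illustration: his tool is managed code and its source is not shown here, so the Python below, the function name `replace_ind`, and the choice of U+FFFD as the "harmless replacement" are all assumptions.

```python
# Hypothetical sketch of the substitution described in the comment above.
# The replacement character (U+FFFD) is an assumption; the original mail
# does not say which replacement was used.
IND = "\u0084"  # the C1 control named in the comment


def replace_ind(text: str, replacement: str = "\uFFFD") -> str:
    """Return text with every U+0084 swapped for a replacement string."""
    return text.replace(IND, replacement)


sample = "before" + IND + "after"
cleaned = replace_ind(sample)
print(ascii(cleaned))  # prints 'before\ufffdafter'
```

Replacing the character rather than deleting it keeps the string length stable, which may matter if the tool reports character offsets.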

Now David also mentioned that his tool was a managed one, although after looking into this a bit I might have guessed anyway (I will explain why in a moment).

Unicode of course has little to say about either the C0 or the C1 control characters, since it inherited them from the ISO-8859 standards, each of which includes these characters.

Now the ISO-8859-* standards (per the ISO site, they each seem to cost 64.00 CHF -- 49.41 USD by today's fix -- so no, I am not going to go purchase all 15 of the ones that are available) are probably not worth purchasing anyway if you want more information on the control characters -- they do not have any, really.

Perhaps the ISO-6429 standard (Control functions for coded character sets) would help for most characters, but don't waste the 188.00 CHF (145.72 USD by today's fix) because ISO-6429 includes the names for all of the control characters with the exception of U+0080, U+0081, U+0084, and U+0099. So there is no help there, either.

Finally, after some time pounding the Internet beat I did find several of the IND and Index entries (places like decodeunicode.org), but none of them really explained even a meaning for the name, let alone a source.

So I took a step back and thought about David's claim -- that it seemed to act in some way to break the line in a Microsoft product -- and took a look at some Microsoft products. :-)

I used the string of characters U+0080 - U+0099 (most of the C1 control characters, which run through U+009F): "€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™"
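A test string like this one can be rebuilt programmatically, which avoids depending on how any particular page or editor displays the raw control characters. The following is a minimal sketch; the variable name `test_string` is just for illustration, and the range U+0080 through U+0099 is the one used for the test string here.

```python
import unicodedata

# U+0080 through U+0099 inclusive, the range used for the test string.
test_string = "".join(chr(cp) for cp in range(0x80, 0x9A))

for ch in test_string:
    # Control characters have no formal Unicode character name, so a
    # fallback value is supplied instead of letting name() raise.
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
```

Every character in the range is reported with general category `Cc` (control), including U+0084.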

In the textbox into which I am typing right now, U+0085 is coming up as a blank and the rest come up as NULL GLYPHS -- there is definitely no line break at U+0084.

In Notepad, it is all NULL GLYPHS.

I'll try WordPad -- that looks like my earlier textbox experience.

How about Word 2003? Same thing.

Ok, it might be time to give up on hints here, time to look at some source....

(a bit of time passes as I am looking)

Ok, I found something!

There is a collection of characters in GDI+ (of which U+0084 is a part), each of which is treated as a linebreak opportunity when a linebreak is needed. The code even has a name for it -- WCH_LINEBREAK. And an alternate name -- wchEndLineInPara.
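For comparison only, here is how a different runtime classifies the same characters. This says nothing about the GDI+ code itself: it is a small sketch using Python's documented `str.splitlines`, which counts U+0085 among its line boundaries but not U+0084.

```python
# Comparison sketch: Python's str.splitlines, not GDI+.
with_ind = "first" + "\u0084" + "second"  # U+0084 between two words
with_nel = "first" + "\u0085" + "second"  # U+0085 between two words

ind_lines = with_ind.splitlines()
nel_lines = with_nel.splitlines()

print(len(ind_lines))  # prints 1: U+0084 is not a line boundary here
print(len(nel_lines))  # prints 2: U+0085 is a line boundary here
```

So code that sees a break at U+0084 is making a choice that not every text stack makes, which fits the impression that the GDI+ behavior is intentional.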

Now the GDI+ behavior and the code seem a bit too intentional to be accidental. So although I am still not sure about the source, I am sure I have written enough to call it a character story! :-)

 

This post brought to you by "„" (U+0084, a.k.a. any of the above names depending on your preference)


# Michael S. Kaplan on 16 Dec 2005 7:11 PM:

David has pointed out to me that the behavior in GDI+ seems to be a lot more of an actual linebreak than a 'linebreaking opportunity'....

Sounds like a reasonable correction under the circumstances. :-)

# AndyM on 19 Dec 2005 8:53 AM:

If the ISO standards are a little pricey for casual use, ECMA International makes its versions of the standards available for free online:

http://www.ecma-international.org/publications/standards/Standard.htm

# John Cowan on 12 Mar 2006 6:46 PM:

The main point of U+0084 and U+0085 is to discriminate between the two historic uses of U+000A, line feed and new line respectively.   U+0084 was intended unambiguously as the line feed function, whereas U+0085 was intended unambiguously as the new line function.  Of course it didn't help.  (This is not a Unicode issue, actually.)
