Every character has a story #23: U+00ad (SOFT HYPHEN)

by Michael S. Kaplan, published on 2006/09/02 13:34 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/02/736881.aspx


Last night I saw upon the stair
A little hyphen that wasn't there.
It wasn't there again today;
Oh how I wish he'd go away!

The SOFT HYPHEN has a long if not entirely distinguished history.

It starts back in ISO 8859-1, which puts it at 0xAD, and in a rare exception to the usual practice of not explaining semantics of the encoded characters, it spends a bit of time talking about the soft hyphen, saying:

A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen, for use when a line break has been established within a word.

As you can see, we are already in trouble here. It is a graphical character with a visible defined glyph that is usually invisible and which impacts line break, a formatting operation.

And of course beyond the sloppiness in the definition there is the fact that it is usually unreasonable to assume that a person would type in this character explicitly. Clearly it is a better answer to have per-language dictionaries that contain hyphenation rules, as the SOFT HYPHEN "do it yourself" principles are simply not going to work in practice.

The HTML 4.0 spec has its own content on the soft hyphen. In section 9.3.3. (Hyphenation) the following text is provided:

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)
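To make the quoted semantics concrete, here is a rough Python sketch (not any browser's actual algorithm). It wraps text greedily, shows a hyphen only where a line is actually broken at a soft hyphen, and ignores soft hyphens when searching. The names `wrap_with_shy` and `find_ignoring_shy`, and the greedy strategy itself, are assumptions made up for this illustration.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def wrap_with_shy(text, width):
    """Greedy wrap: break at spaces or soft hyphens; show '-' only at a break."""
    lines, current = [], ""
    for word in text.split():
        fragments = word.split(SHY)
        for index, fragment in enumerate(fragments):
            continues_word = index > 0             # preceded by a soft hyphen
            has_more = index < len(fragments) - 1  # followed by a soft hyphen
            glue = "" if (continues_word or not current) else " "
            # Reserve one cell for the hyphen that would be displayed if the
            # line ended up being broken right after this fragment.
            reserve = 1 if has_more else 0
            if current and len(current) + len(glue) + len(fragment) + reserve > width:
                # Break the line; a break inside a word displays a hyphen.
                lines.append(current + ("-" if continues_word else ""))
                current = fragment
            else:
                current += glue + fragment
    if current:
        lines.append(current)
    return lines


def find_ignoring_shy(haystack, needle):
    """Search as if soft hyphens were absent from both strings."""
    return haystack.replace(SHY, "").find(needle.replace(SHY, ""))
```

With a width of 10, "the hy&shy;phen&shy;ation rule" comes out as "the hy-", "phenation", "rule"; with a width of 80 it is the single line "the hyphenation rule" with no hyphen displayed at all.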

Ok, so once again we have a graphic character that is actually a formatting character tied up with line breaking. And the text seems pretty ambivalent about how a browser might be expected to interpret the soft hyphen -- clearly it is not some terrible sin if it does not do special line breaking behavior.

Doesn't &shy; sound like the perfect character entity reference for a character that may or may not be visible and which, even if visible, should be ignored for searching and sorting? The little bugger even sounds shy!

Now if you look at the ECMA 94 standard, available online for free here, it does have a wording that is almost the same as 8859-1's, and the difference may be striking to some but it did not impact me as much. Perhaps it is edging more towards the formatting role of the character....

At this point, before I jump into Unicode, I'll mention that the other day Tihiy asked in the Suggestion Box:

Can you explain why Charmap refuses to display characters with 0xAD code?

Well, given the rules surrounding SOFT HYPHEN and the fact that it is impossible for any character to break a line when it is displayed in the single line text control in the Windows Character Map, it is obvious why the simple program that builds the grid can display it even if it will not appear in the textbox below.

Of course this does not mean that it isn't there -- the SOFT HYPHEN, if included in the text stream, will be there even if it is usually invisible and ignored.

Now I say usually because in operations like collation on Windows, the SOFT HYPHEN will be ignored (it is given no weight) but it will also break compressions. In practice this should not matter since one should never break a word in the middle of a compression, and in fact this trick could be used to force collation to work right in cases like this one in Hungarian, though I'd recommend against it since you would probably not want to break a line in the middle of a word even in that case....

So, what does Unicode say?

In 2.0, it said:

U+00AD soft hyphen indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line break occurs may differ (for example, in some scripts it is rendered as a hyphen -, while in others it may be invisible).

In other places, the soft hyphen is described as a "discretionary hyphen", which clearly suggests the formatting role as well. It is becoming less and less of a graphic character all the time!

Unicode 4.0, after extensive discussion and review, made the switch for good, and the "changes for 4.0" text calls out the following two points:

And the text in UAX#14: Line Breaking Properties points out yet another issue that people may not have considered, buried in section 5.3 Use of Soft Hyphen:

The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.
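The override behavior described in that passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_HYPHENATOR` is a hypothetical stand-in for a real per-language hyphenation dictionary, and `break_offsets` is a made-up helper name.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# Hypothetical stand-in for a real hyphenation dictionary: word -> break offsets.
TOY_HYPHENATOR = {"record": [3]}  # rec-ord


def break_offsets(word):
    """Return offsets (in the SHY-free word) where a line break is allowed."""
    if SHY in word:
        # Explicit soft hyphens win: use them and do not consult the hyphenator.
        offsets, position = [], 0
        for fragment in word.split(SHY)[:-1]:
            position += len(fragment)
            offsets.append(position)
        return offsets
    return TOY_HYPHENATOR.get(word, [])
```

So "record" gets the dictionary's break after "rec", while "re&shy;cord" gets only the break after "re" that the author typed, and the dictionary is never asked.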

Every time I think about the issue, I am reminded of this case, where an attempt to optimize actually inhibited other optimization attempts. Yet another reason to avoid the soft hyphen? :-)

There is other trivia too: it is removed by nameprep for IDN, Apple's ATSUI does not support it in version 1.1 and later, and Microsoft Typography talks about it a bit as well.
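The nameprep point is easy to check from Python, whose standard `stringprep` module exposes the RFC 3454 tables. The sketch below applies only the "commonly mapped to nothing" step (table B.1), which is where U+00AD disappears; it is not a full nameprep implementation, and `map_to_nothing` is a name made up for the example.

```python
import stringprep

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def map_to_nothing(label):
    """Apply only the 'mapped to nothing' step (RFC 3454 table B.1)."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))
```

Here `stringprep.in_table_b1(SHY)` is true, so `map_to_nothing("exam\u00adple")` gives back plain "example".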

Now after the 4.0 change, Markus Kuhn wrote a strongly worded dissenting opinion on the change, which Ken Whistler responded to:

I believe the recent "clarification" of the semantics of the SOFT HYPHEN (U+00AD) character in Unicode 4.0 had an unfortunate outcome. In particular, changing its class from Pd to Cf in UnicodeData.txt breaks backwards compatibility with how this character was widely used in ISO 8859-1 terminals for the past 15 years and causes now headaches with the designers of VT100-style terminal emulators with ISO 8859-1 and UTF-8 support. 

This may well be the case. I don't have any particular iron in this fire, since I was neither in the camp advocating for this change nor was I particularly set up to argue against making the change.

But the fact that this issue came up, was argued at length, was put up as a public issue for an extended period of time, and then argued some more before it was decided, indicates to me that the status of U+00AD SOFT HYPHEN as a gc=Pd character was causing other people headaches as it stood.


As Unicode claims for U+0000 to U+00FF to be compatible with ISO 8859-1, it should also respect the intended and de-facto use of ISO 8859-1 characters and should not change their semantics over a decade later.

The establishment of Unicode character properties for Unicode characters does not, ex post facto, change the semantics *of* ISO 8859-1. If that were the case, then any number of character property assignments (including compatibility and canonical decomposition mappings), and character property assignment *changes*, such as those for U+00B7 MIDDLE DOT, could be equally attacked as ex post facto changes to ISO 8859-1.

But the additional character behavior specified by the Unicode Standard does not impose constraints back onto standards that those characters map to -- including ASCII and ISO 8859-1. Nothing that the Unicode Standard says about *Unicode* characters can suddenly make a conformant ISO 8859-1 implementation nonconformant in the way it handles characters.

The issue, instead, is interoperability for implementations of Unicode that map back and forth to implementations of 8-bit character encodings (or others), including ISO 8859-1. And I suspect, in the case of SOFT HYPHEN, that the problem we are facing is really that SOFT HYPHEN has had a long history of legacy implementations in two (or more) incompatible ways.

Certainly the terminal display protocols that insert line-ending SOFT HYPHENS as graphic characters which can be stripped back out when presentation text is restored to content text has a long history. But the other model also way predates the examples you cited, going at least as far back as WordStar's internal use of nondisplaying soft hyphen characters as line break opportunities that only displayed visibly (with a hyphen) at actual line breaks. For WordStar it was 0x1E for 'inactive soft hyphen', which was an inserted line break opportunity for word-wrap, and 0x1F for 'active soft hyphen', which was an actually broken word for word-wrap, displayed (and printed) visibly. (WordStar *predates* ISO 8859-1, by the way, since it was first released in 1979.)


As discussed in detail for example on

 
http://www.cs.tut.fi/~jkorpela/shy.html

the ISO 8859-1 standard defines, in section 6.3.3 the SOFT HYPHEN as "[a] graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen".

The ISO 8859-1 standard uses unfortunately only the rather unclear words "for use when a line break has been established within a word" as the complete definition of the intended usage of this character. This clearly falls short completely of setting up a document processing model and defining unambiguously what role SOFT HYPHEN plays in its various phases and functions.


Yep. And that has contributed to the confusion for years. It didn't help that 8859-1 didn't image 0xAD SOFT HYPHEN with a hyphen glyph in the chart, but instead with a "SHY" acronym, implying that it was, in fact, a "funny" character that might not always display visibly. That, plus the less than clear wording in the note on SOFT HYPHEN (now in Clause 5.3.3 in 8859-1) was symptomatic of the aversion of SC2 standards to define "character processing" behavior, but also reflected, I suspect, a deliberate willingness to allow for inconsistent processing models. It isn't much of a stretch to interpret the wording in Clause 5.3.3 as:

   "A graphic character that [when imaged] is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN, ..."
   
which opens the door to the Word/WordPerfect etc. style interpretation.


The definition "graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen" made it clear to users familiar with the above mentioned problem that the SOFT HYPHEN is just an alternative of the normal graphical character HYPHEN, for use when a hyphen is inserted by a line formatting routine.

I don't think it was quite so clear as that.

[ snip HTML discussion ]

This HTML 4 reinterpretation is essentially the semantics that Unicode then adopted as well.
 
Nevertheless, there is a vast number of VT100 terminal emulators, printers, and similar 8-bit output devices out there that treat the SOFT HYPHEN as a full graphical character, as had been suggested by ISO 8859-1


The problem with this is that it assumes that "graphical character" is well-defined and never involves ambiguities of display for SC2 standards.

If you look at ISO 8859-8 (Hebrew), when it was revised to make the use of LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK part of the standard, so that implicit order bidi with 8859-8 was well-defined, those characters were *also* described as "graphic characters", cloning the wording right out of the longstanding and traditional, if somewhat bizarre wording used to describe the SPACE character. For LEFT-TO-RIGHT MARK:

"A graphic character the visual representation of which consists of the absence of a graphic symbol, which acts like a left-to-right character in a bidirectional context..."

If, for an 8859 standard, a character which *never* has a visible display glyph (except for charts or "Show Hidden" contexts) can be considered to be a "graphic character", you can see why the situation for SOFT HYPHEN can be considered less than clear.


and by the old application need to distinguish between content and hyphenation hyphens in formatted presentation data streams.

It is used today by a number of UTF-8 terminal applications to decide, by how many character cell positions the cursor will advance if the Unicode character provided as an argument is sent to the terminal. The rules for generating its semantics from Unicode tables are very simple and include the rule

  - Other format characters (general category code Cf in the Unicode database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.

With the change of SOFT HYPHEN from general category code Pd to Cf in the Unicode 4.0 database, this causes now terminal behaviour to change from wcwidth(0x00ad) = 1 to wcwidth(0x00ad) = 0. In other words, what used to be a spacing graphical character in accordance with ISO 8859-1 that always advances the cursor by one cell after printing the glyph of a hyphen is now an ignorable and usually invisible format character.


It seems like the obvious fix here is to exempt U+00AD from the generic class treatment of Cf characters, both in the stated documentation and in the implementation. In other words, for the purposes of those UTF-8 terminal applications, SOFT HYPHEN is not an "other format character", but is an exception that should go on behaving exactly as you have it currently defined.

By the way, implementers cannot, now, assume that gc=Cf characters (format controls) should *always* be invisibly displayed. The addition of the various Arabic prepositive numeric accumulators (U+0600 ARABIC NUMBER SIGN and the like) have added a subclass of format controls which *do* have visible display glyphs. And there is now a separate Unicode character property, Default_Ignorable_Code_Point, which should also be taken into account when deciding whether a particular character, by default, should be displayed with a zero glyph or a black box glyph, for example, if uninterpreted.
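The exemption suggested above is simple to express. What follows is a deliberately simplified Python sketch of a column-width function, not the actual wcwidth implementation under discussion: it covers only the "Cf and ZERO WIDTH SPACE are width 0" rule plus an exception table, and ignores combining and wide characters entirely. The names `cell_width` and `WIDTH_EXCEPTIONS` are assumptions for the example.

```python
import unicodedata

SHY = "\u00ad"   # U+00AD SOFT HYPHEN
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Characters exempted from the generic "Cf means zero width" rule.
WIDTH_EXCEPTIONS = {SHY: 1}


def cell_width(ch, exceptions=WIDTH_EXCEPTIONS):
    """Simplified column width: exceptions first, then Cf and ZWSP are 0, else 1."""
    if ch in exceptions:
        return exceptions[ch]
    if unicodedata.category(ch) == "Cf" or ch == ZWSP:
        return 0
    return 1
```

With the exception table in place `cell_width(SHY)` stays at 1, and with an empty table it drops to 0 because `unicodedata.category(SHY)` is "Cf". Note that the generic rule also hands U+0600 ARABIC NUMBER SIGN a width of 0, since it is Cf as well, which is exactly the kind of case the caveat about visible format controls warns about.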


In this sense, Unicode 4.0 breaks with the well-established tradition of interpreting the SOFT HYPHEN as a graphical character in output devices.

It would have been nice, if Unicode hadn't done that. Unicode could instead have chosen to add a new ignorable formatting character for marking possible hyphenation points in documents, which could be called for instance HYPHENATION POINT. A formatting function can then either discard a HYPHENATION POINT (if it ended up inside a formatted line), or convert it into the graphical SOFT HYPHEN character, where the hyphenation point ended up at the end of a line in the presentation data stream.


This possible approach was also debated, but was rejected. The opponents of that approach can speak for themselves, but if I recall, this approach would, itself, have had at least as many legacy compatibility issues.

This would have preserved backwards compatibility with the zillions of ISO 8859-1 output devices out there that treat SOFT HYPHEN as a graphical character.

What shall I now do as the implementor of an ISO 8859-1 terminal emulator when I receive a SOFT HYPHEN?


Exactly what you are currently doing.

Will the next edition of ISO 8859 be changed, to remove the definition of the SOFT HYPHEN as a graphical character?

Of course not. It will stay exactly as it is.

The ambiguity in "graphic character" will stay unchanged in the SC2 standards. Note that 10646 itself keeps the traditional SC2 definition:

  A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed.
 
but then proceeds to encode a whole host of space characters and format control characters which normally *don't* have a visual representation. These are then swept under the rug with the same logical nicety used for SPACE:

  "A graphic character the visual representation of which consists of the absence of a graphic symbol."
  
Uh, huh. O.k., well, then... ;-)

  
Or, my preferred outcome, do you agree that all this SOFT HYPHEN = Cf revision was probably a mistake and we should undo everything quickly in the next revision?

I think that is most unlikely at this point. The issue for SOFT HYPHEN was up for public review for rather a long time. The decision was not hurried for it, but extended through a number of UTC meetings, precisely because people were worried about compatibility and legacy issues. But I don't think the issue should be reopened and redecided differently. The only thing worse than a poor decision by a standards committee is waffling about decisions by a standards committee.

And in this case, I don't really see why you cannot keep on doing what you are currently doing for the UTF-8 terminal emulations. Document how U+00AD behaves in those emulations, and you should be fine.

And after that, Michel Suignard pointed out an issue that had been overlooked by some, which was 10646 stepping up!

Note that, unusually, the latest text from ISO 10646, both in the 10646-1:2000 2nd amendment and the consolidated version, captures verbosely the latest view on this as follows:

SOFT HYPHEN (00AD): SOFT HYPHEN (SHY) is a format character that indicates a preferred intra-word linebreak opportunity. If the line is broken at that point, then whatever mechanism is appropriate for intra-word line-breaks should be invoked, just as if the line break had been triggered by another mechanism, such as a dictionary lookup. Depending on the language and the word, that may produce different visible results, such as:

The inserted graphic symbol, if any, can take a wide variety of shapes, such as HYPHEN (2010), ARMENIAN HYPHEN (058A), MONGOLIAN TODO SOFT HYPHEN (1806), as appropriate for the situation. When encoding text that includes explicit line breaking opportunities, including actual hyphenations, characters such as HYPHEN, ARMENIAN HYPHEN, and MONGOLIAN TODO SOFT HYPHEN may be used, depending on the language.

When a SOFT HYPHEN is used to represent a possible hyphenation point, the character representation is that of the text sequence without hyphenation (for example: "tug<00AD>gumi"). When encoding text that includes hard line breaks, including actual hyphenations, the character representation of the text sequence must reflect the changes due to hyphenation (for example: "tugg<2010>" / "gumi").

This was discussed at length during the UTC and WG2.
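The "tug<00AD>gumi" example is worth spelling out, since it shows why the hyphenated form is not just the stored text with a hyphen glyph dropped in. In this Python sketch the `SPELLING_CHANGES` table is a hypothetical stand-in for real language-specific hyphenation rules, and the function names are made up for the illustration.

```python
SHY = "\u00ad"     # U+00AD SOFT HYPHEN
HYPHEN = "\u2010"  # U+2010 HYPHEN

# Hypothetical exception table: (before, after) fragments whose spelling
# changes when the word is actually broken at that point.
SPELLING_CHANGES = {("tug", "gumi"): ("tugg", "gumi")}


def unbroken(word):
    """Presentation form when no line break falls inside the word."""
    return word.replace(SHY, "")


def broken_at_shy(word):
    """Presentation form (two line fragments) when the break falls at the SHY."""
    before, after = word.split(SHY, 1)
    before, after = SPELLING_CHANGES.get((before, after), (before, after))
    return before + HYPHEN, after
```

The stored text "tug<00AD>gumi" displays as "tuggumi" when unbroken, and as "tugg<2010>" followed by "gumi" when the line breaks there, matching the two representations quoted above.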

And Kent Karlsson also pointed out some facts that had been ignored by Markus and others who were stating their opinions:

That text is unfortunately too easy to misread (and overinterpret!). Having talked to one of the authors, he was very surprised at the Kuhn/Korpela interpretation. I think it is a case of being too close to a text to have seen how it could be misread by others.  Kuhn's interpretation was definitely not intended (and very few interpret it that way).

The intent of that text, that you partially quoted, is that SOFT HYPHEN is graphic (and imaged) WHEN an (automatic) line break has been made (while it is otherwise invisible, which was not clearly stated). Unicode, SC2/WG2, SC2/WG3, IBM, MS, Adobe, and many others agree on that. Whether it is visible just before an explicit line break (e.g. an LF), is still not clearly stated (though in practice it is, by the already deployed software that does support SHY).

What IS new with Unicode 4.0 is that **when imaged** the SOFT HYPHEN may take any suitable hyphen shape (to be nitpicking, it is best not to see the SOFT HYPHEN as ever being imaged (except in a "show invisibles" mode); it is just a hyphenation point indication, and the hyphen being imaged when there is a line break is not the actual SHY character). E.g., in Mongolian texts, it should be imaged as a MONGOLIAN TODO SOFT HYPHEN (the "soft" in that name has been decided to be a mistake). In an Armenian text, SOFT HYPHEN *when imaged* takes on the shape of an ARMENIAN HYPHEN.

Some may also remember my Not all GetUnicodeCategory methods are created equal post, which clearly notes the fact that there was some managed code that was depending on the old categorization of SOFT HYPHEN....

All in all, it makes for a fascinating story. :-)

 

This post brought to you by U+00ad, a.k.a. SOFT HYPHEN


# orcmid on 2 Sep 2006 7:23 PM:

Well, that is interesting.  Thinking about ISO 8859-1, and the location of the character right in the middle of what are thought of as printable characters, and pretending I'd never heard of this before, or known that &shy actually maps to it (bummer), I would have thought:

- the soft hyphen indicates where a hyphen has been introduced as part of breaking a word for line justification purposes.

That is, if I wanted to reassemble the text, this is a hyphen that can be removed and the word put together with its continuation on the following line.

There might be some reason to remember the point(s) where hyphenation is allowed, and how that gets done in an interchange situation is certainly an interesting case.

Which just goes to show that decontextualized absolute definitions of these kinds of thingies are perilous and generally unsatisfying because the same code could be used in different ways for formatted, unformatted, and private (e.g., pre-formatting) use of the same character code.

I like the HTML 4.0 definition for &shy; (that's &amp;shy; depending on the filtering of entity references by the comment processor here) just fine, and the instruction to the renderer to recognize &#xAD (ahem, and what character code is that presumed to be in?) in the same way seems clean enough -- this is an application agreement (i.e., arguably a pre-formatting case).  I wonder what our favorite browsers do with it in a <pre> element, though.

# RubenP on 4 Sep 2006 4:23 PM:

Ignoring the technicalities of the soft hyphen's graphicality...

- it's a life saver when there's no automatic hyphenation algorithm in place. Such as in HTML when you can expect that a certain word *might* need to be hyphenated to produce a proper layout;
- it's a life saver when there *is* an automated hyphenation algorithm in place, to distinguish between re-cord and rec-ord when the word record *might* be hyphenated.

[Moving slightly off topic...]

Unfortunately, the soft hyphen is a little too simple for many languages, with German Zucker - Zuk-ker (at least before the Rechtschreibung IIRC) and Dutch baby'tje - baby-tje, cafeetje - café-tje, geëerd - ge-eerd (both old and current spelling). Real hyphenation algorithms and exception tables are required here. Though most hyphenation algorithms do *not* like this.

Word does get Dutch hyphenation with a trema/diaeresis right most of the time, which is better than most programs can claim. (I know the software used by one of the newspapers I read can't, creating spelling horrors like ge-ëerd.) And to make it more bullet-proof, Word cleverly refuses to hyphenate the other varieties at the 'odd' location, so it's always at the very most ca-feetje, and never cafee-tje or cafeet-je, but no café-tje either. (Unfortunately, souveniertje is incorrectly hyphenated as souvenier-tje; which should obviously be souvenir-tje. Well duh!)

Let's hope WPF uses the same rules as Word does.

# xxx on 22 Nov 2006 3:37 AM:

zzz  I have Javascript OFF, so how can this work ?

# Michael S. Kaplan on 22 Nov 2006 8:51 AM:

How can what work?



