Japanese line breaking rules can be quite complex

by Michael S. Kaplan, published on 2007/10/29 09:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/29/5754348.aspx


A very long time ago, John Black asked in what had been the oldest post in the Suggestion Box:

I've noticed when developing web pages that use Japanese characters that many web browsers treat *every* character as a potential word boundary, meaning if you try to rely on minimum-width side effects of text blocking, you end up with a lot of overly-wrapped Japanese text.

This seems to be particular to Japanese language sets, and on more than one browser, so that's why I thought I would ask here (rather than rant about a potential browser bug.)

How are word-boundary rules for different locales handled by Windows apps -- is it different from app to app, or is there some centralized property of the maps that have this info?

And then not very long ago, a slightly different question was asked on an internal list:

Hi,

I am facing this problem in one Microsoft portals Japanese locale, below is the description of the problem:-

We experience problems with kinsoku shori, where the line breaks at a specific character that shouldn't be broken.  This is also inconsistent in every browser because of personal preferences with resolution and font-size.
 
The content is being pulled from a content management system (intaglio) and coupled with the product's css styles.

Most Japanese websites I visited do not seem to have any special CSS styles within their blocks of text, so I am unclear of how they handle the issue or if they even handle it.  The available CSS styles that resolve these problems are CSS3, which every browser is yet to be compatible with.
http://www.css3.com/css-line-break/ (Please go thru this link for better understanding)

As a temporary fix, we have manually entered in white spaces in the content to introduce or force a line break, which may still be inconsistent in other browser settings.  I was hoping that you could give me a little insight on what can be done if you have seen any methods to resolve this.  We may need to have a more permanent solution in the future.

Please reply to me if you have any information/suggestions on handling kinsoku.

Anyway, lots of other people are noticing problems with Japanese text in the browser, which almost gives the lie to the cultural worries about a lack of complexity in Japanese that I talked about a bit in "If you aren't adequate, I guess that means you're inadequate; if you're not complex, I suppose that means you're simple?", if you ask me....

The truth is that CSS does capture much of this complexity; it just isn't as well understood as it could be, I think.

Michel Suignard responded to the second question:

To me it does not look like a bug, it is basically loose kinsoku which allows break inside syllable like ‘sho’. If you want to disable that you have to set kinsoku (line-break) to strict. I could not find the sequence for ‘apurikeeshono’ in the appended txt file which seems to be the kana sequence that is ‘problematic’. This is consistent with typical Japanese text processing.

And then Paul Nelson provided a sample that you could put in an .HTM page and then watch line breaking act a bit more appropriately:

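<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
</head>
<body>
<!-- No line-break property set: the browser's default rules apply -->
<p>ビジネス アプリケーション</p>
<p>ビジネスを実行するオンラインアプリケーション</p>
<!-- Same text, with strict kinsoku rules requested -->
<p style="line-break:strict;">ビジネスを実行するオンラインアプリケーション</p>
</body>
</html>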
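
For the curious, here is a minimal Python sketch of the idea behind loose versus strict kinsoku as Michel describes it. It is not how any browser actually implements line breaking: the `wrap` function, the `NO_LINE_START` set, and the fixed character-count width are all names and simplifications made up for illustration, and the set is only a small sample of the characters that strict rules keep off the start of a line (small kana, the prolonged sound mark, some closing punctuation).

```python
# Illustrative subset (an assumption, not a complete or official list) of
# characters that strict kinsoku keeps from appearing at the start of a line.
NO_LINE_START = set("ァィゥェォッャュョー、。」）")


def wrap(text, width, strict):
    """Split text into lines of at most `width` characters.

    Loose mode breaks wherever the width runs out. Strict mode moves the
    break earlier so the next line does not start with a character in
    NO_LINE_START. If no such break point exists, it falls back to the
    plain break at `width`.
    """
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        if strict:
            candidate = end
            # Back up while the next line would begin with a forbidden character.
            while candidate > start + 1 and text[candidate] in NO_LINE_START:
                candidate -= 1
            if text[candidate] not in NO_LINE_START:
                end = candidate
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


if __name__ == "__main__":
    sample = "オンラインアプリケーション"
    print(wrap(sample, 11, strict=False))
    print(wrap(sample, 11, strict=True))
```

With a width of 11, the loose version breaks オンラインアプリケーション right in the middle of ショ, which is the 'sho' case Michel mentions, while the strict version backs up one character and keeps ション together on the next line.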

The full story on this subject probably deserves an article or a blog post of its own, though I don't have the expertise to get into it without doing a bit of research first....

So consider this post to be an acknowledgment of complexity, a promise of future effort, and then if you're interested you can just stay tuned. :-)

 

This post brought to you by ビ (U+30D3, a.k.a. KATAKANA LETTER BI)


# Joe Clark on 29 Oct 2007 1:53 PM:

Under CSS3, you mean word-break, not line-break, and please don’t put it inline on P.

http://www.w3.org/TR/css3-text/#word-break

# SDiZ on 30 Oct 2007 7:14 AM:

> ... that you could put in an .HTM page ..

Use .html, not .htm! No OS have the 8.3 constant now.

# Michael S. Kaplan on 30 Oct 2007 3:14 PM:

It probably does not matter all that much, since every browser on the planet can support either extension. :-)

