Give me a break [Char] !

by Michael S. Kaplan, published on 2006/06/21 03:01 -04:00, original URI:

Over on the Shell team, Jeff Miller is one of those very cool developers who knows how to get stuff done. And I am not just saying that because he let me fix a bug in the Shell's code (this bug, in fact [1]).

Anyway, he sent email to one of the internal aliases asking whether there were actually fonts that specified anything other than U+0020 for the tmBreakChar member of the TEXTMETRIC structure.

It is an interesting question. And it is true that tmBreakChar is documented as:

Specifies the value of the character that will be used to define word breaks for text justification.

which suggests that any character might be a good candidate, depending on the language (or perhaps the script). Given that keyboards for several languages put different characters on the space bar, that might make some sense. Considering especially the MSKLC limitations on the space bar that affected Tibetan and still affect Khmer and a few other languages and scripts, it seems perfectly reasonable that other characters might show up here for other scripts and other languages.
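To make the documented role of a break character concrete, here is a minimal sketch of justification by break character: the leftover width on a line is shared out among the occurrences of that character. The function name and the even-split rule are assumptions for illustration only; this is not a description of what GDI actually does internally.

```python
from typing import List


def spread_break_extra(line: str, extra_pixels: int, break_char: str = "\u0020") -> List[int]:
    """Return the extra pixels to add at each break character in `line`.

    Hypothetical helper: splits `extra_pixels` as evenly as possible across
    every occurrence of `break_char` (U+0020 unless the caller says otherwise).
    """
    if extra_pixels < 0:
        raise ValueError("extra_pixels must be non-negative")
    break_count = line.count(break_char)
    if break_count == 0:
        # Nothing to stretch: the line contains no break characters.
        return []
    base, remainder = divmod(extra_pixels, break_count)
    # The first `remainder` breaks take one more pixel so the total is exact.
    return [base + (1 if i < remainder else 0) for i in range(break_count)]
```

Passing a different `break_char` is the only thing that changes for a script whose words are separated by something other than U+0020, which is roughly the flexibility the tmBreakChar documentation seems to promise.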

However, the truth is that none of the fonts that Microsoft uses appear to ever return anything other than the ordinary space at U+0020, even for languages which in theory might consider a different character the best one to use for word breaks. Carolyn suggested that this particular member is most likely coming from the usBreakChar entry in the OS/2 and Windows Metrics table:

This is the Unicode encoding of the glyph that Windows uses as the break character. The break character is used to separate words and justify text. Most fonts specify 'space' as the break character. This field cannot represent supplementary character values (codepoints greater than 0xFFFF).
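For anyone who wants to check what a given font actually stores in this field, the following is a rough sketch that reads usBreakChar straight out of the font's bytes. It assumes a single-font sfnt file (not a .ttc collection), relies on the published table layout in which usBreakChar sits at byte offset 92 of an OS/2 table of version 2 or later, and uses a hypothetical function name.

```python
import struct
from typing import Optional


def read_us_break_char(font: bytes) -> Optional[int]:
    """Return usBreakChar from a single-font sfnt (TrueType/OpenType) blob.

    Returns None when the font has no OS/2 table, or when the table is older
    than version 2 (the field did not exist before then).
    """
    # sfnt header: uint32 version, then uint16 numTables at byte offset 4.
    num_tables = struct.unpack_from(">H", font, 4)[0]
    for i in range(num_tables):
        # Each 16-byte table record: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(">4sIII", font, 12 + 16 * i)
        if tag != b"OS/2":
            continue
        version = struct.unpack_from(">H", font, offset)[0]
        if version < 2 or length < 94:
            return None
        # usBreakChar is a big-endian uint16 at offset 92 within the table.
        return struct.unpack_from(">H", font, offset + 92)[0]
    return None
```

Called as `read_us_break_char(open(path, "rb").read())`, with `path` pointing at a font file of your choosing, it returns the stored code point as an integer, so 32 corresponds to U+0020.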

Given the real complexities of complex script handling and how various languages and scripts have to handle word breaking, it is most likely that this particular member is not really used by most TrueType/OpenType implementations. So while it is true that some fonts may be setting it to something else, it does not seem like most fonts do (or that Windows would actually use the information if a different character were specified!)....

Maybe it is just one of those holdovers from the days before complex scripts, but from a time when people were thinking far enough ahead to imagine that there might be some other character between words? :-)

1 - If memory serves, I asked Jeff just after I had posted that blog entry whether he would mind if I poached the bug, and he responded with something clever like "Please, poach all you want -- we'll make more!". And he had good feedback during the code review, to boot! :-)


This post brought to you by " " (U+0020, a.k.a. SPACE)

# Michael S. Kaplan on 21 Jun 2006 12:29 PM:

A comment was accidentally deleted, someone asking about what CTRL+BACKSPACE was adding in Notepad -- it is U+007f (everyone's favorite backspace control character, of course!)....

# Gabe on 23 Jun 2006 4:15 AM:

I'm not sure it even makes sense to have this as a characteristic of a font. Shouldn't it be per-language or per-script?

# Andrew West on 4 Feb 2008 6:29 AM:

Googling desperately for a solution to a problem with a font I have just created, I find that all roads lead back to SiaO!

I've just created a font that, for reasons I'd best leave unexplained, has a non-blank glyph for U+0020. My problem is that when I type text using the font in Notepad everything is fine as long as I just type alphabetic letters (a..z, A..Z and space), but as soon as I type any other character (e.g. a comma) my space glyph suddenly appears overlaying the first character on the line. And in Word 2003, my space glyph always appears at the end of every line whatever I type (even blank lines). I guessed that this must be because Windows was displaying my space glyph as the "break character", but when I changed the value of usBreakChar in the OS/2 table from "32" (i.e. U+0020) to "84" (i.e. U+0054, whose glyph, "T", is blank in my font) there was no change in behaviour, and the glyph for U+0020 continued to be used for the break character. Which I guess indicates that the actual value of usBreakChar is being ignored, and probably a hard-coded glyph index is being used :-(

The solution which I am going to try this evening is to give U+0020 a blank glyph, and put the space glyph somewhere else; then add in a GSUB table that unconditionally substitutes the blank space glyph with my actual space glyph -- just hope it works.

Or an alternative solution worth trying might be to make glyph id 3 blank but not mapped to U+0020, and instead map U+0020 to a non-blank glyph elsewhere in the font. Either way, it's a bit annoying to have to use hacks to get a non-blank space character.

# Andrew West on 4 Feb 2008 6:30 PM:

Just in case anyone is interested, the OpenType solution works (at least with apps that support OpenType features), but my alternative solution does not, as it appears from experimentation that the glyph that is mapped to U+0020 is always used as the break character, whatever its glyph ID, and regardless of the value of usBreakChar.
