Sometimes, we don't break for spaces...

by Michael S. Kaplan, published on 2012/05/10 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/05/10/10303623.aspx


This blog today is about a character in Unicode.

U+00a0, aka NO-BREAK SPACE, specifically.

[Image: U+00a0, aka NO-BREAK SPACE, in its Code Chart view]

I could have made it an Every Character Has a Story blog, almost.

Except it is really going to be about locales on Microsoft platforms, rather than a history and/or story of the character itself.

So I won't talk about the suggestion to Sri Lanka to use it in their Standards, or the role Unicode has it play in lone combining characters, or any of the other interesting stories about it.

Sorry!

To start, there is a regular space, U+0020, which anyone rendering text is allowed to treat as a line breaking opportunity.

Like if you have more characters in a line than you have line, then it will break at one of those places -- perhaps on that space!

But if you put a NO-BREAK SPACE there, then it will not be used as a line breaking opportunity -- the text on either side will act as if it is just another letter or something.
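A quick way to see the difference is a minimal sketch in Python. It assumes the standard textwrap module, which only treats US-ASCII whitespace as a place to break; that is a property of that module, not of any Windows text renderer. The sample strings and the wrap() helper are made up for illustration.

```python
import textwrap

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def wrap(text, width):
    # Hypothetical helper: break only at whitespace, never inside a word.
    return textwrap.wrap(text, width=width,
                         break_long_words=False,
                         break_on_hyphens=False)

plain = "posted 10 May"            # U+0020 between "10" and "May"
glued = "posted 10" + NBSP + "May"  # U+00A0 between "10" and "May"

plain_lines = wrap(plain, 10)  # the regular space is a break opportunity
glued_lines = wrap(glued, 10)  # "10" and "May" have to travel together

print(plain_lines)
print(glued_lines)
```

With a width of 10, the first string is allowed to break between "10" and "May", while in the second one that pair moves to the next line as a unit.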

I endeavored to explain to my girlfriend what U+00a0 does, and she suggested maybe it was like how she and I were connected. That'll work. :-)

Anyhow, if you look at all of the LOCALE data in Windows, there are ~185 instances of the NO-BREAK SPACE, U+00a0.

The ~185 instances fall into two categories:

Now that second category makes sense.

If one has a month name of كانون الثاني, one may genuinely want to not let it span lines.

And so on.

The first category also makes sense -- one may want to make sure that the number $100 000 000.00 or 45 678.00 doesn't get split up either.
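To make that concrete, here is a small sketch that groups digits with U+00a0. It is not how Windows formats numbers; format_grouped() is a hypothetical helper that just swaps out the comma Python's own formatting produces.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def format_grouped(value, group_sep=NBSP):
    # Hypothetical helper: group by thousands with ",", then replace
    # the comma with the separator we actually want.
    return f"{value:,.2f}".replace(",", group_sep)

small = format_grouped(45678)
large = format_grouped(100000000)

# Neither result contains a regular space, so a line breaker that
# honors U+00A0 has nowhere to split the number.
print(repr(small))
print(repr(large))
```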

In fact, one may wonder about the ~9 cases that are similar to category #1 that use U+0020 for their LOCALE_STHOUSAND or LOCALE_SMONTHOUSANDSEP, right? :-)

You have to wonder whether some or all of those ~9 cases, and of the other ~214 category #2 usages of U+0020, are mistakes that would have been U+00a0 if someone had had a chance to think about it!

And then there are a few other interesting cases:

All of these cases have one thing in common.

According to docs, they insert a SPACE (LOCALE_ICURRENCY calls it a "separation") in all of these cases, even if the LOCALE_STHOUSAND or LOCALE_SMONTHOUSANDSEP have U+00a0 in them.

Obviously either the docs are wrong or the code creates formatted strings that could be broken before the line ends even if the separators clearly try to avoid this.
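The worry can be sketched without touching Windows at all. The code below only simulates what the docs describe and is not a call into any real formatting API: format_currency() and break_opportunities() are hypothetical helpers, and the four layouts assume the documented LOCALE_ICURRENCY values (0 and 1 with no separation, 2 and 3 with a one-character separation).

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def format_currency(amount, symbol, icurrency, group_sep=NBSP, gap=" "):
    # Hypothetical simulation of the documented positive currency layouts.
    number = f"{amount:,.2f}".replace(",", group_sep)
    layouts = {
        0: symbol + number,        # prefix, no separation
        1: number + symbol,        # suffix, no separation
        2: symbol + gap + number,  # prefix, one-character separation
        3: number + gap + symbol,  # suffix, one-character separation
    }
    return layouts[icurrency]

def break_opportunities(text):
    # Indexes of every U+0020, i.e. every place a line could be split.
    return [i for i, ch in enumerate(text) if ch == " "]

formatted = format_currency(45678, "€", 3)
print(repr(formatted), break_opportunities(formatted))

# Using U+00A0 for the separation too leaves nothing to break on.
safe = format_currency(45678, "€", 3, gap=NBSP)
print(repr(safe), break_opportunities(safe))
```

If the separation really is a plain SPACE, the grouping separators do their job and the symbol can still get stranded on the next line.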

I don't know about you, but both ideas fail to sit very well with me, entirely.

How about you?

I'm almost afraid to try. Almost....


John Cowan on 10 May 2012 8:02 AM:

So what appears to be a gap between you is actually an unbreakable bond?  How extremely romantic!

Michael S. Kaplan on 10 May 2012 8:07 AM:

Correct, no mere space! :-)

cheong00 on 10 May 2012 6:55 PM:

I don't know... I mostly do web programming, so I need to turn them into &nbsp; before displaying anyway.

cheong00 on 10 May 2012 11:01 PM:

Oops, not realizing the blog software doesn't escape it to &amp;nbsp; for me. :P

Aaron Eshbach on 11 May 2012 8:36 AM:

As someone who writes software that generates XSL Transforms, I'm more familiar with it as &#xA0;

