if you see a ZWNBSP in the Release Preview, don't be insensitive and comment it hasn't been eating enough lately!

by Michael S. Kaplan, published on 2012/07/16 16:04 +02:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/07/16/10330046.aspx

Just yesterday, I was asked:

Hi Michael,

I'm running the Release Preview, and I found something confusing.

*SP* has weight, *ZWSP* has weight, and *NBSP* has weight, but *ZWNBSP* has no weight!

Is this a bug? Is it too late to fix before Windows 8 comes out?

Funny he should say that!

This was actually an intentional change that happened in the new sorting version in Windows 8, just after the Windows 8 Consumer Preview and shortly before the Windows 8 Release Preview.

This change directly undoes the change described in "If a bunch of specific Unicode characters can no longer live in the same apartment together, can they really claim that they needed their space?".

It works like this:

The space character, U+0020, is given weight in the collation table.


Perhaps more to the point, U+200b (ZERO WIDTH SPACE) and U+00a0 (NO-BREAK SPACE) have weight.

And for the last four versions of Windows, up to and including the Windows 8 Developer Preview and the Windows 8 Consumer Preview, U+feff (ZERO WIDTH NO-BREAK SPACE) has had weight too.

With a new major sorting version starting with the Windows 8 Release Preview, U+feff has no weight again, so this single inconsistency among the spaces has been added once again....

As a bonus, we are also once again inconsistent with the Unicode Collation Algorithm, but I've been telling people for years that Microsoft does not use the Unicode Collation Algorithm.

If you want to make the other spaces also have no weight, then as I pointed out in "I need my SPACE, symbolically speaking", the weight these characters are given is in the symbol range. So if you truly want to ignore the others, you can just call CompareString or CompareStringEx with the NORM_IGNORESYMBOLS flag. And you can go from there....
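A minimal sketch of that call from Python via ctypes, for anyone who wants to try it. The helper names compare_strings and equal_ignoring_symbols and the choice of the invariant locale are assumptions made for illustration, not something from this post; CompareStringEx, NORM_IGNORESYMBOLS and the CSTR_* return values are the documented Win32 ones. It only runs on Windows, and what it returns for a given pair of strings depends on the sorting version of that Windows release.

```python
import ctypes
import sys

# Documented Win32 values (winnls.h).
NORM_IGNORESYMBOLS = 0x00000004
CSTR_LESS_THAN, CSTR_EQUAL, CSTR_GREATER_THAN = 1, 2, 3
LOCALE_NAME_INVARIANT = ""  # assumption: the invariant locale is enough for a demo


def compare_strings(a: str, b: str, flags: int = 0) -> int:
    """Hypothetical helper: call CompareStringEx and return its CSTR_* result."""
    if sys.platform != "win32":
        raise OSError("CompareStringEx is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.CompareStringEx
    fn.argtypes = [
        ctypes.c_wchar_p,  # lpLocaleName
        ctypes.c_uint32,   # dwCmpFlags
        ctypes.c_wchar_p,  # lpString1
        ctypes.c_int,      # cchCount1 (-1 means null-terminated)
        ctypes.c_wchar_p,  # lpString2
        ctypes.c_int,      # cchCount2
        ctypes.c_void_p,   # lpVersionInformation (NULL: current version)
        ctypes.c_void_p,   # lpReserved
        ctypes.c_ssize_t,  # lParam
    ]
    fn.restype = ctypes.c_int
    result = fn(LOCALE_NAME_INVARIANT, flags, a, -1, b, -1, None, None, 0)
    if result == 0:  # 0 signals failure; details are in GetLastError
        raise ctypes.WinError(ctypes.get_last_error())
    return result


def equal_ignoring_symbols(a: str, b: str) -> bool:
    """True if the strings compare equal once symbol-range weights are ignored."""
    return compare_strings(a, b, NORM_IGNORESYMBOLS) == CSTR_EQUAL
```

For example, equal_ignoring_symbols("a b", "a\u00a0b") asks whether the two strings compare equal once the symbol weights, the ones these spaces carry, are left out of the comparison.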

Now of course some people will cry out that U+feff is not just a space like the others, it is also the BYTE ORDER MARK (ref: "Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)"). Now in my humble opinion (many years and many buildings away from the group that owns the tables), the BOM has an even clearer semantic meaning attached to it, so ignoring it completely would not really be a linguistic or even a semantic requirement, the way it is for some other characters like the ones I mention in "Every character has a story #23: U+00ad (SOFT HYPHEN)" and "You've got to be kashidding me"....
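To make the double life of U+feff concrete, here is a small Python sketch (the variable names are illustrative assumptions). At the start of a byte stream the character acts as a byte order mark, and a signature-aware decoder strips it; anywhere else it survives decoding as an ordinary ZERO WIDTH NO-BREAK SPACE, which is the case a collation table has to decide about.

```python
import codecs
import unicodedata

ZWNBSP = "\ufeff"

# Same code point, two jobs: the Unicode name says "space"...
name = unicodedata.name(ZWNBSP)

# ...while its UTF-8 encoding is exactly the UTF-8 byte order mark.
is_bom = ZWNBSP.encode("utf-8") == codecs.BOM_UTF8

# A signature-aware decoder removes only a leading U+FEFF.
raw = (ZWNBSP + "a" + ZWNBSP + "b").encode("utf-8")
decoded = raw.decode("utf-8-sig")

print(name)           # ZERO WIDTH NO-BREAK SPACE
print(is_bom)         # True
print(repr(decoded))  # 'a\ufeffb'
```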

But the theoretical viewpoint that drove this change all those years ago was overruled by several important real world customer and partner scenarios.

And the "standards conformance" argument seldom holds water when we don't conform to that standard!

Second runner-up for this blog:

Reality trumps theory (almost always!)

Plus I wasn't there anymore, and neither was the linguist I was working with back then.

First runner-up for this blog, in case the title is unable to fulfill its duties:

The people who are there trump the people who used to be there (most of the time!)

So anyway, if you see a ZERO WIDTH NO-BREAK SPACE in the Windows 8 Release Preview, don't be insensitive and comment it hasn't been eating enough lately!
