If a bunch of specific Unicode characters can no longer live in the same apartment together, can they really claim that they needed their space?

by Michael S. Kaplan, published on 2007/05/17 05:45 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/05/17/2692362.aspx

So dipaksmistry asks (in an off-topic manner in response to this post):


I have written a small programme. Its code is as below.

#include "stdafx.h"
#include <tchar.h>
#include <Shellapi.h>
#include <malloc.h>
#include <stdlib.h>

int APIENTRY WinMain(HINSTANCE hInstance,
                    HINSTANCE hPrevInstance,
                    LPSTR     lpCmdLine,
                    int       nCmdShow) {
    TCHAR t3[15];

    t3[0] = 65279;     // equivalent to 0xFEFF,   (in my real application this byte is read from a unicode file in UTF-16 format and its value is 0xFEFF)
    t3[1] =_T('a'); 
    t3[2] =_T('b'); 
    t3[3] =_T('c'); 
    t3[4] =_T('d'); 
    t3[5] =_T('e'); 
    t3[6] =_T('\0'); 

    if ((lstrcmpi (t3,_T("abcde"))) == 0) {
        MessageBox(NULL,_T("two strings are equal"), _T("match result"), MB_OK);
    } else {
        MessageBox(NULL,_T("NOT EQUAL"), _T("match result"), MB_OK);
    }

    return 1;
}

I have built it and made an executable.

Now when I run this executable on an XP machine (with SP2 installed), it gives the first message ("two strings are equal").

string comparison passes on XP machine.

But when I run this application on a Vista machine, it gives the second message ("NOT EQUAL").

string comparison fails on Vista machine.

Does anybody have any idea why this is happening?

This was actually an intentional change that came with the new sorting version in Vista.

It works like this:

The space character, U+0020, is given weight in the collation table.


Perhaps more to the point, U+200b (ZERO WIDTH SPACE) and U+00a0 (NO-BREAK SPACE) also have weight, which made the fact that U+feff (ZERO WIDTH NO-BREAK SPACE) had none just a little bit inconsistent...

With a new major sorting version in Vista, this one inconsistency was removed: U+feff now has weight like the other spaces....

As I pointed out in "I need my SPACE, symbolically speaking", the weight these characters are given is in the symbol range, so if you truly want to ignore U+feff or any of them, you can just call CompareString or CompareStringEx with the NORM_IGNORESYMBOLS flag. And you can go from there....

Now of course some people will cry out that U+feff is not just a space like the others, it is also the BYTE ORDER MARK (ref: "Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)"). But the BOM has an even clearer semantic meaning attached to it, so ignoring it completely would not really be a linguistic or even a semantic requirement, the way it is for some other characters such as the ones I mention in "Every character has a story #23: U+00ad (SOFT HYPHEN)" and "You've got to be kashidding me"....
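For a case like the one in the question, where the U+feff came from the start of a UTF-16 file rather than from the text itself, one option is to drop it when the file is read, so it never reaches a comparison at all. The following is only a sketch under that assumption. The helper name skip_leading_bom is made up for illustration, and it expects text that has already been decoded to native byte order.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not part of any Windows API): returns a pointer past
   a single leading U+FEFF, or the original pointer if there is none.
   A U+FEFF anywhere else in the string is left alone. */
const wchar_t *skip_leading_bom(const wchar_t *text)
{
    if (text != NULL && text[0] == 0xFEFF) {
        return text + 1;
    }
    return text;
}
```

With this in place, the question's comparison could be written as lstrcmpi(skip_leading_bom(t3), _T("abcde")) in a Unicode build, which does not depend on how a particular sorting version weights U+feff.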

Getting back to the title of this post, it looks like in Vista U+feff can finally feel like it is not being ignored, so it won't need to move out claiming that it needs its space. :-)


This post brought to you by U+feff, a.k.a. ZERO WIDTH NO-BREAK SPACE

# Mihai on 17 May 2007 2:23 PM:

<<Now of course some people will cry out that U+feff is not just a space like the others, it is also the BYTE ORDER MARK>>

And I would say they are wrong :-)

BOM is to be used if and only if the endianness of the data stream is unknown.

Which is not at all the case with the Windows API.

What if Windows moves to Power PC? :-)

No problem. Because the endianness of the Windows API strings is not Little-Endian, it is "processor-endianness."

It just so happens that this is LE on Intel processors.

# dipaksmistry on 18 May 2007 1:50 AM:

Hi Michael ,

Thank you very much for helping me!

Best Regards,

Dipak M


referenced by

2012/07/16 if you see a ZWNBSP in the Release Preview, don't be insensitive and comment it hasn't been eating enough lately!
