Your data will be released, unharmed

by Michael S. Kaplan, published on 2006/07/19 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/19/670674.aspx

What I was talking about was making sure that when comparisons are made that involved sequences like the one above that have a canonical equivalance to (and which happen to be visually indistinguishable from) another string that they will be treated as if they were equal.

But no part of that comparison operation done with the Bengali locale or culture, whether done with CompareStringEx, CompareString, CompareInfo.Compare, or any of the functions/methods that call them, will do anything to the string to modify what is put in the backing store.

Now with that said, the simple fact is that U+09dc is canonically equivalent to U+09a1 U+09bc. And because of that, it is quite possible (throughout the entire world wide web and across all the many software products out there) to run across those who follow the principle of normalizing strings to some consistent form, and doing so consistently.

Hell, there are a few people who believe so strongly in normalizing early and often that they make religious fundamentalist extremists seem mild and dull.

Even Microsoft, in cases of working within standards like IDN and XML and others, may be normalizing a string or two.

And while this can make a native speaker of a language unhappy (as it clearly has in this case), it is important to not place too much stock on this issue as a problem, because it is really not.

The issue is similar to one I often bring up related to canonical equivalence, with U+00e5 (å) being canonically equivalent to U+0061 U+030a (å). This too is an equivalence that is annoying for a native speaker of a language that considers U+00e5 to be a helluva lot more than the letter a with some shmutz on top of it -- U+00e5 is a unique letter on its own that deserves more than to be assaulted with a deadly function, right?

In the end, å can often look like å, and ড় can often look like ড়. Normalization does not destroy language through the equivalance, and neither does collation. Both technologies are simply working to make sure that no matter what is the preferred normalization form in which data is created that they will be treated as being the same. Since they can look the same, this is neither insult nor injury to the language; in fact, it frees one up to not be concerned about them not being treated as equal even if a normalizing process happens to do something to the string....

Ok, I don't get it (and I'd like to, since the SW I'm working on at work currently has problems with canonical-decomposed comparisons).

MVP Omi complains that there is a difference between U+09dc and U+09a1 U+09bc, but Unicode says they are the same? They sure look the same to me... So what is the difference, and why would a user care how that character is represented in the internal unicode representation? (How would that be different for them from any other internal representation, say whatever non-unicode encoding would exist for Bengali, like Shift-Jis exists for Japanese).

Hi legolas,

Think about the example of å -- if you speak a language that considers that to NOT be any kind of 'a' or sort like any kind of 'a' or be ever treated like any kind of 'a', then you are quite naturally unhappy with anyone who claims that it is equivalent to anything that is 'a' combined with anything else.

It is easy to see how someone can take such a claim as insulting to a language that has been around much longer than Unicode, or even computers.....

I believe Unicode Specification is wrong. I means, not properly designed.

What does unicode codes? Does it code glyphs, or does it code characters?
The old draft tell me it's coding the characters, that's why we have all those Han-Unification stuff...

But later, they introduce the surrogate "pairs" (oh, well.. they are not pairs, but also triples, quads and ..... ) as if they are coding the glyph. Of course they realize this would generate duplications.... so they introduce normalizations..
Suddently, all unicode-complaint programs have to ship a large table just to compare if two string is equal.

And then, they introduce the language tagging....

I must say unicode specification was wrong since version 1.0 when they introduce the UCS2 stuffs, limiting the total number of caharacter in the world.

The pairs and extensions are not the soluation, but an ugly hack. I believe you know this better then I.

As a CJK user, I have to support unicode anyway. It's much easier to file a bug report saying "your program does not support unicode" then filing hundreds of bug reports on each of the encoding exists in the world.

> have a canonical equivalance to (and which happen to be visually indistinguishable from) another string that they will be treated as if they were equal.

I have to say..
character X visually indistinguishable from character Y in typographical variance A, does not means they are visually indistinguishable in variance B....

with Vitnam Han glyphs including to unicode, I can see some potential problem on "CJKV" Han unification....

Ok, so re-reading all this, and the original wikipedia bug (which I hadn't read before, sorry ;-), I understand this: the unicode decomposition for these characters, when rendered, does not quite produce the glyph the composed character produces, in some fonts (on the wikipedia bug, I see the difference, here I don't), although the difference certainly looks small.

So what Omi is arguing for on wikipedia really amounts to a 'bug' in the unicode spec? I know what these decompositions are, and I certainly see the theoretical desire to use a canonical form (which may even practically allow to use memcmp to see if something is exactly equivalent?). I'm not completely sure why these decompositions exist (I would think for drawing a char when the font does not have such a char), and I do not understand why the decomposed form would be canonical over the composed form.

Sorry if this is turning into a unicode 101, if it is, let me know or maybe consider it for a future post!

Actually, the claim is non-specific as to which is "preferred" -- some platforms, like Microsoft, usually prefer normalization form C (composed) and others, like Apple, usually prefer normalization form D.

Omi is not arguing this based on the appearance as much as he is on the linguistic preference for the single composed code point that he feels should not be deconstructed. In that view, the fact that they do not always look the same is verification that they are right, not proof....

Well if U+09dc and U+09a1+U+09bc are same, then why I cannot find U+09a1+U+09bc if I search for U+09dc?

I agree that I know nothing about normalization, but as a end user, don't you think this is a simple question can be asked?

The term 'referred' has an interesting consequence which seems to have been overlooked. In this particular case:

<U+09a1, U+09bc> is both NFC and NFD.

<U+09dc> is only normalised in a non-standard normalisation. Therefore it cannot occur in text that has been normalised to a standard normalisation.

Hi Richard,

I do not use the term, so I think maybe you meant 'preferred' here? :-)

But of course not everything and not everyone normalizes, and people will definitely have their preferences. My point is that no matter what you type in, it will work -- both before normalization and even after it, if it happens.....