Inconsistencies aren't as important when they're irrelevant

by Michael S. Kaplan, published on 2012/09/11 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/09/11/10348081.aspx


If you look at Unicode as it is today, it is a hugely complex standard that defines more than 100,000 characters and has defined complex algorithms for using, displaying, sorting, and storing them.

Though there were simpler times, too.

I mean, before the Unicode Collation Algorithm was defined in UTS #10, no way to sort the characters was defined.

Every company had their own way to do it themselves -- it's not like Sybase or Oracle or IBM or Microsoft was going to build databases one could query via SQL without defining something useful for the ORDER BY clause to do, after all!

Some of those companies picked up UTS #10 as an option for collations.

And some companies that came along later chose to use it once it was there.

But when one considers the fact that every character needs a position in the DUCET so it can have a place to go in the order of characters, there are two kinds of characters to consider:

The first category is easy -- there was a huge push in the UTS definition to decide the order.

The second category is obviously a bit more complicated -- every set of characters proposed gives some suggestions about the collation of them, and the UTC places them all somewhere based on that feedback, their expertise, and their knowledge of how the UAX and Unicode work.

Bugs are sometimes found; they are fixed in future versions (after they have been identified).

Old versions are left alone; no one wants to break existing behavior, or database indexes....

Now the entire Unicode Standard works that way.

Even the Standard Annexes, like the Unicode Bidirectional Algorithm defined in UAX #9.

Now ordering is needed in the UCA, even when it makes no real sense -- like ordering Emoji or other symbols.

And the UBA has such cases too.

Because if you want to have a formal standard that defines how everything in Unicode behaves in bidirectional contexts, you have to also include that special category of characters.

I refer, of course, to the set of characters that no reasonable human ever expects to use in bidirectional contexts!

Enter these two characters:

These two characters, these two symbols, have the same properties as 0028 and 0029 -- and every other bracket pair that exists in Unicode.

But the Bidi "mirroring" property is not defined for them like it is for every other pair of brackets.

So they never mirror.
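
A quick way to see this for yourself is to ask a Unicode character database. The sketch below uses Python's standard `unicodedata` module and assumes that the two characters in question are the ornate parentheses (U+FD3E and U+FD3F) that show up in the comments on this post. The `mirror_report` helper is made up for illustration, and the result reflects whatever Unicode version your Python build ships with.

```python
import unicodedata

# Assumption: the two characters under discussion are the ornate
# parentheses that appear in the comment thread of this post.
ORNATE = ["\uFD3E", "\uFD3F"]
ASCII_PARENS = ["\u0028", "\u0029"]


def mirror_report(chars):
    """Return (code point, name, Bidi_Mirrored flag) for each character.

    Hypothetical helper for illustration only.
    """
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), bool(unicodedata.mirrored(c)))
        for c in chars
    ]


# Print one row per character so the two pairs can be compared.
for row in mirror_report(ASCII_PARENS + ORNATE):
    print(row)
```

If the assumption holds, the ASCII pair should report the mirrored flag as set and the ornate pair should not.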

Why would they? Their purpose in the standard is to support a legacy character in an East Asian standard, a place where Bidi is, for most of the users and potential users, irrelevant to consider.

Eventually, this issue was discovered, but it was discovered too late.

The attempt to fix it led to problems in actual usage.

No one wanted to break existing usage that way.

And thus, a permanent exception was born.

Every once in a while, someone notices the problem again.

It happened just the other day, in fact.

And the issue was explained yet again. :-)

How bad the problem is depends on how you look at Unicode, and which you think is more important -- the intuitive global behavior of the algorithms, or the realization that when a scenario is not relevant you don't care so much about leaving a case alone....


Nick on 11 Sep 2012 7:56 AM:

So if I made a ":﴿" with the ornate parenthesis, bidi could turn it into a ":﴾"?

Michael S. Kaplan on 11 Sep 2012 9:10 AM:

Actually, these are the only two parens that *won't* ever mirror!

Joshua on 11 Sep 2012 9:33 AM:

Exactly. It would turn into ): instead of (:. Now why does database collation have to care about that one?

Michael S. Kaplan on 11 Sep 2012 5:33 PM:

Database collation doesn't. But if you use ORNATE PARENTHESES in your ASCII art, then any pain you feel would be self-inflicted!

