Every character process has a story, too. And some are better than others....

by Michael S. Kaplan, published on 2011/11/18 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/11/18/10238393.aspx


As the Unicode 6.1 beta marched along happily, Peter Constable noticed and asked the following:

In UnicodeData.txt (both 6.0 and 6.1 beta), why is the bidi category of 1F48C LOVE LETTER set to L rather than ON? By design, or (I'm guessing) a bug?

And Ken Whistler jumped in with the answer and explanation:

It is an artifact of the heuristic which is used to assign initial values to the 1000+ new entries typically appearing for a new version of UnicodeData.txt for a release. That heuristic guesses that a character with "LETTER" in its name is a letter, and assigns initial Bidi_Class properties accordingly, before I go through attempting to find all the exceptions manually and correcting them.

Apparently both I and everybody else missed this during the beta review for Unicode 6.0. This clearly is a bug, and should be fixed for 6.1. I'd suggest dropping a short note in the hopper as feedback on PRI #206, so we don't lose track of this and remember to get it fixed along with anything else that turns up in the data files.

BTW, when you report that one, there is another with the exact same problem:

U+1F524 INPUT SYMBOL FOR LATIN *LETTER*S

which is also bc=L, instead of the expected bc=ON.

Cf.

U+1F520 INPUT SYMBOL FOR LATIN CAPITAL LETTERS

which *did* get corrected, and is the expected bc=ON.

--Ken
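If you want to see what your own copy of the data says about these three characters, here is a minimal sketch using Python's standard unicodedata module. The helper name bidi_report is just something made up for illustration, and what it prints depends entirely on which version of the Unicode data your Python build bundles (reported by unicodedata.unidata_version).

```python
import unicodedata

# The three code points discussed in the email above.
CODE_POINTS = [0x1F48C, 0x1F524, 0x1F520]


def bidi_report(code_points):
    """Return (label, name, Bidi_Class) tuples from the bundled Unicode data.

    bidi_report is a hypothetical helper for illustration only.
    """
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((
            f"U+{cp:04X}",
            unicodedata.name(ch, "<unnamed>"),   # fallback if unassigned
            unicodedata.bidirectional(ch),       # '' if unassigned
        ))
    return rows


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for label, name, bidi in bidi_report(CODE_POINTS):
        print(label, name, "bc=" + bidi)
```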

Now obviously, the heuristic Ken refers to here could easily be improved.

For example, if the name says LETTER with nothing after it, then maybe it's a love letter rather than an actual letter.

And again, for example, if the name has the word SYMBOL in it, then perhaps that should override its having the word LETTER in it.

And so on.

You get the idea....
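To make that concrete, here is a rough sketch of both versions. It is not Ken's actual tooling, which is not shown anywhere in the email; the function names, the choice of None to mean "no guess, review by hand", and the choice of ON for the SYMBOL case are all assumptions on my part. Treating every letter as bc=L is also a simplification, since letters in right-to-left scripts would need R or AL.

```python
def naive_guess(name):
    """Guess as the email describes: "LETTER" anywhere in the name means a letter.

    Returns "L", or None (an assumed convention for "no guess").
    """
    return "L" if "LETTER" in name else None


def refined_guess(name):
    """Same idea with the two tweaks suggested above (hypothetical rules)."""
    words = name.split()
    if "SYMBOL" in words:
        # SYMBOL overrides LETTER; ON is assumed here because that is the
        # value Ken expected for the two INPUT SYMBOL characters.
        return "ON"
    if "LETTER" in words[:-1]:
        # LETTER followed by something, as in LATIN SMALL LETTER A.
        return "L"
    # A bare trailing LETTER (or no LETTER at all): leave for manual review.
    return None


if __name__ == "__main__":
    for name in (
        "LOVE LETTER",
        "INPUT SYMBOL FOR LATIN LETTERS",
        "INPUT SYMBOL FOR LATIN CAPITAL LETTERS",
        "LATIN SMALL LETTER A",
    ):
        print(name, "| naive:", naive_guess(name), "| refined:", refined_guess(name))
```

It is still only a heuristic, so the manual pass over the exceptions would not go away. It would just have a few less things to catch.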

