Don't muck with the combining character order

by Michael S. Kaplan, published on 2006/02/21 03:11 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/21/535808.aspx


The communicative property of addition clearly does not apply to combining marks in Unicode.

Or at least it is not supposed to.

I mean, A + B is not the same as B + A, in any situation where that order is meant to enforce how they are placed in relation to each other.

Anyway, regular reader Mike Dunn asked in the Suggestion Box about an exception to this:

After reading your post about putting lots of diacritics on a letter, I wondered what determines the order that they appear in.

I looked at the sequences 0065 0302 0303 and 0065 0303 0302 using Tahoma on XPSP2 in Notepad and Word 2000, and in both cases the diacritics appear in the same order (tilde above the circumflex). This is the right order for Vietnamese, but if I were writing IPA, I would want the circumflex on top. Can the order be changed with control characters?

That, my dear Mike, is an excellent question. One whose answer (now that you asked) I was very curious about. Why do these two sequences:

look the same, anyway?

Both U+0302 (COMBINING CIRCUMFLEX ACCENT) and U+0303 (COMBINING TILDE) have the same canonical combining class value -- 230, which means 'Above'. So there is no valid Unicode-type reason for them to re-order.
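As a quick check of those combining classes, here is a minimal Python sketch using the standard unicodedata module (the helper name combining_classes is just for illustration). Since both marks share class 230, canonical ordering should leave each sequence exactly as typed, so the two sequences stay distinct:

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
TILDE = "\u0303"       # COMBINING TILDE


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


seq_circumflex_first = "e" + CIRCUMFLEX + TILDE  # 0065 0302 0303
seq_tilde_first = "e" + TILDE + CIRCUMFLEX       # 0065 0303 0302

# The base letter is class 0; both marks are class 230 ('Above').
classes = combining_classes(seq_circumflex_first)

# Canonical decomposition (NFD) only reorders marks whose classes differ,
# so neither sequence is changed and they remain different strings.
nfd_circumflex_first = unicodedata.normalize("NFD", seq_circumflex_first)
nfd_tilde_first = unicodedata.normalize("NFD", seq_tilde_first)
sequences_stay_distinct = nfd_circumflex_first != nfd_tilde_first
```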

Now it is true that one sequence is encoded as a precomposed character in Unicode and one is not, but still!
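The precomposed difference can be seen with canonical composition (NFC). This is a small sketch with Python's unicodedata module, and the variable names are illustrative only. It shows what normalization does to the strings; it says nothing about what any particular renderer does:

```python
import unicodedata

circumflex_then_tilde = "e\u0302\u0303"  # 0065 0302 0303
tilde_then_circumflex = "e\u0303\u0302"  # 0065 0303 0302

# Circumflex first: fully composes to the single code point U+1EC5
# (LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE).
nfc_first = unicodedata.normalize("NFC", circumflex_then_tilde)

# Tilde first: only e + tilde composes (to U+1EBD); the circumflex
# stays behind as a separate combining mark.
nfc_second = unicodedata.normalize("NFC", tilde_then_circumflex)

code_points_first = [f"U+{ord(ch):04X}" for ch in nfc_first]
code_points_second = [f"U+{ord(ch):04X}" for ch in nfc_second]
```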

I was determined to find out what was going on.

Luckily, down the hall is the best freaking font team in the world, so all I had to do was head down the hall to ask somebody.

Hmmmm.... seems like a lot of people are out right now. I made it all the way down to Nick's office, where he was talking to Mushegh. Aha, maybe they would be able to help.

I started by apologizing to them, since although I do not consider them to be "the dregs" in any kind of quality sense, they ended up being treated as the dregs due to the distance between my office and theirs. They smiled, which I took as a good sign. And then I asked them about the above....

This is actually a known issue. It is a side effect of a bug in the way that the code was looking for precomposed forms (on the assumption that a precomposed version is more likely to look correct if it exists). The bug was causing precomposed characters with the wrong order for combining sequences to sometimes be found....
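To make that kind of bug concrete, here is a toy model in Python. It is not the actual shaping code, which is not shown here; the table and both function names are hypothetical, and the "sort the marks" step is just one assumed way a lookup could end up ignoring order:

```python
# Hypothetical table of precomposed forms, keyed by base letter plus the
# exact order of the combining marks that follow it.
PRECOMPOSED = {
    ("e", ("\u0302", "\u0303")): "\u1ec5",  # e + circumflex + tilde
}


def find_precomposed_buggy(base, marks):
    """Order-insensitive lookup: sorting the marks throws their order away,
    so a sequence typed in the 'wrong' order still finds U+1EC5."""
    return PRECOMPOSED.get((base, tuple(sorted(marks))))


def find_precomposed_fixed(base, marks):
    """Order-sensitive lookup: only the exact mark order matches."""
    return PRECOMPOSED.get((base, tuple(marks)))


tilde_first = ["\u0303", "\u0302"]       # tilde, then circumflex
circumflex_first = ["\u0302", "\u0303"]  # circumflex, then tilde

buggy_hit = find_precomposed_buggy("e", tilde_first)       # wrongly matches
fixed_miss = find_precomposed_fixed("e", tilde_first)      # no match
fixed_hit = find_precomposed_fixed("e", circumflex_first)  # correct match
```

In this model the buggy lookup renders both inputs with the same precomposed glyph, which matches the symptom described in the question.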

The good news is that Nick himself had checked in the fix for this bug in Vista, which now does things correctly:

It has not been backported to the prior versions of Windows, though that is the sort of thing which can of course be considered and triaged appropriately....

Now the other part of the question -- how to force the right behavior on the downlevel platforms, there were not too many ideas forthcoming.

Obviously if you are building the font you decide what precomposed characters will exist in it -- you can even have none exist and rely on the attachment points and such to build up the right character.

If you are not doing the font building yourself, you would have to find a way to break up the sequences without changing the display, which can be a real challenge (no one thought of anything offhand).

One way that I did find was putting together U+1ebd U+0302 (LATIN SMALL LETTER E WITH TILDE and COMBINING CIRCUMFLEX ACCENT), although I found it would work in some fonts (such as Segoe UI) and not so well in others (such as Tahoma). See below if you have these fonts both installed:
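For what it is worth, that workaround sequence is canonically equivalent to 0065 0303 0302, so it does not change the text, only how it is spelled. A minimal check with Python's unicodedata module (variable names are illustrative):

```python
import unicodedata

workaround = "\u1ebd\u0302"       # U+1EBD U+0302
intended = "e\u0303\u0302"        # 0065 0303 0302 (circumflex on top)
vietnamese = "\u1ec5"             # U+1EC5 (tilde on top)

names = [unicodedata.name(ch) for ch in workaround]

# Decomposing the workaround gives exactly the tilde-then-circumflex
# sequence, and not the decomposition of the Vietnamese letter.
same_as_intended = unicodedata.normalize("NFD", workaround) == intended
differs_from_vietnamese = (
    unicodedata.normalize("NFD", workaround)
    != unicodedata.normalize("NFD", vietnamese)
)
```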

ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂          ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂ẽ̂

If you do not have Segoe UI installed then it will not look good, so don't bother reporting that as a bug!

So anyway, I headed back to my office and decided to perhaps not just rely on office locations to decide where I visit first -- because sometimes the best people to talk to would otherwise be dismissed as the dregs, and neither Mushegh nor Nick qualify as the dregs in my book. :-)

 

This post brought to you by "ễ" (U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE)


# Phylyp on 21 Feb 2006 4:38 AM:

Michael,
I think you mean 'commutative', not 'communicative'. Unless there's a joke in there that I'm missing :)
~ Phylyp

# Michael S. Kaplan on 21 Feb 2006 7:55 AM:

No, I meant communicative (cf: http://www.uwcsea.edu.sg/elemcurriculum/g1maths.htm )....

# Andrew West on 21 Feb 2006 9:29 AM:

"Now the other part of the question -- how to force the right behavior on the downlevel platforms, there were not too many ideas forthcoming."

On a hunch I inserted U+034F Combining Grapheme Joiner (CGJ) between the tilde and circumflex (i.e. <U+0065 U+0303 U+034F U+0302>), and on XP/Notepad with the Doulos SIL or Charis SIL font selected you get the circumflex above the tilde. Unfortunately it doesn't work with any other font on my system, although if you simply use the Vista version of Uniscribe on XP (the simplest solution), then it all works OK without any need to add the CGJ, but only with the SIL fonts and Code2000, not with any of the Microsoft fonts.
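To illustrate the sequence described in this comment, here is a small Python sketch with the standard unicodedata module (variable names are illustrative). It only looks at the string level: CGJ has combining class 0, and it survives normalization, so the sequence with CGJ is a different string from the plain one. Whether a font or renderer then stacks the marks as hoped is a separate matter, as the comment notes:

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

plain = "e\u0303\u0302"               # 0065 0303 0302
with_cgj = "e\u0303" + CGJ + "\u0302"  # 0065 0303 034F 0302

cgj_name = unicodedata.name(CGJ)
cgj_class = unicodedata.combining(CGJ)

# NFC composes e + tilde to U+1EBD; the CGJ and the circumflex remain.
nfc_with_cgj = unicodedata.normalize("NFC", with_cgj)

# The CGJ is kept by normalization, so the two spellings are not
# canonically equivalent.
not_equivalent = (
    unicodedata.normalize("NFD", with_cgj)
    != unicodedata.normalize("NFD", plain)
)
```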

# bmm6o on 21 Feb 2006 11:31 AM:

Did you leave off the smiley?  You really meant "commutative".  Just because other people are confused...

Interesting article BTW.

# Michael S. Kaplan on 21 Feb 2006 12:04 PM:

No smiley intended -- if you look at the link I put up (or tons of others like it) or search for the "communicative property of addition" you will see lots of hits.

It is also how I remember learning it growing up.

So, am I wrong? Well, maybe. But if entire school system curricula are wrong with me then I don't feel like I am in bad company. :-)

# Johan Petersson on 21 Feb 2006 12:44 PM:

Anyone who implicitly trusts what he learned in school is in bad company if you ask me.

Communicative means "inclined to communicate readily" (i.e. talkative).

Commutative means "independent of order" (e.g. order of operands).

# Maurits [MSFT] on 21 Feb 2006 12:54 PM:

communicative property is wrong
commutative property is right

Just goes to show you, no matter what kind of an idea you may have, there's a Google search to prove that others have the same idea

# Maurits [MSFT] on 21 Feb 2006 1:02 PM:

But if you insist...

http://tinyurl.com/q27m7

# Michael S. Kaplan on 21 Feb 2006 1:46 PM:

Some people find me to be very talkative, which is perhaps why I prefer the word I do?

Not a matter of implicit trust, just pointing out that if I am wrong I don't seem to be alone, to a degree that ignoring this other usage may not be in anyone's best interest.

Though it would be nice to get back to the actual subject of the post now, no wordinista work is required here. :-)

# Maurits [MSFT] on 21 Feb 2006 2:15 PM:

Don't take offen[cs]e... we knew what you meant, we're just trying to help.

You're not to blame if your teacher was using a flawed curriculum... if enough people start using "communicative" it may very well become acceptable.

But... speaking as a math major... I've personally never heard the (A+B = B+A) property of addition described as anything but "commutative."

Just out of curiosity... what do you call these properties?

A + B = B + A: commutative/communicative property of addition
(A + B) + C = A + (B + C): associative property of addition
A * (B + C) = (A * B) + (A * C): distributive property of multiplication over addition

A = A: reflexive property of equality
A = B -> B = A: symmetric property of equality
A = B and B = C -> A = C: transitive property of equality

FWIW, my pre-algebra teacher explained the word "commutative" in terms of the word "commuter" (as in people who drive to work) so I've got the old term drummed fairly well into my head. :)

Still, as Shakespeare almost said, "a property of addition by any other name would smell as sweet..."

# Michael S. Kaplan on 21 Feb 2006 2:22 PM:

I use the same names as you for the other ones.

And I took no offense...

Plus when I did the googlefight without the quotes and with communicative first, they almost matched. :-)

# Dean Harding on 21 Feb 2006 6:09 PM:

I also notice the icon for notepad has changed direction. Oooh! How exciting :p~

# Gabe on 22 Feb 2006 2:11 AM:

Interestingly, I did a search for ["communicative property" commutative] and it seems that about a third of the people who use the term "communicative property" do so on pages that also use the proper term "commutative". This indicates to me either a typo, a spellcheck correction error, or a mishearing (like people who write "must of" instead of "must've").

I had a friend from Beachwood who thought that the measuring devices called calipers were actually "calibrators", but I doubt it was the fault of the school district. This is all probably a case of misreading or mishearing the word.

To bring this back on topic though, my standard XPsp2 browser shows an "e" with a circumflex right above it and a tilde on top for U+1ec5. Is this because Tahoma has a glyph for that codepoint or does IE know how to compose it with the sequence 0065 0302 0303?

# Michael S. Kaplan on 22 Feb 2006 7:19 AM:

Hi Gabe -- that is very odd (it's definitely not what I see for Tahoma in IE). What version of Tahoma do you have?

# Gabe on 22 Feb 2006 4:42 PM:

My version of Tahoma is 3.14. To clarify, my version of U+1ec5 looks just like the one shown in the graphic at http://www.fileformat.info/info/unicode/char/1ec5

It's not that I don't get what I expect, I'm just wondering how it gets to be that way.
