Stealth features (like language detection?)

by Michael S. Kaplan, published on 2007/01/28 09:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/28/1547147.aspx

Under discussion today are eight characters made famous in several prior posts:

ş (U+015f, LATIN SMALL LETTER S WITH CEDILLA)
Ş (U+015e, LATIN CAPITAL LETTER S WITH CEDILLA)
ţ (U+0163, LATIN SMALL LETTER T WITH CEDILLA)
Ţ (U+0162, LATIN CAPITAL LETTER T WITH CEDILLA)
ș (U+0219, LATIN SMALL LETTER S WITH COMMA BELOW)
Ș (U+0218, LATIN CAPITAL LETTER S WITH COMMA BELOW)
ț (U+021b, LATIN SMALL LETTER T WITH COMMA BELOW)
Ț (U+021a, LATIN CAPITAL LETTER T WITH COMMA BELOW)

Now obviously it takes an explicit act of typography to add letters to a font if they are not there. That is obvious and not at all what this post is about.

Instead, this post is about taking S/s (U+0053/U+0073) or T/t (U+0054/U+0074) and making use of them with U+0326 (COMBINING COMMA BELOW) and U+0327 (COMBINING CEDILLA).
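
As a quick illustration of how these base-plus-mark sequences relate to the eight characters listed above, here is a minimal sketch using Python's standard unicodedata module. The helper names compose and describe are made up for this example. It only shows Unicode canonical composition (NFC) and says nothing about what any particular font or renderer will draw.

```python
import unicodedata

CEDILLA = "\u0327"      # COMBINING CEDILLA
COMMA_BELOW = "\u0326"  # COMBINING COMMA BELOW


def compose(base, mark):
    """Return the NFC (canonically composed) form of base + combining mark."""
    return unicodedata.normalize("NFC", base + mark)


def describe(text):
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Print the composed form of every S/s/T/t + mark pairing discussed above.
for base in "sStT":
    for mark in (CEDILLA, COMMA_BELOW):
        print(describe(compose(base, mark)))
```

Each of the eight pairings composes to a single code point, which is why a renderer can go looking for a precomposed glyph even when the text was typed as two separate characters.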

It was back in April of last year that our good friend Cristi asked in the microsoft.public.win32.programmer.international newsgroup:

I tried to do the following Unicode character combination, using Times New Roman font:

0074 0327 (with no space between them)

where 0074 is latin small letter t and 0327 is combining cedilla below. The expected displayed result should have been character t with cedilla below, but the actual displayed character is t with comma below.

I know that in TNR font the glyph associated with U+0163 has comma below, but what has this to do (a distinct Unicode combined character) with separate base character + combining diacritical mark combination ?

HOWEVER, the strangest thing is this: if the 0074 0327 combination is preceded by (let's say) 0061 0327 (that is the displayed character a with cedilla below), the displayed t with comma below becomes t with cedilla below !

What's the mess ?

I tried this in Wordpad (WinXP) and MS Word.

Cristi

Now it isn't all that confusing if you think about it. After all, when rendering, the first thing that is always attempted is to find a glyph in the font that represents the composite (combined) form of the letter and diacritic.

Trying to literally combine a diacritic with a base character at a specific defined "attachment" point is a second best option.

And of course the attempt to shove the diacritic in without that knowledge is a distant third, one that leads to problems like this one I talked about in Cyrillic (that affects Bulgarian).
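
The order of preference described above (precomposed glyph first, attachment point second, blind placement third) can be sketched roughly as follows. This is a toy model, not how any real text engine is implemented: choose_strategy, font_glyphs, anchored_bases, and the sample font data are all hypothetical stand-ins for information a real font would carry.

```python
import unicodedata


def choose_strategy(base, mark, font_glyphs, anchored_bases):
    """Pick a rendering strategy for base + combining mark.

    font_glyphs: set of characters the (hypothetical) font has glyphs for.
    anchored_bases: set of base characters with a defined attachment point.
    """
    composed = unicodedata.normalize("NFC", base + mark)
    # First choice: a single precomposed glyph that exists in the font.
    if len(composed) == 1 and composed in font_glyphs:
        return ("precomposed", composed)
    # Second choice: attach the mark at the base's defined anchor.
    if base in anchored_bases:
        return ("anchored", base + mark)
    # Distant third: place the mark with no positioning knowledge.
    return ("unpositioned", base + mark)


# Made-up font: has U+0163 precomposed, and an attachment point only for "s".
font = set("st") | {"\u0163", "\u0327"}
anchors = {"s"}

print(choose_strategy("t", "\u0327", font, anchors))
print(choose_strategy("s", "\u0327", font, anchors))
print(choose_strategy("t", "\u0326", font, anchors))
```

Under this model, t followed by U+0327 picks up whatever shape the font drew for U+0163, which fits Cristi's remark that the Times New Roman glyph for U+0163 has a comma below. The context-dependent switch he saw is not covered by this sketch.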

The behavior that Cristi reported dates from before most of Microsoft's shipping fonts included the newer "comma below" characters, but certainly after U+0326 existed and long after the Romanians had made it clear they preferred the "comma below" form to the "cedilla below" form in their text.

The font is simply trying to help Romanian documents look more Romanian, using surrounding clues in the text to try to decide which form is best to show. It is really just a somewhat sophisticated form of language detection via letter choice, if you think about it.

That is actually kind of cool, in my opinion.

I really wish that this functionality existed in a callable form, but unfortunately it does not (MLang's encoding support includes a locale parameter, but no sophisticated work is done to fill it in at present).

Of course this one particular occurrence of the feature is less important now in Vista where the support for the correct characters is there. But as a stealth feature that few people ever seemed to notice before, it is still pretty interesting, if you ask me. :-)

This post brought to you by ̧ (U+0327, a.k.a. COMBINING CEDILLA)

# Cristian Secară on 28 Jan 2007 11:09 AM:

> That is actually kind of cool, in my opinion.

Well, that's not so cool in my opinion, at least not for languages where their national keyboards include the language specific characters on their own separate keys.

What is the reason for the dead keys? I can speak for my language, Romanian: their presence is useful when writing in foreign languages, not in my own *). Likewise, general use of combining diacritical marks should deserve the same concept, once my own specific characters are already there separately.

Who decides in what language I want to write something ? I know, in Word I can specify the language document, but (1) maybe I write something in Romanian but include inline examples using German phrases, or (2) maybe I am using Wordpad, for simple document tasks.

Perhaps some sort of OFF/ON switch for this "cool" option would be best.

*) The need for dead keys when writing in the Romanian language arises only seldom, when one wants to show the pronunciation accent (like in dictionaries), or when one wants to avoid confusion when a word may have different senses, like cópii versus copíi (copies versus children)

Cristi

# Michael S. Kaplan on 28 Jan 2007 12:16 PM:

Well Cristi, I think you need to start by providing the real, non-contrived example where you get the wrong result. And then we can go from there about what needs to be improved in the future. :-)

# Cristian Secară on 28 Jan 2007 3:38 PM:

Ok, I will keep that in my mind and will try to provide feedback once I will find a tangible example in my universe. Until then, for me this remains probably just a mind game, something similar to the undisableable automatic feature of Notepad which may conduct to games like this one here http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx :)

Cristi

# Michael S. Kaplan on 28 Jan 2007 4:28 PM:

Hmmm.... not really a very fair comparison in this case, if you ask me.
