Be careful what you wish for (just in case it comes true!) aka When a Cedilla needs to be a Comma Below (and vice versa)

by Michael S. Kaplan, published on 2007/01/26 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/26/1535060.aspx


It's funny how sometimes I'll have a blog post on my list of posts to write that, once I get down to writing it, ends up very different than I originally imagined. And then other times, it is pretty much exactly as I had constructed in my mind, perhaps days or weeks or even months before.

This post is much more like the former than the latter, as the post ended up being influenced by someone independently asking me a question about the topic. :-)

Martin asked me:

Windows Vista now supports four new Romanian characters through updated fonts and a new keyboard layout, see http://blogs.msdn.com/michkap/archive/2006/11/19/1104093.aspx for images and Unicode codes.

Pre-Vista, Romanians had to make do with similar looking characters from the Turkish alphabet, namely capital and lower-case s and t with cedilla, U+015e, U+015f, U+0162, U+0163. There is a vast body of Romanian documents and online content with these characters.

Now, say, a Romanian Vista user wants to search a webpage or desktop for “Brașov”, a place name. With the default Romanian Standard keyboard, they enter the string in the IE search or Windows Desktop search with the new spelling, s with comma, U+0219. However, the content they are searching was created with the old spelling, using s with cedilla, U+015f. The search will find nothing. Technically this is by design, but for Romanians who deem these characters as two interchangeable representations of the same sound, this is a bug.

Now it is an interesting point, and one that really can give a person pause. After all, the Romanians have been objecting to use of the cedilla below characters in their language for about as long as people have been using it (and I still have to post that Every Character Has a Story post!), but even ignoring all that there is a serious legacy data issue to contend with, one that just adding it to fonts can't completely help with, no matter how cool stuff like this and this may be. Because it affects text processing as a whole.

Although with that said, the NLS Romanian collation tables on Vista will create the equivalences so that you will get the right results. Therefore:

ș (U+0219, LATIN SMALL LETTER S WITH COMMA BELOW) ≡ ş (U+015f, LATIN SMALL LETTER S WITH CEDILLA)

Ș (U+0218, LATIN CAPITAL LETTER S WITH COMMA BELOW) ≡ Ş (U+015e, LATIN CAPITAL LETTER S WITH CEDILLA)

ț (U+021b, LATIN SMALL LETTER T WITH COMMA BELOW) ≡ ţ (U+0163, LATIN SMALL LETTER T WITH CEDILLA)

Ț (U+021a, LATIN CAPITAL LETTER T WITH COMMA BELOW) ≡ Ţ (U+0162, LATIN CAPITAL LETTER T WITH CEDILLA)

In other words, people who use the Vista collation functions (CompareString, CompareStringEx, and so on) with either MAKELANGID(LANG_ROMANIAN, SUBLANG_DEFAULT) or ro-RO as appropriate will be able to get the right results here.

Of course, as Martin's question points out, other Microsoft products may not fare as well if they do not call our functions or create the equivalences themselves. That is going to lead to some confusion among customers, and it really makes us wish that everyone was calling us, since that would lead to the most consistent experience.
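For code that cannot call the NLS collation functions, "creating the equivalences themselves" could look something like the following. This is only a minimal sketch in Python, not how any Microsoft product does it: the names CEDILLA_TO_COMMA, fold_romanian and romanian_contains are made up for illustration, and the only data in it is the four pairs listed above. It also assumes the text is known to be Romanian, since the cedilla forms are legitimate letters in Turkish and should not be folded there.

```python
import unicodedata

# Hypothetical mapping built from the four equivalences listed in the post:
# legacy cedilla letters -> comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u0162": "\u021a",  # Ţ -> Ț
})


def fold_romanian(text: str) -> str:
    """Return text with cedilla s/t replaced by their comma-below forms.

    NFC is applied first so that a base letter followed by a combining
    cedilla or combining comma below is composed before the mapping runs.
    """
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)


def romanian_contains(haystack: str, needle: str) -> bool:
    """Substring search that treats the cedilla and comma-below forms as equal."""
    return fold_romanian(needle) in fold_romanian(haystack)


legacy = "Bra\u015fov"  # old spelling, s with cedilla
modern = "Bra\u0219ov"  # new spelling, s with comma below

print(modern in legacy)                   # False: the plain search finds nothing
print(romanian_contains(legacy, modern))  # True: the folded search matches
```

Note that Unicode normalization alone does not help here, since the comma-below and cedilla letters are distinct characters with different decompositions; the explicit mapping is what creates the equivalence.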

I'll talk more about this tomorrow, and how the .NET Framework fares here....

 

This post brought to you by Ț (U+021a, a.k.a. LATIN CAPITAL LETTER T WITH COMMA BELOW)


# Cristian Secară on 27 Jan 2007 8:48 PM:

At first glance it appears that Martin is right. When searching with the "new" characters, something written previously with the "old" characters is no longer found.

But as always, things are in gray colour.

I assume Martin is on Microsoft Windows. How about using an Apple MacOS ? A document written in Romanian language on a MacOS was never view-able on Windows, because of the missing ș and ț glyphs on Windows, let alone searching. Now this is solved. Is this a bad thing on Windows ?

Many Internet pages written in Romanian are written by people with poor technical skills, and the systems are set up by people with poor technical skills. Because of this, a popular concept is that it is better to write without diacritical marks at all, so no unexpected bizarre characters will appear (well, some poorly designed giant software systems, like the classic Yahoo Mail, have played a big role in proving to us that the concept is correct). Because of this, a popular concept is that it is better to search on the Internet without diacritical marks at all, so the result will match the real sites. So this too is solved, in a per-se manner.

The Romanian Academy exists, whether we consider this a good thing, or a bad thing, or an annoying thing. In modern days, the situation with the ș and ț characters was never different. The only new thing is that they have now explained clearly in writing that the diacritical mark under the ș and ț is "comma", not "cedilla", and this only after a special request at the time the Romanian keyboard standard was revised. Until now, this was only assumed, not described. If this clarification had been there some 20 years ago, when the ISO-8859-2 standard was first published, then today we would have no problem at all.

What Romanian characters are to be considered after all ? Those specified by the Romanian Academy ? Those generated by Windows systems prior to Vista ? Those generated by MacOS ?

Should the Romanian Academy have been forced to change the cultural rules because, for now, there are a lot of Romanian websites with cedillas, because at one moment in time a technical standard was wrongly written by non-Romanian people ? Or the other way round ?

Yes, unfortunately there will be some mess for some time for now, but the mess was already there, now it only starts a (long) process for clarifying. The solution must come from database programmers, who -- on Internet -- have found already a solution for finding words with cedillas when searching with no diacritical marks at all and vice versa. Now they only have to extend a little their algorithms.

The problem is less a technical one than a communication issue: who tells "them" to do this / that we need this / what we need in fact ? I tried an indirect contact at Google, for example, but with no success so far. I can only hope to find a better/different way to do this and/or to find the proper contact person. Or maybe there are some others like me who act in the same direction, who knows ?

Cristi

# Michael S. Kaplan on 28 Jan 2007 6:01 AM:

Hey Cristi!

Well,  just remember that the easiest way to tell who is doing it right by calling the Vista collation functions in the NLS API is to check this particular issue in Romanian. :-)

# Mar on 29 Jan 2007 12:25 AM:

Google could easily give people using google.ro the preferences option of merging S and T with comma with S and T with cedilla for searching purposes.

# Michael S. Kaplan on 29 Jan 2007 7:41 AM:

Well, I like to avoid the word "easily" when I do not have details about a technical implementation (it is easy to do this in a database conceptually, but perhaps there are technical reasons why it would be complicated to do this?).

:-)

# Cristian Secară on 29 Jan 2007 1:16 PM:

As I said, the problem is less technical (either easy or difficult or impossible, I don't want to speculate), but a relationship problem. How do "they" know about this requirement ?

Cristi

# Michael S. Kaplan on 29 Jan 2007 2:21 PM:

I can barely try to be responsible for helping communicate issues to Microsoft (even that can be a stretch with such a huge company!). I cannot take responsibility for making sure Google is doing the right thing for Romanian. Sorry about that, but there is only so much I can do....

# Cristian Secară on 29 Jan 2007 5:23 PM:

Well, what can I say – thank you for that!

Frankly speaking, I wish that this kind of issue had been communicated/solved via official channels between MS Romania and MS headquarters, but unfortunately, over time, I didn't observe much (if any) preoccupation with the kind of issues we are talking about here ...

On the other hand, as a personal opinion, a subsidiary has probably little initiative on its own, which I can understand to a certain degree.

Cristi

# Luci Şandor on 29 Jan 2007 11:54 PM:

Totally cool. I am amazed by the complexity of these mechanisms. One would not be able to tell how hard it was to implement these things in the first place, until you see how hard it is to change them. I am going to ask for a refund, after so many years misspelling my name :) or maybe I deserve a free upgrade to Vista, if the upgrade fixes this issue.

# Luci Şandor on 29 Jan 2007 11:56 PM:

I can't use Ş on this blog, I totally deserve a free upgrade to Vista so I can write my name.

# Luci Sandor on 29 Jan 2007 11:56 PM:

I can't use Ş on this blog, and I totally deserve a free upgrade to Vista so I can write my name the right way.

# Michael S. Kaplan on 30 Jan 2007 4:10 AM:

Hmmmm..... well, you can try that argument with MS Romania and see what they say. I am not sure they will consider that to be the best argument to use, depending on what your current OS is? :-)

# Cristian Secară on 30 Jan 2007 6:36 AM:

To Luci Șandor: You can write your name correctly in this blog even now, if you are on Win2000 or WinXP.

All you need to do is:

- update your Verdana font, from here http://www.microsoft.com/downloads/details.aspx?FamilyID=0ec6f335-c3de-44c5-a13d-a1e7cea5ddea&DisplayLang=en

- install my updated keyboard layout driver and use it in „Romanian (Standard)” mode to write correctly, from here http://www.secarica.ro/html/ro_kbd_winxp.html

:)

Cristi

# Luci Sandor on 30 Jan 2007 7:37 AM:

Sorry for the multiple posting: it kept reading the wrong name from the cookie when I used "Remember me", and I was afraid I wasn't making my point clear (well, in fact, a misspelling of that kind will make my point even more obvious). Still, it's with a cedilla all over the place, since I am using Windows XP. I have to find a Mac and zoom all the way to see if it displays correctly there.

Unfortunately MS Romania, although very helpful (I remember a smart answer for correcting the Romanian daylight saving time), is a bit overwhelmed, since it is mostly working in Hungary.

There will be no Vista discount, since thousands of Romanians will have the same issue with ş and ţ. Personally, I recall the weird way of typing these letters, using Alt Gr, I recall the times when a file could not be named with a ş or a ă, and I felt grateful since Windows 2000, even though behind the scenes, it was a cedilla all this time.

...And understood why it was this way, when I recalled Romania 20 years ago, at the height of nationalist communism and under some sort of IT embargo.

# Mar on 30 Jan 2007 9:41 AM:

Even if Google won't do it, it would be easy to make a browser extension ;-)

# eiffel on 11 Apr 2007 4:56 AM:

And as a notification - there is no Romanian Standard Keyboard on the Romanian Market - only US Standard.

# Cristian Secară on 11 Apr 2007 6:49 AM:

The German manufacturer Cherry can deliver Romanian standardized keyboards, the only problem may be that they require a minimum order quantity of 1000 pieces :)

Cristi


referenced by

2007/06/07 Putting the camel's nose in Building 24

2007/01/28 Stealth features (like language detection?)
