The Unicode train left the station YEARS ago, in fact! (2012 edition)

by Michael S. Kaplan, published on 2012/03/26 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/03/26/10286723.aspx


The other day, in Sacrificing the 'A' func to make Romanian better? aka There's Something About Marţi…, I laid out a bold move to look into two possible courses for the Romanian locale:

Now for some, Tuesday is just another day.

But if you are Romanian, or at least Romanian enough to worry about the Marţi/Mar?i/Marți issues, this question becomes crucial.
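
To make the three spellings concrete, here is a small Python sketch. It assumes that Python's built-in `cp1250` codec is a fair stand-in for the Windows code page 1250 table; it is not the NLS conversion code itself, and it does no best-fit mapping. The helper name `to_ansi` is made up for the illustration.

```python
# Two letters that look alike but are different code points.
T_CEDILLA = "\u0163"      # ţ LATIN SMALL LETTER T WITH CEDILLA (present in cp1250)
T_COMMA_BELOW = "\u021b"  # ț LATIN SMALL LETTER T WITH COMMA BELOW (absent from cp1250)

legacy = "Mar" + T_CEDILLA + "i"       # the spelling legacy apps can display
correct = "Mar" + T_COMMA_BELOW + "i"  # the linguistically correct spelling

def to_ansi(text, codepage="cp1250"):
    """Hypothetical helper: encode to a legacy code page, using '?' for unmappable characters."""
    return text.encode(codepage, errors="replace")

legacy_bytes = to_ansi(legacy)    # every character has a slot in cp1250
correct_bytes = to_ansi(correct)  # the comma-below letter is replaced by '?'

print(legacy_bytes)
print(correct_bytes)
```

Under these assumptions the cedilla spelling survives the trip through the code page, while the comma-below spelling comes back as Mar?i.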

Some suggestions were made in comments, like Jesse, who suggested:

Could you make a compatibility shim for applications that call GetLocaleInfoA under the Romanian language to get the original string back?  It's a hack, but it's probably the best solution if you want the best of both worlds (proper localization and supporting legacy apps).

This is an interesting idea, but there are some strong technical challenges to creating such a shim (which would also have to support other NLS functions like GetDateFormatA and GetCalendarInfoA and so on).

And there are even low-pri potential security risks of having the Unicode and ANSI functions returning different results that look almost the same.

Not to mention performance issues that such a shim might introduce.
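
To illustrate the "look almost the same" concern, here is a minimal Python sketch of what such a shim would effectively do. Everything in it is an assumption made for illustration: `get_day_name_w` and `get_day_name_a` are hypothetical stand-ins for the Unicode and ANSI entry points, not real NLS APIs, and the locale data is reduced to a single string.

```python
import unicodedata

# Hypothetical locale data: the Unicode string uses t with comma below (U+021B).
DAY_NAME_UNICODE = "Mar\u021bi"

# Hypothetical shim table: comma-below letters -> the cedilla letters cp1250 has.
LEGACY_FALLBACK = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})

def get_day_name_w():
    """Stand-in for the Unicode ('W') function: returns the string unchanged."""
    return DAY_NAME_UNICODE

def get_day_name_a(codepage="cp1250"):
    """Stand-in for a shimmed ANSI ('A') function: swap in legacy letters, then encode."""
    return DAY_NAME_UNICODE.translate(LEGACY_FALLBACK).encode(codepage)

wide = get_day_name_w()
ansi = get_day_name_a().decode("cp1250")

# The two results look alike but do not compare equal, with or without normalization.
same_raw = wide == ansi
same_nfkc = unicodedata.normalize("NFKC", wide) == unicodedata.normalize("NFKC", ansi)
print(same_raw, same_nfkc)
```

An application that mixes the two call paths would end up holding two strings that render nearly identically yet fail an equality check, which is the kind of mismatch the security concern is about.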

Or the suggestion of our good friend Cristian, who opined:

There is a way to have "Marți" also in ANSI apps, but you don't like it :) Namely adding support for ISO/IEC 8859-16:2001 as CP28606.
 
Please, pretty please add support from this last ISO codepage! Please add Windows to the list of operating systems which support it (MacOS, Linux, ReactOS)
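
For reference, the encoding Cristian mentions does have a slot for the comma-below letters. A quick Python check, using Python's own `iso8859_16` and `cp1250` codecs as stand-ins for the tables in question (an assumption; this says nothing about how a Windows code page 28606 would behave if it existed):

```python
word = "Mar\u021bi"  # Marți, with t comma below (U+021B)

# ISO/IEC 8859-16 assigns a single byte to U+021B, so this succeeds.
iso_bytes = word.encode("iso8859_16")

# Code page 1250 has no slot for U+021B, so a strict encode fails.
try:
    word.encode("cp1250")
    cp1250_ok = True
except UnicodeEncodeError:
    cp1250_ok = False

print(iso_bytes, cp1250_ok)
```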

Now there are numerous problems that this would introduce:

  1. We are not supporting any new code pages;
  2. We are not adding any more ANSI code pages;
  3. Our ANSI/OEM code pages have never been an ISO code page;
  4. We have never ever ever ever ever ever changed a locale's ACP or OEMCP.

No matter how strongly I may feel trying to make progress in Marţi vs. Marți, I would not be terribly willing to change all of the rules here.

And even if I were so inclined, the team that owns decisions 1-3 would not want to go down that road.

Of course, as I warned in this blog and this blog and this blog and this one, the Unicode train has left the station.

In the past (e.g. in those other blogs), you can see people arguing that it's time to move forward.

And if that's true, it really has to apply to Romanian, too.

Clearly if we go the Unicode route, there will be at least one KB article, with a title something like:

Some older programs show Mar?i instead of Marți in formatted Romanian dates

With the recommended solution being to update the older program!

And if CSS couldn't find an author I'd volunteer my services. :-)

So anyway, if it really is not possible to make it work well for both cases and one had to choose, I know what I would do.

What would you do?

What would YOU do? :-)


Cristian on 26 Mar 2012 9:02 AM:

Unicode all the way - Marți.

I understand the reluctance of touching code pages, but what about keyboard layouts? Can we get a Romanian "Legacy" keyboard with s and t comma below? But this might be for another blog post.

Andrei on 26 Mar 2012 11:33 AM:

Unicode please, otherwise it will take waaaay too long to get rid of "ş" and "ţ".

Random832 on 26 Mar 2012 12:22 PM:

I would add a best fit mapping to codepage 1250. Ideally, this should have been done when Windows first claimed to support Unicode 3.0. If necessary, declare that new best fit mappings for characters added to Unicode after a codepage was first defined do not in fact constitute a new codepage or an alteration to a codepage.

Michael S. Kaplan on 26 Mar 2012 1:12 PM:

Since we have published the files, it would be considered a change. :-(

Joshua on 26 Mar 2012 3:33 PM:

What I would do: add support for UTF-8 as an ANSI codepage. There's already a number reserved for it (I think it is 65002). This would fix something like 99% of remaining programs that refuse to change.

Michael S. Kaplan on 26 Mar 2012 5:22 PM:

There are a host of reasons why we can't do *that*, previously discussed....

mpz on 26 Mar 2012 8:10 PM:

Just force the change. It has to be done sooner or later, and when you do it later, you'll find yourself wondering why you didn't make the change sooner.

Nicu on 26 Mar 2012 10:41 PM:

Go with Marți. You should not go on forever supporting a mistake from the past; it is time to move to the better and more correct version.

Azarien on 27 Mar 2012 2:09 AM:

Couldn't you just make Unicode character ț map into ANSI character ţ for Romanian locale?

Azarien on 27 Mar 2012 2:18 AM:

and just a thought: i think that Marti would still be better than Mar?i. Either way, go with Marți for Unicode API.

Random832 on 27 Mar 2012 6:27 AM:

"Since we have published the files, it would be considered a change." So was the Euro. And the reasons that the Euro change was bad aren't obviously applicable to adding a best fit mapping: it doesn't actually change the semantics of any data _in_ the codepage, or [if this were done back when 3.0 was first supported] any unicode character that already existed. It'd be more analogous to adding character properties and sorting table support for the new unicode characters themselves.

Cristian on 27 Mar 2012 6:59 AM:

Not having s and t comma bellow in any Windows code page restricts fully Unicode aware applications from other operating systems to be used on Windows without some code change.

For example Putty, Emacs, Vim cannot be used with s and t comma bellow characters because they use UTF-8 Unicode and not UTF-16 Unicode.

It's not just old ANSI applications which do not work on Windows, but also fully Unicode aware applications :-(

The title of the post should be: "The UTF-16 Unicode train left the station YEARS ago, in fact! (2012 edition)"

Mihai on 27 Mar 2012 5:13 PM:

Of course I would like to eat the cake and have it too.

But "if it really is not possible to make it work well for both cases and one had to choose", then I would force the change and go with comma.

Mihai on 27 Mar 2012 5:17 PM:

@Cristian

"For example Putty, Emacs, Vim cannot be used with s and t comma bellow characters because they use UTF-8 Unicode and not UTF-16 Unicode."

s/t with comma is the last of their problem.

If they want to handle Unicode they have to "convert at the edge" like everything else. They might even have that already.

But I have seen them misbehaving quite a bit on Linux too.

I would not really call them "fully Unicode aware" but more like "agnostic", since they (more often than not) just move a bunch of bytes around. If the CRT does the right thing (and often it does), they work. If they try to "optimize" around the CRT, they fail.

mpz on 28 Mar 2012 8:51 AM:

@Cristian

What are you talking about? PuTTY etc. support Unicode just fine. The standard in terminal communications is UTF-8, and PuTTY understands that, while talking UTF-16 to the Windows APIs. I just tried copying and pasting Marți in a terminal session, and it worked fine.

Please stop spouting garbage.

