Latvian. Genitive. Oops.

by Michael S. Kaplan, published on 2010/09/09 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/09/10059644.aspx


This blog is about an issue a customer forwarded to me recently, but I found out (just after I had it mostly written) that there was a bug reported already. I still give the reader credit for bringing it to my attention (I can't always be randomly scrubbing the bug database for blog topics!).

For the past several years there have been some topics that have come up over and over again. One of those topics is genitive dates, where a number of earlier blogs have brought up various issues.

Now in many of these blogs I have hinted at the underlying fact that a language with different spellings for the nominative and genitive forms of month names has actual grammatical rules that control when each form is expected.

Thus there are grammatical rules that help one determine whether one should be using

Сентябрь

or

сентября

for September when one is writing something out in Russian.

Now in many cases, indeed in most cases involving the date formats in question, the rules are almost simple enough that the code used in Windows and the .Net Framework can properly discern which one to use.

But not always in all of them....
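To make the idea concrete, here is a minimal sketch of that kind of decision. It is not the actual Windows or .Net code. It assumes one simple rule (use the genitive month name whenever a numeric day token, d or dd, appears in the pattern), and the names `format_date` and `has_day_number` are made up for illustration. Only the two Russian forms of September shown above are included.

```python
import re
from datetime import date

# Only September, using the two Russian spellings shown above.
# A real table would have twelve entries per form.
NOMINATIVE = {9: "сентябрь"}
GENITIVE = {9: "сентября"}

# Split a pattern into quoted literals, runs of d/M/y, or single characters.
TOKEN = re.compile(r"'[^']*'|d+|M+|y+|.", re.DOTALL)


def has_day_number(pattern: str) -> bool:
    """Return True if the pattern has a numeric day token (d or dd) outside quotes."""
    return any(tok in ("d", "dd") for tok in TOKEN.findall(pattern))


def format_date(d: date, pattern: str) -> str:
    """Expand a small subset of date pattern tokens.

    Assumed rule (a simplification, not the real implementation):
    MMMM becomes the genitive form if the pattern also has a day number.
    """
    names = GENITIVE if has_day_number(pattern) else NOMINATIVE
    out = []
    for tok in TOKEN.findall(pattern):
        if tok.startswith("'"):
            out.append(tok[1:-1])          # quoted literal, quotes stripped
        elif tok == "d":
            out.append(str(d.day))
        elif tok == "dd":
            out.append(f"{d.day:02d}")
        elif tok == "MMMM":
            out.append(names[d.month])
        elif tok == "yyyy":
            out.append(f"{d.year:04d}")
        else:
            out.append(tok)                # separators and anything unhandled
    return "".join(out)
```

Under this assumed rule, `format_date(date(2010, 9, 8), "d MMMM yyyy 'г.'")` gives the genitive "8 сентября 2010 г.", while `format_date(date(2010, 9, 8), "MMMM yyyy")` gives the nominative "сентябрь 2010". The rule looks only at which tokens are present, not at where the month sits in the phrase.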

One example where we do not do so well is Latvian (lv-LV, aka 0x0426).

We have the two different forms:

            nominative form   genitive form
January     janvāris          janvārī
February    februāris         februārī
March       marts             martā
April       aprīlis           aprīlī
May         maijs             maijā
June        jūnijs            jūnijā
July        jūlijs            jūlijā
August      augusts           augustā
September   septembris        septembrī
October     oktobris          oktobrī
November    novembris         novembrī
December    decembris         decembrī

That part is easy.

But then if you look at the default long date format for Latvian, it is:

dddd, yyyy'. gada 'd. MMMM

and GetDateFormat will change that into:

trešdiena, 2010. gada 8. septembrī

Now in this form, with the month name isolated that way, the genitive form of the month name is much less obviously what one might expect to be there (and I think it is technically grammatically incorrect). But really there are only four ways to "fix" this problem:

  1. Build a much more intelligent algorithm to determine which form to use, depending on the date format, or
  2. Remove the genitive forms entirely from the locale data, or
  3. Change the default long date format, or
  4. Add a new token for months and abbreviated months that means "use genitive form if it exists".

But each of these solutions has problems with it:

#1 is a hugely complicated work item that would be very unlikely to be done -- the rules may even be language specific! And as I have pointed out previously, they may also technically apply to day names when day names are part of the format!

#2 is great for this default long date format, but will cause the date to be wrong again if anyone customizes the format to be more like what locales like Russian use in terms of word order/placement ("d MMMM yyyy 'г.'").

#3 is great if the format is wrong, but if it is what people generally expect then it is not so great.

#4 is probably the most versatile of the solutions in terms of how much work it would be, but it adds whole new complicated work items in .Net, Windows, Regional Options, and the data of many locales that simply make it kind of unlikely (and it breaks anyone who would be depending on the existing behavior unless a new token was also added for "always use nominative form"). It is less work than #1, but it still probably ends up being too much work for anyone to do.
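For what it is worth, the shape of #4 can be sketched in a few lines. Everything here is hypothetical: the tokens GGGG ("use genitive form if it exists") and NNNN ("always use nominative form") do not exist in Windows or .Net, and the behavior of MMMM is an assumed simplification (genitive whenever a numeric day token is in the pattern). The Latvian names come from the table above; the English locale is there only to show the fallback when no genitive data exists.

```python
import re
from datetime import date

# Locale data for September 2010 only. date.weekday() counts Monday as 0,
# so key 2 is Wednesday.
LV = {
    "nominative": {9: "septembris"},
    "genitive": {9: "septembrī"},
    "days": {2: "trešdiena"},
}
EN = {
    "nominative": {9: "September"},   # no separate genitive forms
    "days": {2: "Wednesday"},
}

# GGGG and NNNN are hypothetical tokens invented for this sketch.
TOKEN = re.compile(r"'[^']*'|d+|M+|N+|G+|y+|.", re.DOTALL)


def month_name(locale: dict, month: int, tok: str, tokens: list) -> str:
    nominative = locale["nominative"]
    genitive = locale.get("genitive") or nominative   # fall back if absent
    if tok == "NNNN":      # hypothetical: always nominative
        return nominative[month]
    if tok == "GGGG":      # hypothetical: genitive if it exists
        return genitive[month]
    # MMMM keeps the existing behavior, assumed here to be
    # "genitive when the pattern has a numeric day token".
    has_day = any(t in ("d", "dd") for t in tokens)
    return (genitive if has_day else nominative)[month]


def format_date(d: date, pattern: str, locale: dict) -> str:
    tokens = TOKEN.findall(pattern)
    out = []
    for tok in tokens:
        if tok.startswith("'"):
            out.append(tok[1:-1])                      # quoted literal
        elif tok == "d":
            out.append(str(d.day))
        elif tok == "dddd":
            out.append(locale["days"][d.weekday()])
        elif tok in ("MMMM", "NNNN", "GGGG"):
            out.append(month_name(locale, d.month, tok, tokens))
        elif tok == "yyyy":
            out.append(f"{d.year:04d}")
        else:
            out.append(tok)                            # separators
    return "".join(out)
```

With this sketch, the existing Latvian pattern "dddd, yyyy'. gada 'd. MMMM" still yields "trešdiena, 2010. gada 8. septembrī" for 8 September 2010, so nobody depending on the old behavior is broken. Swapping MMMM for the hypothetical NNNN yields "trešdiena, 2010. gada 8. septembris". The parsing change is the easy part, and the sketch says nothing about the real cost: every default format in every affected locale, plus the UI that edits them, would have to learn about the new tokens.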

The question is a hard one.

I suppose ultimately #2 may be the "best" answer of these four, though when I say that sentence aloud the word best leaves a rather bad taste in my mouth.

In the meantime, Latvian isn't any less broken, and who knows if this problem applies to other locales (especially when the format changes). :-(


John Cowan on 9 Sep 2010 8:33 AM:

Mojibake alert!  There is no ð (eth) in Latvian orthography.  In fact the word for 'Wednesday' is 'trešdiena'.

I'll comment here rather than on the other posts.

Sami: There are actually six functioning Sami languages, five of which are listed in your post; Kildin Sami is presumably omitted because it's written in Cyrillic.  Speaking of "the Sami language" is a misnomer, and people who do it usually refer to Northern Sami only, which is far and away the biggest and the most thriving of the group (20,000 Northern Sami speakers, 2000 Lule Sami speakers, and a few hundred for the other four; three more Sami languages have 10-20 older adult speakers only).

Inflected weekdays:  The point is that the name of a weekday is a noun, and in languages that inflect nouns for case, it may well be inflected too.  So if you need to say "from Monday, April 21 to Thursday, April 24", the words for "Monday" and "Thursday" may well need appropriate case inflections put on them, with or without prepositions or postpositions.  But of course no multilingual API can reasonably do that, as there are too many different ways to use dates in sentences and too many conflicting and overlapping case systems.

By the same token, a numeric API might know that 1000 is spelled "thousand" in English, but it's not going to be able to deal with the fact that we say "a thousand men" and "five thousand men" but "thousands of men" in running text.  Older varieties of English would say "a thousand of men" and "five thousands of men", and this hung on till the 19th century with "million" and larger number words: an 1829 grammar is still saying that "a million men" is un-English, whereas both "a thousand of men" and "a thousand men" are acceptable.

Michael S. Kaplan on 9 Sep 2010 8:39 AM:

Oops -- problem with the tool I grabbed stuff out of. :-)

Clearly that full situation cannot be handled, but a better job is theoretically possible for month names in date formats, given their nature. Certainly the 90% case could be hit with method #4, for example. But it is unlikely that can really happen....

Michael S. Kaplan on 9 Sep 2010 8:40 AM:

For Sami, I have talked about why we don't support the Cyrillic Samis (and why there are nine Sami locales in Windows) in the past....

John Cowan on 9 Sep 2010 9:54 AM:

You have?  I couldn't find any references to Cyrillic Sami locales with either the blog search or site-limited Google search.

Michael S. Kaplan on 9 Sep 2010 2:22 PM:

Oh, I hint at it here, but I guess I never talked about it. Ok, look for an upcoming blog on Sami!

Alex Cohn on 13 Sep 2010 12:51 PM:

I guess the trouble of #4 is exaggerated slightly. Anyway, the long date format should, IMHO, depend first and foremost on the current input language, and not on the "regional settings". When I am writing in English, the month names should naturally be January, February, etc. But when I want to insert a timestamp (it's only F5 away) into my note written in Hebrew, I expect the system to use Hebrew names of the months, if not the Hebrew dates altogether (but that's a different story).

Well, people have problems with the date format in plain English: www.xpheads.com/.../160540-notepad-timestamp-yyyy-mm-dd-hh-mm.html.

Furthermore, the DD-MMMM-YYYY or whatever 'format' which is relevant for one language, makes no sense in another. In English, I would choose September 13, 2010 while in Russian it must be 13 сентября 2010 года (yes, genitive for month name and for the word "year"). Such 'cultural' setting is pretty stable, and could be hardcoded with other keyboard preferences. It may be easy (for somebody who speaks the language) to choose the nominative or genitive form from a drop-down box, without ever worrying about the underlying tokens. And for the happy users of English or Hebrew, where no separate genitive forms exist, the drop-down choice will simply be shorter.

Oh, and this way you could use the same UI to easily choose between יום ג'‏ and יום שלישי, which are both legitimate ways to express Tuesday. Or Tue and Tuesday. Or even between ספטמבר and תשרי.

Which reminds me to wish you happy New Year. שנה טובה!

Michael S. Kaplan on 13 Sep 2010 12:54 PM:

Well, this is an interesting model, but not the one that Windows ever uses....


referenced by

2011/10/14 Improving genitive. Or not.... (part 1)

2010/09/15 How to format? What locale?

2010/09/13 Olive, the other reindeer, gets to Sort it all Out too....
