The letters in a language....

by Michael S. Kaplan, published on 2004/12/10 02:15 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/10/279398.aspx

Earlier today, I got a question from a reader of Jeppe's Weblog, asking whether there is a method, API, or resource that can be used to get all of the letters in a language.

There is usually a very good reason for asking -- someone might be trying to create a phone book or filing system that uses the letters of the language as its "buckets", or trying to do some similar kind of grouping. It does not seem to be an unreasonable request, and it is roughly what this person was asking for.

The answer is an unfortunate one -- there is no way to retrieve this information. It simply does not exist in any of the NLS data.

But when one thinks about the question itself, it's not as easy as it seems, especially when one considers how often loan words from another language include letters or diacritics one would not otherwise see. Would one simply not file those entries? A phone book full of names obviously cannot ignore a name that starts with a letter that has no clear bucket to go under. This is the kind of issue that really blocks how such a feature would be implemented -- especially when one considers how many characters in Unicode are not covered by a language compared to how many are....

Of course, the needs of Windows and the .NET Framework have to be pretty generic, which makes the problem harder. In the meantime, if an individual application or market is important to you, you can work with a native speaker of the language to find out the sensible letter groupings that they would expect in the application(s) in question. A generic API to handle any language is most likely not required (and might give too much information anyway if it existed).
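As a rough illustration of that application-level approach, here is a minimal Python sketch of bucketing names against a hand-supplied letter list. Everything in it is an assumption made for illustration: the function names, the "#" catch-all bucket, the fallback of stripping diacritics to find a base letter, and the two sample letter sets, which are placeholders for whatever a native speaker would actually recommend.

```python
import unicodedata
from collections import defaultdict

# Illustrative letter sets only; a real application would get these
# from a native speaker, not from this sketch.
SAMPLE_CATALAN = set("ABCDEFGHIJKLMNOPQRSTUVWXYZÇ")
SAMPLE_SPANISH = set("ABCDEFGHIJKLMNOPQRSTUVWXYZÑ")


def bucket_for(name, letters, other="#"):
    """Return the bucket label for a name, or `other` if none fits."""
    stripped = name.strip()
    if not stripped:
        return other
    first = stripped[0].upper()
    if first in letters:
        return first
    # Fallback (a design choice, not a rule): decompose the character
    # and try its base letter, so that "Ñ" can land under "N".
    base = unicodedata.normalize("NFD", first)[:1]
    return base if base in letters else other


def group_names(names, letters):
    """Group names into a dict of bucket label -> list of names."""
    buckets = defaultdict(list)
    for name in names:
        buckets[bucket_for(name, letters)].append(name)
    return dict(buckets)
```

With these sample sets, "Ñíguez" is filed under "N" for the first set and under "Ñ" for the second, while a name such as "Ødegaard" has no decomposition to fall back on and ends up in the catch-all bucket. Whether that fallback is acceptable is exactly the kind of question to put to a native speaker.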

Jonathan: My best answer would be the titles in a vocabulary.

For example, we had a discussion with Mark Davis of Unicode about Punjabi. I was arguing that since the letters (under my definition) begin with ੳ ਅ and ੲ, and that the normal vowel signs like ੁ can be used to make the more widely used ਉ letters, then the canonical order should reflect this... Of course the result would be horribly bad when it came to performance; it certainly was a devil's advocate position.

Also, there is a hot debate about something in CLDR (Locale database, http://www.unicode.org/cldr; sorry I do not know how to do a nice looking <A> link) called "exemplarCharacters". In French, it took us a long time to convince the Americans about the status of ü or ÿ. In Catalan (a minority language in Spain), ñ is not part of the official set; however, there are a lot of people living in Catalonia whose last name, of (Castilian) Spanish origin, has a ñ; so as a result, the letter has a good frequency in things like phone books.
Similarly, a recent move said that Catalan can now be used to record names (Registro civil). But the software used does not accept · (midpoint, a unique feature of the Catalan language) into names...

12/10/2004 6:16 AM Uwe Keim

> Console.Write( (char)i );

That fails badly, in a programming language where char can only hold a byte instead of a character, on a machine where a byte cannot store all values up to 65535. Know of any? ^_^

Another idea might be to enumerate over a code page, for each byte value trying both that single-byte value and enumerating all possible double-byte values, but I'm not sure if that works. A subset of such an enumeration surely would not work unless you already know the characteristics of the code page.

Another idea might be to enumerate all known Unicode characters and try to convert each to a code page character, but we've seen information in other postings showing that this would miss some valid code page characters.
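Both enumeration ideas can be sketched in Python for a single-byte code page. This is only an illustration under stated assumptions: it uses Python's own codec tables (here "cp1252"), not the Windows NLS conversion functions, so it cannot show the mismatches mentioned in those other postings, and it does not attempt the double-byte case at all. The function names are made up for this sketch.

```python
def chars_from_bytes(codec):
    """Decode each single byte value and keep what the codec accepts."""
    found = set()
    for value in range(256):
        try:
            found.add(bytes([value]).decode(codec))
        except UnicodeDecodeError:
            # Undefined byte, or a lead byte of a multi-byte sequence.
            pass
    return found


def chars_from_unicode(codec):
    """Try to encode every Unicode code point and keep the successes."""
    found = set()
    for code_point in range(0x110000):
        ch = chr(code_point)
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue  # not representable (this includes lone surrogates)
        found.add(ch)
    return found


by_bytes = chars_from_bytes("cp1252")
by_unicode = chars_from_unicode("cp1252")
# Characters reachable from one direction but not the other, if any.
only_from_bytes = by_bytes - by_unicode
```

Comparing the two sets is the interesting part: any character that shows up in one but not the other is a sign that the conversion tables in use are not a clean round trip.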

12/22/2004 11:16 AM Antoine

> Similarly, a recent move said that Catalan
> can now be used to record names (Registro
> civil). But the software used does not
> accept · (midpoint, a unique feature of the
> Catalan language) into names...

That unique feature of the Catalan language is commonly used in Japanese. In ordinary Japanese text it sort of means "and/or". In the name of a company containing a lot of kana (occasionally), a foreign person's name (frequently), or a transliterated foreign phrase, it simply separates words.

In official forms with family name written in one box and given name written in another box, there isn't much need for that unique feature of the Catalan language. But if a foreigner with more than one given name has some need to write more than one in the box for given name, then the separator might be appropriate.

With software it's pretty much random. Ordinary software of course accepts it because it's a character. Software that validates a person's name against a limited character set, one the developer chose for some unknown reason or no reason, often does not accept the separator.

In an extreme example of the latter, recently I made a submission through a company's web page, and they required both the ordinary written name (mine is in katakana) and pronunciation in hiragana only. It kept rejecting my pronunciation, so I finally entered the hiragana for fukanou, meaning impossible, instead of the pronunciation of my name, and the web site accepted that. Later I guessed that it was rejecting the character that looks like a dash and means double-length pronunciation of a syllable, because that character is not usually used in hiragana. Sheesh, that sure is part of the Japanese pronunciation of my name.
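To make that guess concrete, here is a small Python sketch of a "hiragana only" check with and without the long-vowel mark ー (U+30FC, which sits in the Katakana block). It is purely hypothetical: nothing is known about how that web site actually validated its input, and the function names and the exact ranges accepted are assumptions made for this sketch.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー, encoded in the Katakana block


def is_hiragana(ch):
    """True for hiragana letters and the hiragana iteration marks."""
    return "\u3041" <= ch <= "\u3096" or ch in "\u309d\u309e"


def is_valid_reading(text, allow_long_vowel=True):
    """Check a pronunciation field; optionally accept the long-vowel mark."""
    if not text:
        return False
    return all(
        is_hiragana(ch) or (allow_long_vowel and ch == PROLONGED_SOUND_MARK)
        for ch in text
    )
```

A strict check (allow_long_vowel=False) accepts ふかのう but rejects a reading such as まーく, which matches the behavior described above; the lenient default accepts both.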