by Michael S. Kaplan, published on 2007/07/04 21:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/04/3693118.aspx
Somebody asked me the other day about a post from Language Log that Bill Poser wrote entitled Standardizing away the world's languages. They wanted to know whether the claims were true, and also whether I agreed with the post's conclusions (as he thought the mention of me near the end of the post might have been hinting at).
The post specifically mentioned that Open XML:
...does not follow ISO-639-3. Instead (section 2.18.52), it requires that languages be specified by means of two hexadecimal digits, e.g. 0x09 for English. That means that no more than 256 languages can be accomodated. The list of languages available is in the document referenced above on pp. 2531-2537 but for the two-letter hex codes you'll have to look elsewhere because Microsoft doesn't list them together with the languages. For some reason it gives a completely different set of non-hexadecimal codes ranging from 1025 to 58,380....
In short, the Open Document standard provides for all the languages in the world, while Open XML excludes the great majority. This isn't a matter of ignorance. Microsoft has employees like Michael Kaplan who are quite knowledgable about the world's languages and the technical issues that they raise, but business strategy comes first.
Let me start by saying that I don't read any such hint in this text -- the way I read it, Bill was just pointing out that he believes Microsoft "gets it" (and then gave me as an example of proof!) but suggested that perhaps a more base motive inspired the language in the standard, which would tend to imply that a more exclusionary posture was being applied.
(I will note in passing that the primary language ID portion of the LCID is actually ten bits, not eight, so the limit is not as sharp as the text above suggests, as I have mentioned previously in posts like Why do LCIDs skip around so much?)
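To make the bit arithmetic concrete, here is a minimal sketch of how an LCID splits apart. The masks and shifts follow my understanding of the Windows SDK macros LANGIDFROMLCID, PRIMARYLANGID and SUBLANGID; the function name split_lcid is made up for illustration.

```python
# Sketch only: the bit layout is assumed from the Windows SDK macros
# LANGIDFROMLCID, PRIMARYLANGID and SUBLANGID. split_lcid is a
# hypothetical helper, not a real API.

def split_lcid(lcid: int) -> dict:
    """Split an LCID into its language and sort components."""
    langid = lcid & 0xFFFF        # low 16 bits: the language identifier
    sort_id = (lcid >> 16) & 0xF  # next 4 bits: the sort identifier
    primary = langid & 0x3FF      # low 10 bits of the LANGID: primary language
    sublang = langid >> 10        # high 6 bits of the LANGID: sublanguage
    return {
        "langid": langid,
        "sort": sort_id,
        "primary": primary,
        "sublang": sublang,
    }


# 1033 is 0x0409: primary language 0x09 (English), sublanguage 0x01.
info = split_lcid(1033)
print(hex(info["primary"]), hex(info["sublang"]))

# Ten bits of primary language allow 1024 values rather than 256.
print(2 ** 10)
```

The same split applied to the two ends of the decimal range Bill quotes gives primary language 0x01 for 1025 and 0x0c for 58380, which is why a list of full LCIDs looks so unlike a list of two-digit hex codes.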
I feel vaguely complimented (and a little embarrassed!) by the kind hat tip. Though just as when Mark Liberman did the same thing, I have to admit that it would be quite easy for one part of Microsoft to get things while other parts do not -- I know that is true since much of my job is helping many people inside of Microsoft when they don't quite get it, yet. So one could easily claim that ignorance was in fact a cause for an unfortunate approach.
And I have not been afraid to point out such ignorance in the past. :-)
However, in reading the standard itself, here is what I found (looking at the contents of this link since the original link did not work), specifically in the big 38.6 MB part 4 document entitled Office Open XML Part 4 - Markup Language Reference.pdf - first on pp. 263-4:
22.214.171.124 lid (Language ID for Phonetic Guide)
This element specifies the language which shall be for this phonetic guide.
[Example: Consider a run of phonetic guide text which is using Japanese as it language. This constraint is
specified using the following WordprocessingML:
The lid property is ja-JP for the phonetic guide, so the phonetic guide is specified to be Japanese. end example]
val (Language Code) Specifies an ISO 639-1 letter code or 4 digit hexadecimal code for a specific language.
This code is interpreted in the context of the parent XML element.
[Example: Consider an object which shall specify the English(Canada) language. That
object would use the ISO 639-1 letter code of en-CA to specify this language. end example]
The possible values for this attribute are defined by the ST_Lang simple type (§2.18.51).
(And then that same text is repeated in other areas, like on pp. 565-6 in 126.96.36.199 lid (Date Picker Language ID), on pp. 1084-5 in 2.14.17 lid (Merge Field Name Language ID), and on p. 4166 in 188.8.131.52 ST_TextLanguageID (Chart Language Tag).)
And then there is the section they all refer to, which starts on p. 1754:
2.18.51 ST_Lang (Language Reference)
This simple type specifies that its contents will contain one of the following:
- A hexadecimal language code (ST_LangCode)
- An ISO 639-1 letter code plus a dash plus an ISO 3166-1 alpha-2 letter code (ST_String)
The contents of this language are interpreted based on the context of the parent XML element.
[Example: Consider a language code defined as follows :
<w:lang w:val="en-CA" />
This language is therefore specified as English (en) and Canada (CA), resulting in use of the English (Canada)
language setting. end example]
This simple type is defined as a union of the following types:
- The ST_LangCode simple type (§2.18.52).
- The ST_String simple type (§2.18.89).
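As a rough illustration of what that union means for a consumer, here is a minimal sketch that sorts a val attribute into one of the two forms. The regular expressions are my own assumptions based on the descriptive text quoted above ("4 digit hexadecimal code" and letter code, dash, letter code), not patterns copied from the schema, and classify_lang is a made-up name.

```python
import re

# Assumed patterns, derived from the descriptive text only; the real
# schema definitions of ST_LangCode and ST_String may differ.
HEX_CODE = re.compile(r"[0-9A-Fa-f]{4}")
LETTER_CODE = re.compile(r"[A-Za-z]{2}-[A-Za-z]{2}")


def classify_lang(value: str) -> str:
    """Return 'hex', 'letters' or 'unknown' for a language value."""
    if HEX_CODE.fullmatch(value):
        return "hex"
    if LETTER_CODE.fullmatch(value):
        return "letters"
    return "unknown"


for sample in ("en-CA", "ja-JP", "0409", "english"):
    print(sample, classify_lang(sample))
```

Note that the letters pattern here is the literal two-letter reading of the descriptive text; a string type in a union is not itself limited to that shape.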
Now 2.18.52 is a table of decimal LCID values, something that goes quite well with Office's habits, as I have pointed out previously. I find the list kind of confusing without the hex values and without the context of the source constants in winnls.h. But no matter: if the string type is used, the potential set is no longer bounded as LCIDs are. Instead it uses the locale/culture name system first provided to them by Windows and .NET, which fits in with RFC 4646 (Tags for Identifying Languages) and its sister RFC 4647 (Matching of Language Tags), as mentioned here.
I guess the descriptive text is pretty over-simplistic, since the RFCs are themselves more complicated than "An ISO 639-1 letter code plus a dash plus an ISO 3166-1 alpha-2 letter code" -- in fact, although the two-letter codes are preferred when they exist, the three-letter ISO 639 codes and even numeric region codes are both considered acceptable. I assume there is some problem with potential ISO standards referencing RFCs (though they do reference the precursor RFC 3066 elsewhere!) or even Windows SDK header files. So the language is a little clumsy given those kinds of restrictions or misunderstandings thereof, but there is little doubt what is being referred to; the intent is clear.
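To show how much wider the RFC-style names are than the two-letter description, here is a deliberately simplified sketch that handles only the language, optional script, and optional region subtags. It ignores variants, extensions, private use and grandfathered tags, so it is nowhere near a full RFC 4646 parser; the Tag type and parse_tag name are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified subset: 2-3 letter language, optional 4-letter script,
# optional region that is either 2 letters or 3 digits.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_tag(tag: str) -> Optional[Tag]:
    """Parse a language tag, or return None if it is outside the subset."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    return Tag(
        m["language"].lower(),
        m["script"].title() if m["script"] else None,
        m["region"].upper() if m["region"] else None,
    )


for sample in ("en-CA", "haw-US", "es-419", "sr-Latn-RS"):
    print(sample, parse_tag(sample))
```

Of the four samples, only en-CA fits the "two letters, dash, two letters" wording; haw-US uses a three-letter language code and es-419 a numeric region.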
Now given that the original document that Bill Poser pointed to is not there and the new set of documents does not fully match what he describes, it might be fair to say that somehow in the intervening months, the folks who didn't get this stuff can be said to be getting it much better now.
Which is never a bad thing.... :-)
This post brought to you by ṹ (U+1e79, a.k.a. LATIN SMALL LETTER U WITH TILDE AND ACUTE)