by Michael S. Kaplan, published on 2006/02/05 03:11 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/05/525052.aspx
The other day, when I talked about how I was Approaching linguiticalishnessality, a comment from Thierry Fontenelle of the MS Speech and Natural Language group pointed out that the quarterly symposium in computational linguistics held by Microsoft and the University of Washington was about to happen.
It did indeed happen this last Friday, and there were two fascinating talks: one on Unsupervised Acquisition of Ateso Morphology and the other on Locating, Recognizing, and Converting Interlinear Text on the Web.
I thought I'd say a few words about each. :-)
The second talk, given by William Lewis (a Visiting Assistant Professor, Dept. of Linguistics, UW), covered a very interesting project that is trying to create a searchable database of interlinear text, a common format for linguistic examples. An example of such text (this one borrowed from the excellent talk I saw the week before from Rachel Hastings) is:
Llama-kuna urqu-pi ka-n
llama-PL mountain-LOC be-3sg
'There are llamas in the mountains'
Linguists are likely used to seeing this format, known as Interlinear Glossed Text (IGT). The 'gloss' refers to that middle line, the one with the grammatical tags.
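The structure of an IGT example is simple enough to work with programmatically. As a minimal sketch (my own illustration, not anything from the ODIN project), here is how the morpheme line and the gloss line of the example above can be paired up, assuming the two lines are whitespace-aligned token for token:

```python
# Sketch: pairing an IGT language line with its gloss line.
# Assumes one gloss token per word, as in well-formed IGT.

igt = [
    "Llama-kuna urqu-pi ka-n",               # language line (morphemes)
    "llama-PL mountain-LOC be-3sg",          # gloss line (grammatical tags)
    "'There are llamas in the mountains'",   # free translation
]

def align_gloss(language_line: str, gloss_line: str):
    """Pair each word with its gloss token; raise if the lines don't line up."""
    words = language_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("word/gloss counts differ; lines are not aligned")
    return list(zip(words, glosses))

pairs = align_gloss(igt[0], igt[1])
# [('Llama-kuna', 'llama-PL'), ('urqu-pi', 'mountain-LOC'), ('ka-n', 'be-3sg')]
```

Of course, real examples found on the web are far messier than this tidy triple, which is exactly the problem the project is tackling.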
It is obviously not so easy to find the large number of them available on the internet with a simple MSN Search or Google query, so the ODIN (Online Database of INterlinear text) project is an attempt to develop methodologies to find these IGT examples, identify their language from the surrounding text, and catalog them so they can be easily searched later.
At this point ODIN has a very low false positive rate when detecting IGT it can catalog, but at a high cost in false negatives (i.e., many valid cases are thrown away in order to be certain that the detected cases are definitely valid).
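To make that precision/recall trade-off concrete, here is a toy heuristic of my own (not ODIN's actual method): guess that a line is a gloss line only if most of its tokens carry a hyphenated grammatical tag like PL, LOC, or 3sg. A strict threshold keeps false positives rare while inevitably discarding many valid but messier examples:

```python
import re

# Toy IGT gloss-line detector -- an illustration, NOT ODIN's real algorithm.
# Matches hyphenated tags such as -PL, -LOC, or -3sg.
GLOSS_TAG = re.compile(r"-(?:[A-Z]+|[0-9](?:sg|pl|du))\b")

def looks_like_gloss(line: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the tokens carry a gloss-style tag.

    A high threshold favors precision (few false positives) over recall
    (many valid gloss lines are missed), mirroring ODIN's trade-off.
    """
    tokens = line.split()
    if not tokens:
        return False
    tagged = sum(1 for t in tokens if GLOSS_TAG.search(t))
    return tagged / len(tokens) >= threshold

print(looks_like_gloss("llama-PL mountain-LOC be-3sg"))        # True
print(looks_like_gloss("'There are llamas in the mountains'")) # False
```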
One thing that I found very interesting (beyond the general negative feelings about how the PDF format has made such searches difficult -- more on this another day!) was the great lengths the project has to go to in order to find what is obviously a recognizably 'standard' way to represent information, probably because there is no widely used, standard way of producing these examples (beyond creative use of the space bar to line up text, I mean).
It struck me that features within products like Word could make it easier to regularize these things. In fact, with Word in Office 12 supporting PDF, there may be an awesome opportunity to make many of these issues easier to solve.
The first talk, given by Manuela Noske, a Software Localization Engineer for Windows, described a fascinating attempt to analyze a corpus of ~460,000 alphanumeric tokens of Ateso, an Eastern Nilotic language, using Linguistica v.2.0.4.
Manuela was honest about the fact that the results were not all that had been hoped for, in large part due to the lack of standardization in much of the language in terms of case markers, spelling, and several other morphological and phonemic issues. This is especially interesting given the huge efforts being made in Uganda to get the language online -- the lack of standardization would seemingly hinder efforts to understand Ateso usage more than a little bit!
Since most of the corpus is actually made up of Ateso periodicals, factors such as periodical house styles or even individual author preferences cannot be discounted. When I spoke with Manuela after the talk, she suggested that these issues are definitely possible avenues for getting around the limitations that Linguistica v.2.0.4 showed for such a language (more input from native speakers would also undoubtedly help in weighing the importance of the differences).
During Q&A after the talk, one person pointed out that tools like spell checkers, when widely used, often have a significant effect on such problems, since they effectively enforce a standard on a language that is clearly struggling through such variations.
That idea scares me a little, because a spell checker with that much influence makes any mistake it contains scary, too. I mean, it is an awesome responsibility to know that a mistake will lead to misspelled words in school book reports, but that is nothing compared to the effect a mistake could have on a language struggling to find its own proper usage!
In any case, both talks were very interesting and I got to have several conversations after the talks, too. I don't know that I will be able to go to every talk that happens, but I will definitely try to attend them when I can. :-)
This post brought to you by "ƀ" (U+0180, a.k.a. LATIN SMALL LETTER B WITH STROKE)