Is a synopsis of a symposium a synopsium? :-)

by Michael S. Kaplan, published on 2006/02/05 03:11 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/05/525052.aspx


The other day, when I talked about how I was Approaching linguiticalishnessality, a comment from Thierry Fontenelle of the MS Speech and Natural Language group pointed out that the quarterly symposium in computational linguistics held by Microsoft and the University of Washington was about to happen.

It did indeed happen this last Friday, and there were two fascinating talks: one on Unsupervised Acquisition of Ateso Morphology and the other on Locating, Recognizing, and Converting Interlinear Text on the Web.

I thought I'd say a few words about each. :-)

The second talk, given by William Lewis (a Visiting Assistant Professor in the Dept. of Linguistics, UW), described a very interesting project that is trying to create a searchable database of interlinear text, a common format for linguistic samples. An example of such text (this one borrowed from the excellent talk I saw the week before from Rachel Hastings) is:

Llama-kuna   urqu-pi         ka-n

llama-PL     mountain-LOC    be-3sg

'There are llamas in the mountains'

Linguists are likely used to seeing this format, known as Interlinear Glossed Text (IGT). The 'gloss' refers to that middle line, the one with the tags.

It is obviously not so easy to find the large number of them available on the internet with a simple MSN Search or Google query, so the ODIN (Online Database of INterlinear text) project is an attempt to bring in methodologies to find these IGT examples, identify their language from the surrounding text, and catalog them so they can be easily searched later.

At this point ODIN has a very low rate of false positives when detecting IGT it can catalog, but at a high cost in terms of false negatives (i.e., many valid cases are thrown away in order to be certain that the detected cases are definitely valid).
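To make that precision-over-recall tradeoff concrete, here is a purely illustrative toy heuristic (my own sketch, not ODIN's actual algorithm): treat three consecutive lines as candidate IGT only when the middle line carries morpheme-gloss tags and the last line looks like a quoted free translation. A check this strict will miss plenty of real IGT, which is exactly the false-negative cost described above.

```python
import re

# Hypothetical detector, not ODIN's real method: require uppercase
# grammatical tags after a hyphen (PL, LOC, ...) or person/number
# markers (3sg, 2pl, ...) in the gloss line, and a quoted translation.
GLOSS_TAG = re.compile(r"-[0-9]?[A-Z]{2,}|[0-9](?:sg|pl)")
TRANSLATION = re.compile(r"^['\"].*['\"]$")

def looks_like_igt(lines):
    """Return True if a three-line block resembles interlinear glossed text."""
    if len(lines) != 3:
        return False
    source, gloss, translation = (line.strip() for line in lines)
    return bool(GLOSS_TAG.search(gloss)) and bool(TRANSLATION.match(translation))

example = [
    "Llama-kuna   urqu-pi         ka-n",
    "llama-PL     mountain-LOC    be-3sg",
    "'There are llamas in the mountains'",
]
print(looks_like_igt(example))  # → True
```

A gloss line using different tag conventions, or a translation without quote marks, would be rejected outright; the conservative pattern trades away recall for precision.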

One thing that I found to be very interesting (beyond the general negative feelings about how the PDF format has made such searches difficult -- more on this another day!) was the great lengths that the project has to go to in order to find what is obviously a recognizably 'standard' way to represent information, probably because there is no widely used, standard way of producing these examples (beyond creative use of the space bar to line up text, I mean).

It struck me that there ought to be features within products like Word that would make it easier to regularize these things. In fact, with Word in Office 12 supporting PDF, there may be an awesome opportunity to try to make many of these issues easier to solve.

The first talk, given by Manuela Noske, a Software Localization Engineer for Windows, described a fascinating attempt to analyze a corpus of ~460,000 alphanumeric tokens of Ateso, an Eastern Nilotic language, using Linguistica v.2.0.4.

Manuela was honest about the fact that the results were not all that had been hoped for, in large part due to the lack of standardization of much of the language in terms of case markers, spelling, and several other morphological and phonemic issues. This is especially interesting given the huge efforts being made in Uganda to get the language online -- the lack of standardization would seemingly hinder efforts to understand Ateso usage more than a little bit!

Since most of the corpus is actually made up of Ateso periodicals, factors such as periodical styles or even author preferences cannot be discounted. When I spoke with Manuela after the talk, she suggested that these issues are definitely possible avenues for getting around the limitations that Linguistica v.2.0.4 showed for such a language (more information from native speakers would also undoubtedly help in weighing the importance of the differences).

During Q&A after the talk, one person pointed out how items like spell checkers often had a significant effect on such problems when they are widely used, as they actually enforce a standard on a language that is clearly struggling through such variations.

That idea scares me a little, since a spell checker with that kind of influence makes any mistake it might contain scary. I mean, it is an awesome responsibility to know that a mistake will lead to improperly spelled words in school book reports, but that is nothing compared to the effect a mistake could have on a language struggling to find its own proper usage!

In any case, both talks were very interesting and I got to have several conversations after the talks, too. I don't know that I will be able to go to every talk that happens, but I will definitely try to attend them when I can. :-)

 

This post brought to you by "ƀ" (U+0180, a.k.a. LATIN SMALL LETTER B WITH STROKE)


# Peter on 5 Feb 2006 3:17 PM:

Re: "probably due to the fact that there is no widely used, standard way of producing them (beyond creative use of the space bar to line up text, I mean)."

There are software tools out there specifically intended for linguistic-analysis tasks, including preparation of IGT. For instance, a large proportion of linguists use an app called The Linguist's Shoebox, or its Unicode-enabled revision, The Field Linguist's Toolbox (http://www.sil.org/computing/catalog/show_software.asp?id=79). These apps will store your corpus of texts, store lexical data, store wordforms and morphological analyses, and facilitate the creation of interlinear-annotated texts (IGT).

BTW, the three-line format shown in the example -- analyzed practical orthography + morpheme gloss + free translation -- is probably the most widely used presentation format for IGT, but it is not the only one. Apps like Toolbox give the user flexibility in terms of how many lines of annotation and what kinds of annotation they want: practical orthography, phonetic transcription, morphemic analysis, part of speech, morpheme gloss(es) (maybe in more than one analysis language), lexeme gloss(es), ...

So, these apps allow the linguist to create IGT, and they will provide support for formatting these on a page so they can be printed out. A big problem, though, is that the linguist would like to put these samples into a Word doc or HTML page. This has a couple of major problems related to the layout: how do you format with nice proportional fonts and keep the stacks aligned? How do you deal with reflow if line lengths change (e.g. the margins change, or the line length is dynamic, as in HTML)?

Linguists often use fixed-pitch fonts to align stacks. When I wrote my MA thesis using Word for DOS, I used tabs to align stacks with tab stops every .2"; the format was dictated, so I didn't need to worry about changing line lengths. When I was last working with IGT (over 10 years ago now!), I used to use Word's formula fields to put the stacks into arrays: I wrote Word Basic macros that would create these. This worked great for reflow if the line length changed, though there were some shortcomings.
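The fixed-pitch approach can be mechanized rather than done with the space bar: pad every token to the widest token in its column, so the tiers line up. A minimal sketch (my own hypothetical helper, not anything from Shoebox or Toolbox; it assumes all tiers have the same number of tokens and that a fixed-pitch font will be used):

```python
def align_interlinear(tiers, pad=3):
    """Pad columns so tokens in all analysis tiers line up.

    The last entry is treated as the free translation and left untouched.
    Only meaningful when rendered in a fixed-pitch font.
    """
    rows = [t.split() for t in tiers[:-1]]          # tokenized analysis tiers
    columns = list(zip(*rows))                       # assumes equal token counts
    widths = [max(len(tok) for tok in col) + pad for col in columns]
    aligned = [
        "".join(tok.ljust(w) for tok, w in zip(row, widths)).rstrip()
        for row in rows
    ]
    aligned.append(tiers[-1])
    return aligned

tiers = [
    "Llama-kuna urqu-pi ka-n",
    "llama-PL mountain-LOC be-3sg",
    "'There are llamas in the mountains'",
]
for line in align_interlinear(tiers):
    print(line)
```

Of course this reproduces exactly the limitation Peter describes: the alignment evaporates under proportional fonts or reflow, which is why table-like structures (formula fields, arrays, or native IGT support) are the more robust answer.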

All these options are a pain, though; what the linguist really wants is a word processor and web browser that understands this kind of formatted information. Something like an ActiveX control that would allow embedding of (say) Toolbox data into a Word doc could work for the former scenario; obviously native support would be better, though it is a bit of a specialized format. For Web pages, linguists can work on standardizing an XML language for IGT (and there have been some efforts in this direction), but they might also need to pursue a W3C recommendation if they want to get native support in most browsers.

That's the problem of _creating_; for searching, ODIN and OLAC (the Open Language Archives Community) are probably the right solution: browsers aren't designed to search for pages with particular kinds of layout, but the OLAC catalog uses standardized metadata for exactly this kind of thing.

# Michael S. Kaplan on 5 Feb 2006 7:31 PM:

Hi Peter!

I totally agree with the parts I knew, and thank you for the parts I did not know (but do now!).

And if tools could write detectable metadata for the schema that search efforts like ODIN can hook into, then they would have much lower rates of false negatives, which is to say they would find a lot more IGT! :-)
