Syntax Savvy dictionaries?

by Michael S. Kaplan, published on 2005/10/14 03:28 -04:00, original URI:

Carlos asked (in the suggestion box):

I don't know if this falls within your purview, but it seems like it would at least be a tangent. I'd like to find out why the spell check dictionary couldn't be just a little more savvy about how language operates. For example, if I'm using Word and it doesn't have "Valkyries" in the dictionary, when I add it I should be able to specify the part of speech and any other forms of the word. So I'd be able to tell the dictionary that it's a plural noun and the singular is "Valkyrie". Ideally, such a system would be aware of how the language works and would know how the regular forms are constructed, leaving the user to fill in the irregular forms on their own.

Obviously, this would mean having a set of rules for each language, and some languages have many, many forms per word, but each language requires its own dictionary anyway, right?

Well, Carlos, I actually talked about an interface that exposes such rules to a consumer of information in the post IStemmer'ed the tide (or, Language-specific processing #2).

As that topic hints, it is a non-trivial task to handle this sort of thing per language, and clearly there is no simple way to include the rules in something like the Word custom dictionary files (which are basically text files). It would in fact require Word to be a consumer using an interface such as IStemmer and IWordBreak, and then of course word breakers and stemmers in the target languages to do the actual work.
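To make the gap concrete, here is a minimal sketch in Python of what a rule-aware custom dictionary entry might look like for English nouns only. Everything in it is hypothetical: `CustomEntry`, `regular_plural`, and `accepted_forms` are names made up for illustration, the three plural rules are a tiny fraction of what a real stemmer handles, and none of this reflects how Word or IStemmer actually works.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CustomEntry:
    """Hypothetical custom dictionary entry: more than a bare word in a text file."""
    lemma: str                      # base form, e.g. "Valkyrie"
    part_of_speech: str             # e.g. "noun"
    # Forms the user supplies because the regular rules would get them wrong.
    irregular_forms: Dict[str, str] = field(default_factory=dict)


def regular_plural(noun: str) -> str:
    """Apply a few regular English plural rules (an assumption, far from complete)."""
    lower = noun.lower()
    if lower.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if lower.endswith("y") and len(noun) > 1 and lower[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def accepted_forms(entry: CustomEntry) -> Set[str]:
    """Return every spelling the checker should accept for this entry."""
    forms = {entry.lemma}
    if entry.part_of_speech == "noun":
        # A user-supplied irregular plural wins over the regular rule.
        plural = entry.irregular_forms.get("plural", regular_plural(entry.lemma))
        forms.add(plural)
    return forms
```

With this sketch, adding "Valkyrie" as a noun would also accept "Valkyries", while an entry for "ox" would need the user to supply "oxen" as the irregular plural. Even this toy version shows why a plain word list cannot carry the information: the rules live in code that is specific to one language.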

It is easy to imagine this sort of thing already occurring and being responsible for the current dictionary file, but it is hard to see how we can add terms to that list -- not impossible, but it is certainly a complicated problem (one that, as far as I know, is not currently being solved).


This post sponsored by U+200d (ZERO WIDTH JOINER) and U+200c (ZERO WIDTH NON-JOINER)
