Getting the '実' out of word breakers

by Michael S. Kaplan, published on 2006/12/04 04:17 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/04/1203808.aspx

Raymond Hung's question, which came up while he was working on a SharePoint issue, was:

Hi all,

I can use some help on a Japanese search issue.

I am working on a Portal search issue where searching 実績 returns but 実 or 績 by itself does not.

After consulting with the product team, this is by design because (実績 JISSEKI) & substring of the word (実 JISS). In fact, the query is not a “single word”, but a part of a word. That is the reason they don’t have match by our word-based search.

I do not understand the substring part of the explanation. I thought 実 and 績 consider a single word.

Indeed, it is true that 実績 is a single word, which I believe means "result" (kind of ironic in this case, since the question amounts to saying "I don't understand SharePoint results here". :-)

Luckily, longtime international expert and now General Manager Chris Pratley chimed in with an explanation:

実 can be a word by itself, but then you should expect it to only hit on that usage. For example: 実はいらない。

実績 will be determined by a word breaker to be a compound word, so it would not match on 実 by design, just as “underground” is not found in a search on “under”.

Some search tools allow substring matching, where you could ask the search to match just by character. (in fact, unless a wordbreaker is available, this is all you can do). Since many low-tech search tools have not historically had wordbreakers, customers might expect a character match (which is less precise in that it returns many more hits than are wanted usually, but it won’t miss any occurrence of that character).

The bit of technology that is doing the heavy lifting here is the Japanese word breaker, which is not only used by SharePoint but by the full text search engines of Index Server, Exchange Server, and SQL Server.
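
The difference Chris describes between word-based matching and character matching can be sketched in a few lines of Python. This is only a toy under stated assumptions: the mini-lexicon `KNOWN_WORDS` and the helpers `break_words`, `build_index`, and `search` are hypothetical names made up for illustration, and the greedy longest-match loop is a stand-in for a real word breaker, not how the Japanese word breaker in SharePoint actually works.

```python
# Toy illustration only. A real word breaker relies on a full dictionary and
# grammar rules; this hard-coded lexicon is an assumption for the example.
KNOWN_WORDS = ["実績", "実", "は", "いらない"]

def break_words(text):
    """Greedy longest-match segmentation against KNOWN_WORDS."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest known words first so 実績 wins over 実.
        for word in sorted(KNOWN_WORDS, key=len, reverse=True):
            if text.startswith(word, i):
                tokens.append(word)
                i += len(word)
                break
        else:
            # Unknown character (punctuation, etc.): emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents indexed under the exact query token."""
    return sorted(index.get(query, set()))

DOCS = ["実績", "実はいらない。"]

word_index = build_index(DOCS, break_words)  # language-aware tokens
char_index = build_index(DOCS, list)         # one token per character

print(search(word_index, "実"))   # [1]    only the stand-alone usage
print(search(char_index, "実"))   # [0, 1] every occurrence of the character
print(search(word_index, "実績"))  # [0]
```

With the word-based index, 実 only hits the document where it stands alone as a word; with the per-character index it hits both, which is the "less precise but won't miss anything" behavior described above.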

Although it is easy to want the search to find the piece within the "word" in this case, since it is a word itself, Chris rightly points out the reason why it doesn't. I'll extend it a bit and talk about the actual meaning of the words here per SysTran's BabelFish:

実 - truth
績 - weaving/grade
実績 - result

It is easy if one doesn't know Japanese to rebel at this a bit, but let's take the word understand in English:

under -- beneath or covered by, or below the surface of
stand -- (v) to rise to one's feet, (n) the act of standing
understand - to perceive the meaning of

One of the many factors that cause people to think that Google or Live.com returns more "relevant" results is that they generally won't find the word "understand" when one searches for either "under" or "stand", because a word like "understand" has an entirely unrelated meaning, even in cases where the etymology might link such words.

And it is the wordbreaker which is deciding to do things like considering "understand" to be an indivisible unit.

At that point its friend the word stemmer can make sure that understood, understanding, understands, and so on can also be found, given the word that the wordbreaker has identified.
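
As a rough sketch of how the two pieces cooperate, here is a toy in Python. Everything in it is an assumption for illustration: the `IRREGULAR` table, the `SUFFIXES` list, and the functions `break_words`, `stem`, and `matches` are hypothetical, and the suffix stripping is far cruder than any real stemmer (it will happily mangle words it does not know about).

```python
import re

# Assumed tiny exception table and suffix list; a real stemmer is far richer.
IRREGULAR = {"understood": "understand"}
SUFFIXES = ("ing", "s")

def break_words(text):
    """Toy English word breaker: lowercase runs of letters are words."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Reduce a word to a crude base form."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """The set of stemmed words a document would be indexed under."""
    return {stem(word) for word in break_words(text)}

def matches(query, text):
    """True if the stemmed query is one of the document's index terms."""
    return stem(query.lower()) in index_terms(text)

print(matches("understand", "She understood it"))           # True
print(matches("understanding", "Nobody understands this"))  # True
print(matches("under", "She understood it"))                # False
print(matches("stand", "Nobody understands this"))          # False
```

The word breaker decides that "understood" is one unit, and the stemmer then lets any inflected form of "understand" find it, while "under" and "stand" still do not match.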

The workaround for situations where this kind of result is not desired is easy, as Chris indicated: just make sure that the index is using a word breaker that has no understanding of the language, so that it has no knowledge of these compound words.

Though in most situations that is a step that is simply not necessary. People expect the results that the word breaker most attuned to the language will understand....

And that's the 実.

This post brought to you by 実 (U+5b9f, a CJK ideograph)

b6s on 4 Dec 2006 10:37 AM:

Chinese characters on searching/indexing have word segmentation issues.

For common full-text search situations, however, an index on each single Chinese character is enough, especially since modern full-text search engines usually also consider the distance between characters: if you search for 実績, texts that contain 実績 are more relevant than those that contain only 実 or 績, or both but not near each other.

So, my suggestion is, to index Chinese characters in Unicode and to use "other" modern full text search engine (sorry about that).

LesC on 4 Dec 2006 11:13 PM:

Are we being ethnocentric here?

As a native English-speaking linguist I see that 実 and 実績 are related at a "theoretical" level, but semantically they are different and a search for the former should not return the latter.

Two (I am guessing) "chinese" speakers are saying that they are closely related in meaning and therefore should be returned.

What would a Japanese speaker say?

As an aside 実 is one of those odd cases where its common stand-alone reading (JITSU) is the on-yomi usually used in compound readings, which perhaps strengthens the case for a mental matching. By comparison, would anyone suggest that a search for 米 ("kome"= rice) should match with 米国 ("Beikoku"= America)?
