by Michael S. Kaplan, published on 2010/12/04

You may have heard of Microsoft's Policheck before. If not, there are mentions of it all over this blog site from many people, and on the web as a whole. Its existence isn't a secret.

It is a powerful tool that is used to help flag words that a customer might consider offensive if they were to come across them while using a Microsoft product, reviewing source code, or whatever. Sometimes teams even use it to find other kinds of errors (like here for example).

Now one of the interesting problems that comes up is what the title talks about: the fact that a swear word in one language or culture might mean something else when used in another context (another language, another culture, another country, another community).

There was a piece of the Hitchhiker's trilogy that had a little fun with one aspect of the idea:

"Share and Enjoy!" is, of course, the company motto of the hugely successful Sirius Cybernetics Corporation Complaints division, which now covers the major land masses of three medium sized planets and is the only part of the Corporation to show a consistent profit in recent years.

The motto stands—or stood—in three mile high illuminated letters near the Complaints Department spaceport on Eadrax: Share and Enjoy. Unfortunately its weight was such that shortly after it was erected, the ground beneath the letters caved in and they dropped for nearly half their length through the underground offices of many talented young complaints executives—now deceased. The protruding upper halves of the letters now appear, in the local language, to read “Go stick your head in a pig”, and are no longer illuminated, except at times of special celebration.

But on a more serious note, as the site points out -- the text is quoted below, though some of its words are ones that this blog's profanity filter will star out in a ton of places (see the link for the actual text), and the filter's incomplete coverage of other languages' curse words is funny on its own:

The situation is rendered more complex when other languages enter the picture. *** in French, and Scheiße in German (both usually translated as ***) are also quite common. It is also interesting to note that while German and other languages' profanity seems to focus on precipitation, English seems to have an issue with sexuality in this respect. Likewise, in European Spanish, coño (usually translated as *** in English) is very common in informal spoken discourse, meaning no more than "Hey!" or "Christ!"

Some scholars have noted that while the French and Spanish are comfortable hearing native speakers use these words, they tend to hear the "stronger" meaning when the same words are spoken by non-native speakers. This may be similar to the differences in the acceptability of *** or *** depending on who is saying the words. Or it may be an example of how it is easier to learn swear words in a new language or dialect than to learn the fine shades of intensity which accompany their use.

A profane word in one language often sounds like an ordinary word in another. *** sounds like the French words for seal (phoque) and jib (foc), as well as the Romanian word for do (I do = eu fac); *** sounds like the Russian for "to sew". Even names in one language may appear as vulgar words in another linguistic community, which causes many immigrants to change their names (common Vietnamese personal names include Phuc and Bich). A particular coincidence is the Hungarian and Spanish words for curve: Spanish curva sounds like a Slavic and Hungarian kurva meaning "prostitute", and Hungarian kanyar sounds like coño, mentioned above. In Romanian curva means "prostitute". See another example in Laputa. Additionally, *** is genitive and accusative case of two often used words in south Slavic languages; but in Portuguese, means "prostitute", and filho da - is an offensive word, similar to son of a b****.

Now back to Policheck.

The way it, or any similar tool, has to do its work is the same way that Word's spellcheckers support different locales: different source dictionaries are needed for these different contexts.

Because one person's swear word is another person's one and only correct word to use.
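To make the dictionary-per-locale idea concrete, here is a minimal sketch. It is not how Policheck itself works (I am not showing its internals); the `TERMS_BY_LOCALE` table, the locale names chosen, and the `flag_terms` function are all made up for illustration, using the curva/kurva coincidence quoted above.

```python
import re

# Hypothetical per-locale term lists, for illustration only.
# A real tool would ship far larger, curated dictionaries.
TERMS_BY_LOCALE = {
    "es-ES": set(),        # "curva" is simply "curve" in Spanish
    "ro-RO": {"curva"},    # the same spelling is flagged in Romanian
    "hu-HU": {"kurva"},    # and a sound-alike is flagged in Hungarian
}

def flag_terms(text, locale):
    """Return the words in text found in the term list for locale."""
    terms = TERMS_BY_LOCALE.get(locale, set())
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word in terms]

print(flag_terms("La curva es suave", "es-ES"))  # []
print(flag_terms("La curva es suave", "ro-RO"))  # ['curva']
```

The same string is clean under one dictionary and flagged under another, which is the whole reason the tool has to be told which context it is looking at.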

Such a thing adds interesting challenges to things like Microsoft's locale data, which ends up a lot like the above paragraph, since by its very nature it is a collection of data for many different cultures where each one has:

There is also data that is localized into the languages that Microsoft localizes into, but that is stored elsewhere and doesn't have the same kinds of problems, since tools like Policheck know exactly what language to use each time they are run against a huge set of localized data.

But getting back to the locale data -- to get the best coverage, it either has to have targeted runs on specific subsets of the data, or it has to be run so widely (with every dictionary on all data) that false positives are pretty much guaranteed.

Add to that the fact that many terms are caught specifically because someone might use them incorrectly (e.g. Taiwan or Macedonia), but the locale data contains those terms carefully and deliberately -- which means more false positives.
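The trade-off between targeted and wide runs can be sketched in a few lines. Again, this is only an illustration under my own assumptions: the `TERMS` table, the `RECORDS` sample, and the `targeted_run` and `wide_run` functions are hypothetical and are not taken from Policheck or from the real locale data.

```python
import re

# Hypothetical term lists and locale-tagged strings, for illustration only.
TERMS = {"es-ES": set(), "ro-RO": {"curva"}, "hu-HU": {"kurva"}}
RECORDS = [
    ("es-ES", "curva de nivel"),
    ("ro-RO", "linie dreapta"),
    ("hu-HU", "kanyar"),
]

def scan(text, terms):
    """Return the sorted set of flagged words found in text."""
    return sorted({w for w in re.findall(r"\w+", text.lower()) if w in terms})

def targeted_run(records):
    """Check each record only against the dictionary for its own locale."""
    hits = []
    for locale, text in records:
        found = scan(text, TERMS.get(locale, set()))
        if found:
            hits.append((locale, text, found))
    return hits

def wide_run(records):
    """Check every record against every dictionary at once."""
    all_terms = set().union(*TERMS.values())
    hits = []
    for locale, text in records:
        found = scan(text, all_terms)
        if found:
            hits.append((locale, text, found))
    return hits

print(targeted_run(RECORDS))  # []
print(wide_run(RECORDS))      # [('es-ES', 'curva de nivel', ['curva'])]
```

The wide run reports a perfectly ordinary Spanish string, which is exactly the kind of false positive that someone then has to review by hand.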

It is still worth doing the work, though, because offending customers has obvious disadvantages when it comes to their overall satisfaction with the product.

Of course the biggest "violator" of terms according to this tool, which has every inappropriate word in exactly the worst possible context to cause offense, is the tables of data in Policheck itself. Given the fact that every single term in it would be flagged, if one were to run the tool on its own source and data, one would be swamped with reports of problematic words. I imagine it still has to be run, to check for unexpected violations (they just have to ignore the huge number of expected cases!).
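One way to ignore the expected cases while still catching the unexpected ones is to keep a baseline of known hits and report only what goes beyond it. I have no idea whether the Policheck team does it this way; the sketch below is just one plausible approach, and the file names and the `unexpected_hits` function are invented for the example.

```python
from collections import Counter

def unexpected_hits(hits, expected):
    """Return the hits left over once the expected ones are subtracted.

    Both arguments are lists of (file, term) pairs. Counter subtraction
    drops anything whose count falls to zero or below, so an expected
    hit that is no longer present does not show up as a negative.
    """
    remaining = Counter(hits) - Counter(expected)
    return sorted(remaining.elements())

# Hypothetical baseline: every entry in the tool's own term table.
expected = [("terms.dat", "curva"), ("terms.dat", "kurva")]

# Hypothetical scan result: the baseline plus one surprise in the source.
hits = expected + [("scanner.py", "kurva")]

print(unexpected_hits(hits, expected))  # [('scanner.py', 'kurva')]
```

Counting the pairs rather than just collecting them in a set means a second, unplanned occurrence of a term in an already-expected file would still be reported.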

But as challenges go, it is quite unique....

Jens on 7 Dec 2010 3:22 AM:

Nice one, Michael! False positives can be really annoying, though, like EnumFontFamilie***, which I saw in some coder's forum that was probably not using Policheck. Makes web searches a bit more challenging.

Michael S. Kaplan on 7 Dec 2010 2:09 PM:

Did you do those *'s, or did the site censoring filter?

TEST: EnumFontFamilieSex

