How to be un-international

by Michael S. Kaplan, published on 2006/01/08 07:31 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2006/01/08/510565.aspx


From Found in Translation is the post Filtering Out What You Don't Understand. Basically the following dialog is added:

I have to agree that the new Outlook 2003 SP2 option to filter out all email from a specific top level domain could be on a tab labeled Uninternational.

It is possible that they have talked themselves out of believing that since one could also filter the .us top level domain:

and the US-ASCII encoding:

But since most URLs from the US have no country-specific top level domain, and since most emails are in one of several possible encodings that the items on that list may or may not cover, this is not the most usable UI for filtering knowledgeably. Plus I don't think most people understand what items like Latin 3 or Latin 9 are, even if they cover their own languages.

I don't even want to get into the fact that they give an option to block .GB (United Kingdom) even though so much of the web from there is under ones like .CO.UK -- how did they build this list, anyway?

In any case, most people might blindly hit that "Select All" button and call it a day.

Two thumbs up for the sake of those who would like to ignore the world (though it looks like it will not even do that as well as the provincial might like). But two thumbs down for the sake of the rest of the world. At least it is not on by default....

(via a contact mail from Patrick Hall)

 

This post brought to you by "δ" (U+03b4, a.k.a. GREEK SMALL LETTER DELTA)


# Ben Bryant on Sunday, January 08, 2006 11:22 AM:

How absolutely inane.

# Ben Cooke on Sunday, January 08, 2006 12:11 PM:

Hmm. Interesting.

Those "Select All" buttons are crazy. Who on earth would want to block *all* encodings?

What intrigues me most, though, is that loads and loads of character encodings have the US-ASCII codepoints stuck on the front of them. Admittedly this doesn't come up much in business email, but on USENET I've often read messages from people in other countries with a native encoding selected but with the entire message written in English using the Latin alphabet. Does this option just block based on the Content-Encoding header, or does it actually analyse the content? In the latter case, I'd expect it to be an encoding *whitelist*, not blacklist, since some messages might well contain both (say) English and Russian, and I'm quite capable of reading the English part.

What is the motivation for that encoding-based blocking anyway? Presumably if spam started getting blocked based on the fact that it used some non-Latin encoding, spammers would just start using UTF-8 which isn't really safe to block because it could be anything. I send English messages in UTF-8 all the time.
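
To make the "US-ASCII stuck on the front" point concrete, here is a minimal Python sketch. The codec names are Python's own and the selection is only an illustration (it is not the list from the Outlook dialog); the helper name `same_bytes_as_ascii` is made up for this example.

```python
# Codec names are Python's; the selection is illustrative, not Outlook's list.
ENCODINGS = ["utf_8", "latin_1", "iso8859_3", "iso8859_15", "koi8_r", "euc_kr", "big5"]


def same_bytes_as_ascii(text, encoding):
    """Return True if `text` encodes to the same bytes as it does in US-ASCII."""
    return text.encode(encoding) == text.encode("ascii")


english = "Hello from USENET"

# An English-only message is byte-for-byte identical in each of these
# encodings, so the declared charset says nothing about readability.
results = {enc: same_bytes_as_ascii(english, enc) for enc in ENCODINGS}

for enc, same in results.items():
    print(enc, same)
```

So a message labelled with any of these encodings can still be plain English underneath, which is why blocking on the label alone looks risky.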

# Mike Dimmick on Sunday, January 08, 2006 12:22 PM:

Well, GB is my country's official ISO two-letter code. But when the United Kingdom of Great Britain and Northern Ireland (to give its full title) was added to the Internet, they weren't really thinking in terms of ISO country codes yet - I think the .uk addressing actually predates DNS! When someone pointed out the error, .gb was issued too. You very occasionally see .gb addresses - about six or seven years ago a friend of mine worked for DERA, the Defence Evaluation and Research Agency, an arm of the government, which used dra.hmg.gb addresses. They were eventually part-privatised into QinetiQ Ltd, with the remaining government functions becoming the Defence Science and Technology Laboratory. Neither of these organisations now has a .gb address, so you probably could block the GB domain entirely. It probably wouldn't get you anywhere though as it's unlikely that spammers would try to send from an effectively unused domain.

We also have the peculiarity of having almost entirely three-label DNS allocations - I think only the National Health Service operates a service on two labels (nhs.uk), although I notice that www.gov.uk redirects to www.direct.gov.uk. I don't think there are many other countries that follow this pattern (although I believe Japan and Taiwan do).

Our early entry to the Internet has led to some oddly concise DNS names for the early adopters, for example Edinburgh University has ed.ac.uk and Birmingham bham.ac.uk.

# Peter Ibbotson on Sunday, January 08, 2006 1:49 PM:

IIRC the UK bit is a hangover from when JANET email got reversed. I think my old email address at college was peteri@uk.ac.qmc.cs; for the most part you could convert a JANET address to Internet DNS by reversing it. It gets a bit more complicated when you add back in UUCP bang path stuff; googling suggests it would have been seismo!mcvax!ukc!qmc-cs!peteri
This web page http://www.michaelkaul.de/History/history.html has a timeframe and suggests that .uk was the first country to register.
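
The "reverse the labels" rule of thumb can be sketched in a few lines of Python. This is only the simple case described above; the function name `janet_to_dns` is hypothetical, and it makes no attempt to handle the exceptions or UUCP bang paths.

```python
def janet_to_dns(address):
    """Reverse the domain labels of a JANET-style address (simple case only)."""
    local, _, domain = address.partition("@")
    labels = domain.split(".")
    return local + "@" + ".".join(reversed(labels))


print(janet_to_dns("peteri@uk.ac.qmc.cs"))
```

Applying the same function to the result gives back the original form, since reversing twice is a no-op.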

# Michael S. Kaplan on Sunday, January 08, 2006 2:09 PM:

Note that this is not blocking character ranges -- it is blocking the encoding of entire emails. So blocking all will not block all emails, just all the ones in the list....

# Michael S. Kaplan on Sunday, January 08, 2006 2:27 PM:

For the .GB vs. .UK issue, it kind of shows the problem that the .US one has too -- that there is no effective way to actually block a lot of what is out there.

This seems to me like a not all-that-well implemented pair of ideas, all things considered.
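
A rough Python sketch of why a top level domain filter misses so much. The blocked set and the addresses are hypothetical, and this is a guess at the general idea of such a filter, not how Outlook implements it.

```python
BLOCKED_TLDS = {"gb", "us"}  # hypothetical user selection


def top_level_domain(address):
    """Return the last DNS label of the address's domain, lowercased."""
    domain = address.rpartition("@")[2]
    return domain.rpartition(".")[2].lower()


def is_blocked_by_tld(address, blocked=BLOCKED_TLDS):
    """True if the sender address ends in one of the blocked TLDs."""
    return top_level_domain(address) in blocked


for sender in ["someone@example.gb", "someone@example.co.uk", "someone@example.com"]:
    print(sender, is_blocked_by_tld(sender))
```

With "gb" and "us" selected, the .co.uk and .com senders still get through, which is most of the mail from both countries.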

# Gabe on Sunday, January 08, 2006 4:29 PM:

I agree that filtering by encoding is not too useful because most encodings include a Latin subset anyway and (at least in my case) foreign spam is usually in utf-8 anyway. I think that much like international domain names, it should probably block scripts you don't understand because that's more specific than encoding.

People blast Russian and Korean spam all over the world. Of course it may not be Korean, because I can't read null glyphs very well. If I can't even read it, why would I not want it filtered automatically?

Also, there are a lot of people who have no need for international communication via the Internet (my mother, for example). If there's an inordinate amount of spam coming from China (and there is), it makes sense that she should be able to filter it.

# Mike Dunn on Sunday, January 08, 2006 4:43 PM:

I wonder if .su (Soviet Union) is in use anywhere...

# Michael S. Kaplan on Sunday, January 08, 2006 5:00 PM:

Hi Gabe --

All you have said is true. But how well do you think this implementation will assist such people?

Hi Mike --

They were at least perhaps wise enough to not include that one on the list? :-)

# Dean Harding on Sunday, January 08, 2006 5:31 PM:

The problem is that most spam coming from China or wherever doesn't actually use a .cn TLD. The only way we know it's from China is because the originating IP is registered there. If they were able to filter by the first Received: header, then maybe we'd have an interesting feature....

I think the best anti-spam idea in Exchange SP2 is the Sender-ID feature, where it does a DNS query on the domain in the sender's From: header and checks that the sender's IP address is registered with that domain as a valid originating IP for the domain. This stops spammers from being able to set the From: address to @hotmail.com or @somebank.com
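
The shape of that check can be sketched in Python as follows. This is a simplification under loud assumptions: the `PERMITTED_SENDERS` table is a hypothetical stand-in for the records a real implementation would fetch over DNS, the domain and addresses are documentation placeholders, and the real Sender-ID rules for choosing which header to check are more involved than this.

```python
import ipaddress
from email.utils import parseaddr

# Hypothetical stand-in for the records a real check would look up in DNS.
PERMITTED_SENDERS = {"example.com": ["192.0.2.0/24"]}


def sender_is_permitted(from_header, connecting_ip, records=PERMITTED_SENDERS):
    """True if connecting_ip is listed as a valid origin for the From: domain."""
    _, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    ip = ipaddress.ip_address(connecting_ip)
    # A domain with no record simply fails here; a real policy would
    # probably treat "no record" as neutral rather than as a rejection.
    return any(ip in ipaddress.ip_network(net) for net in records.get(domain, []))


print(sender_is_permitted("Alice <alice@example.com>", "192.0.2.25"))
print(sender_is_permitted("Alice <alice@example.com>", "198.51.100.7"))
```

The point is that the decision rests on the IP address of the connecting server, which the sender cannot choose freely, rather than on the From: text alone.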

# Michael S. Kaplan on Sunday, January 08, 2006 6:38 PM:

Same sort of problem as trying to block .us, if you ask me (since if that is the goal, it will not succeed)....

# Dean Harding on Sunday, January 08, 2006 7:33 PM:

True, blocking whole countries or regions is not that useful. But all I was saying is that just going off the domain name of the sender's email address is totally pointless because you can just put whatever you like in the From: address. At least using the sender's IP address you get a better idea of where the sender actually is, rather than just wherever the sender tells you they are.

To be honest, I think this feature is more just for marketing than anything.

Sender-ID is the only anti-spam measure that I've seen, requiring zero input from the end-user, that is at all effective. All other anti-spam measures either a/ don't work (like this one), or b/ require some input from end-users (like Bayesian filtering, which requires the end-user to train the filter).

# Nick Lamb on Sunday, January 08, 2006 11:19 PM:

Not including .su is no wiser than not including .uk, both are missing from the ISO country code list, both were supposedly migration-only, and both are in active use (new domains being issued, registry is properly maintained) regardless of what ICANN might think about it. I'm sure some spam uses .su source addresses, as no doubt do a lot of non-spam emails from Russia.

Dean, the Received: header is trivially forgeable for all untrusted hops. Currently the situation is that we might, if everyone is willing to do their bit, be able to create an authenticated, trustworthy mail system by the end of the decade. That still wouldn't eliminate spam, but it would make source-identifier filters like this one actually DO something besides provide us with a laugh over Sunday breakfast.

# Dean Harding on Monday, January 09, 2006 1:00 AM:

Nick: Of course, but at least one of the hops is going to be trusted (i.e. your email server's hop), and you can do your filtering so that mail which went through any untrusted regions gets filtered out.

So, even if the spammer adds fake Received: headers, you'll still get a real Received: header with an IP address registered in China when the message hits your actual SMTP server.

Anyway, I'm not saying it's a good idea, just that it's better than basing the filtering on the TLD of the From: address, which is completely arbitrary. At least Received: headers are only arbitrary up to a point...
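
A small Python sketch of reading the one hop you can trust. It assumes your own server prepends its Received: header at the top and writes the connecting IP in square brackets, which is common but not guaranteed; the sample message, host names and addresses are made up. Mapping the IP to a country would need a registry database and is left out.

```python
import re
from email import message_from_string

# Matches a dotted IPv4 address written in square brackets.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

# Made-up message: the top Received: header is the one our own server added.
RAW = (
    "Received: from mail.example.net (mail.example.net [203.0.113.9])\n"
    "\tby mx.example.org with ESMTP; Mon, 9 Jan 2006 01:00:00 -0800\n"
    "Received: from forged.example.com (forged.example.com [192.0.2.1])\n"
    "\tby mail.example.net; Mon, 9 Jan 2006 00:59:00 -0800\n"
    "From: someone@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)


def last_hop_ip(raw_message):
    """Return the bracketed IP from the topmost Received: header, or None."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received", [])
    if not received:
        return None
    match = IP_IN_BRACKETS.search(received[0])
    return match.group(1) if match else None


print(last_hop_ip(RAW))
```

Anything below the top header could have been written by the sender, so the sketch deliberately ignores it.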

# Behnam "ZWNJ" Esfahbod on Monday, January 09, 2006 5:26 AM:

> Those "Select All" buttons are crazy. Who on earth would want to block *all* encodings?

Of course "Select All" is Microsoft's solution to reverse your selection. Just select all and remove those which you like. ;)

# Richard Gadsden on Monday, January 09, 2006 6:26 AM:

Other two-label .uk names:

parliament.uk
police.uk
mod.uk

# Heath Stewart on Monday, January 09, 2006 1:59 PM:

I worry about what Ben Cooke stated as well. Many emails or posts, in general, are encoded with a character set typically intended for languages I don't understand but are written in the Latin alphabet that comprises the first 128 characters of ASCII (at least the printable characters).

The feature as intended is interesting but I wonder if it truly checks that the content is in a language you don't understand or just encoded as such.

# Michael S. Kaplan on Monday, January 09, 2006 2:18 PM:

It is not checking content, it is checking for the overall encoding of the email.

So no worries on *that* point, Heath. :-)
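
For illustration only, here is roughly what "checking the overall encoding" could look like in Python. This is a guess at the general idea, not Outlook's implementation; the blocked set and the sample messages are hypothetical, and multipart messages are ignored.

```python
from email import message_from_string

BLOCKED_CHARSETS = {"koi8-r", "euc-kr"}  # hypothetical user selection


def is_blocked_by_charset(raw_message, blocked=BLOCKED_CHARSETS):
    """Decide from the declared charset alone; the body is never inspected."""
    msg = message_from_string(raw_message)
    charset = msg.get_content_charset()  # lowercased, or None if not declared
    return charset is not None and charset in blocked


english_in_koi8 = (
    'Content-Type: text/plain; charset="KOI8-R"\n'
    "\n"
    "This message is entirely in English.\n"
)
russian_in_utf8 = (
    'Content-Type: text/plain; charset="UTF-8"\n'
    "\n"
    "Привет\n"
)

print(is_blocked_by_charset(english_in_koi8))
print(is_blocked_by_charset(russian_in_utf8))
```

A filter of this shape drops the English message labelled KOI8-R and lets the Russian one in UTF-8 through, which is exactly the concern raised above.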

# Maurits on Monday, January 09, 2006 3:40 PM:

The most interesting anti-spam feature I've seen recently (and I've seen many) is the notion of blacklisting spamvertised URLs

See www.surbl.com and www.uribl.com
