Open Source Filtering Solutions and the Spam Problem
by Dinko Korunic - Senior Unix/Linux Security Specialist at InfoMAR - Monday, 16 July 2007.
Sender verify callout

SMTP callback verification or the sender verify callout is a simple way of checking whether the sender address found in the envelope is a really deliverable address or not. Unfortunately, verification probes are usually blocked by the remote ISP if they happen too often. Further, a remote MTA does not have to reject the unknown destinations (ie. Qmail MTA usually responds with "252 send some mail, iíll try my best"). To conclude: it is best to do verification per known spammer source domains which can be easily extracted from results of the other methods (such as the content analysis). The sender verification is supported in most FLOSS MTA: Postfix, Exim, Sendmail (via milter plugin), etc.

Content analysis

The content-based filtering is probably the core of most anti-spam filters available. It usually consists of several subtypes, so let us state a few. Static filtering is a type which triggers e-mail rejection on special patterns ("bad" words and phrases, regular expressions, blacklisted URI, "evil" numbers and similar) typically found in the e-mail headers or a body of an e-mail itself. False positives are quite possible with this method, so this type is best used in conjunction with policy-based systems (often named as heuristic filters) such as SpamAssassin and Policyd-weight. Such filters use the weighted results of several tests, typically hundreds of, to calculate a total score and decide if the e-mail is a spam or a ham. In this way, a failure in a single test does not necessarily decide the fate of an e-mail. At least several tests have to indicate a found spam content to accumulate the spam score enough for an e-mail to be flagged as a spam, so this results in a more reliable system. Of course, weighted/scoring type of a filter can contain all of the other filter types for its scoring methods.

The next type of the content analysis is the statistical filtering which mostly uses the naive Bayesian classifer for the frequency analysis of word occurrences in an e-mail. Such filtering, depending on an implementation, requires the initial training on an already presorted content and some retraining (albeit in much smaller scale) later on to obtain a maximum efficiency. The Bayesian filtering is surprisingly efficient and robust in all real life examples. It is implemented in the very popular SpamAssassin and DSPAM solutions, as well as Bogofilter, SpamBayes, POPFile and even in user e-mail clients such as Mozilla Thunderbird. Some implementations such as SpamAssassin use an output of other spam filtering methods for a retraining which gradually improves the hit/miss ratio. Most of the implementations (DSPAM, SpamAssassin) have a Web interface which allows a per-user view of the quarantined e-mail as well as the per e-mail retraining. It improves the quality of either the global dictionary (a database of learned tokens) or the individual per user dictionaries. DSPAM, for an instance, supports a whole range of additional features such as: combining of extracted tokens together to obtain a better accuracy, tunable classifiers, the initial training sedation, the automatic whitelisting, etc.

Another popular solution is CRM114 which is a superior classification system featuring 6 different classificators. It uses Sparse Binary Polynomial Hashing with Bayesian Chain Rule evaluation with full Bayesian matching and Markov weighting. CRM114 is both the classifier and a language. DSPAM and CRM114 are currently the two most popular and most advanced solutions in this field, and they are easily plugged into most SMTP services.

Note that plain Bayesian filters can be fooled with quite common Bayesian White Noise attacks which usually look like random nonsensical words (also known as a hashbuster) in a form of a simple poem. Such words are randomly chosen by a spammer mailer software to reflect a personal e-mail correspondence and therefore thwart the classifier. Most of the modern content analysis filters do detect such attacks - and so does SpamAssassin and DSPAM.


Harnessing artificial intelligence to build an army of virtual analysts

PatternEx, a startup that gathered a team of AI researcher from MIT CSAIL as well as security and distributed systems experts, is poised to shake up things in the user and entity behavior analytics market.

Weekly newsletter

Reading our newsletter every Monday will keep you up-to-date with security news.

Daily digest

Receive a daily digest of the latest security news.

Tue, Feb 9th