Spam Filtering with gzip

Monday, 27 January 2003, 2:51 PM EST

Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.

A related technique allows us to measure how much a given, "test" text has in common with a corpus of possibly similar documents. If we concatenate the corpus and the test text, and gzip them together, the test text will get a better compression ratio if it has more fragments, words, or phrases in common with the corpus, and a worse ratio if it is dissimilar. Since the LZ algorithm scans the entire input for repetitions, it tends to map pieces of the test text to previous occurrences in the corpus, thereby achieving a high "appended compression ratio" if the test text is similar to what it's appended to.

In this case, we wish to compare an incoming email message against two possible corpora: spam and non-spam (ham). If we maintain archives of both, we can compare the appended compression ratios relative to each, to judge how similar a new message is to spam or ham.

[ Read more ]

Related items




Spotlight

Android Fake ID bug allows malware to impersonate trusted apps

Posted on 29 July 2014.  |  Bluebox Security researchers unearthed a critical Android vulnerability which can be used by malicious applications to impersonate specially recognized trusted apps - and get all the privileges they have - without the user being none the wiser.


Weekly newsletter

Reading our newsletter every Monday will keep you up-to-date with security news.
  



Daily digest

Receive a daily digest of the latest security news.
  

DON'T
MISS

Tue, Jul 29th
    COPYRIGHT 1998-2014 BY HELP NET SECURITY.   // READ OUR PRIVACY POLICY // ABOUT US // ADVERTISE //