Title-Only Spam Detection Research #1

I’ve been doing a bit of work on an adjunct to Bayesian spam filtering systems. The basic idea is to do some follow-on analysis of the email messages that Bayesian systems categorize as uncertain. I implement a hashing scheme, build a corpus of known spam titles, then check the mail that Bayesian filters don’t catch against that corpus.

Preamble

Let’s get some terminology and other stuff out of the way.

Spam: unsolicited email.

Bayesian: reaching a conclusion from uncertain evidence by weighing probabilities. The Wikipedia entry explains it.

Bayesian classification systems: A strict way of describing programs like SpamBayes or SpamAssassin.

Title or title header: The Subject header, the part of the email that is displayed as its title.

Body: This is the part of the email where most of the text of the email is kept.

Why did I choose to work on this already-solved problem? As an email server admin, I don’t see the problem as ‘solved.’ Bayesian systems are very good, with some limitations. Focusing on those limitations stands the best chance of making spam filtering much better. The notion of “standing on the shoulders of giants” applies.

What are you doing differently? First, I’m analyzing only the title of the email. Introducing body analysis would recreate the work of Bayesian filters. Second, I’m taking a less well-known algorithm and applying it to the spam problem. Lastly, titles with variations on words can be reliably detected: ‘V1agra’, ‘Viagr@’ and ‘Viagra’ are detected as similar. The resulting corpus isn’t particularly large.

Reductionist dismissal of the work:

  • This is a naive method, and therefore it’s non-special.  I’d argue it’s a moot point. It’s less than 100% naive for a couple of reasons, all of which can easily be shouted down by someone with a higher socio-economic rank and sufficient buzz-word use.
  • I’m unqualified.   My socio-economic standing does not exceed the title researcher, programmer or scholar.  Nor is the effort mathematically spectacular.   Another person with sufficient socio-economic standing will probably implement it and garner far more attention.
  • This article isn’t sufficiently filled with cryptic vocabulary, therefore the process isn’t special. It’s unfortunate that vocabulary is used to establish social rank and therefore exclude. I chose to minimize the buzz-wordiness. The next person pursuing this method won’t advance the concepts but will use far more cryptic language. They will, however, garner more attention by using specialized vocabulary.

The Proposal: Detect Spam Analyzing the Email Title

Why choose to analyze only the title?

  1. Bayesian systems already analyze the email body.
  2. Bayesian systems tend to fail to detect spam with brief titles and similarly brief bodies.
  3. I want to be able to scale the solution well beyond Bayesian systems.  As some email admins already know, Bayesian email filters are extremely resource intensive when you  are hosting a large number of email accounts.

The Algorithm: Rabin-Karp

Why Rabin-Karp?

  1. Because it seems to be suited to the job of  finding strings while exhibiting the potential to scale way beyond Bayesian classification systems.
  2. It is already used as a plagiarism detector.
  3. Other string searching algorithms have problems when applied to spam detection. Most string searching algorithms are built on the assumption of discrete words. A spammer can break the filter by confounding the definition of a ‘word.’
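
For readers who have not met Rabin-Karp, here is a minimal sketch of the rolling hash at its core. This is not my research code: the names `BASE`, `MOD` and `rolling_hashes` are made up for illustration, and the base and modulus are assumed values. The point is that each slide of the window costs a constant amount of work, and that the hash depends only on the characters in the window, not on where any ‘word’ begins or ends.

```python
BASE = 256            # assumed alphabet size for the polynomial hash
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)


def rolling_hashes(text: str, window: int) -> list:
    """Return the hash of every `window`-character substring of `text`."""
    if window <= 0 or len(text) < window:
        return []
    # Weight of the character that leaves the window on each slide.
    high = pow(BASE, window - 1, MOD)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) % MOD
    hashes = [h]
    for i in range(window, len(text)):
        # Drop the outgoing character, shift, then add the incoming one.
        h = (h - ord(text[i - window]) * high) % MOD
        h = (h * BASE + ord(text[i])) % MOD
        hashes.append(h)
    return hashes


# "agr" hashes to the same value wherever it appears.
print(rolling_hashes("viagra", 3)[2] == rolling_hashes("vagra", 3)[1])  # True
```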

Build a Corpus

Step 1: I grabbed 17 spam titles and stuck them in a text file. When building the corpus, I keep only the letters A through Z and force everything into lower case. The program then loops through the possible window sizes and generates a hash value for each window position. I send the results into a SQLite database. There are *much* faster ways to store the data, but this worked for me in the research phase.
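
The corpus-building step might look roughly like the sketch below. Treat it as an illustration under assumptions rather than my actual program: the table name `corpus`, its columns, and the helper names are hypothetical, and each window is hashed directly with a simple polynomial hash instead of a true rolling update, just to keep the example short.

```python
import re
import sqlite3

BASE = 256                      # assumed hash base
MOD = (1 << 61) - 1             # assumed modulus (a Mersenne prime)
MIN_WINDOW, MAX_WINDOW = 3, 10  # window range used in this article


def normalize(title: str) -> str:
    """Keep only the letters a-z, after forcing lower case."""
    return re.sub(r"[^a-z]", "", title.lower())


def poly_hash(chunk: str) -> int:
    """Polynomial hash of one window (same value a rolling hash would give)."""
    h = 0
    for ch in chunk:
        h = (h * BASE + ord(ch)) % MOD
    return h


def build_corpus(titles, conn) -> None:
    """Store the hash of every window of every title (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corpus ("
        "win_size INTEGER, hash INTEGER, PRIMARY KEY (win_size, hash))"
    )
    for title in titles:
        text = normalize(title)
        for size in range(MIN_WINDOW, MAX_WINDOW + 1):
            rows = [(size, poly_hash(text[i:i + size]))
                    for i in range(len(text) - size + 1)]
            conn.executemany("INSERT OR IGNORE INTO corpus VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
build_corpus(["Viagra!!!"], conn)
print(conn.execute("SELECT COUNT(*) FROM corpus").fetchone()[0])  # 10
```

A six-letter title yields 4 + 3 + 2 + 1 = 10 hashes across window sizes 3 to 6, which is one reason the corpus stays small.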

Some notes about the window used to compute the hashes. The window range I used was a minimum of 3 and a maximum of 10 characters. What one would do with the variability is an open question. If one uses too big a window, the filter will miss spam whose title only partly matches a corpus entry. If one uses too small a window, the lookups may be too resource intensive. I would think that randomly assigning the window would make it harder to break the filter. Maybe it’s a moot point and a small window should be the standard.
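
To make the window trade-off concrete, here is a small sketch that compares the raw windows of two titles. It deliberately skips the hashing and the database and works on plain substrings, and the function names are hypothetical. It assumes the same clean-up as the corpus step (letters only, lower case).

```python
import re


def normalize(title: str) -> str:
    """Keep only the letters a-z, after forcing lower case."""
    return re.sub(r"[^a-z]", "", title.lower())


def windows(text: str, size: int) -> set:
    """All substrings of `text` that are exactly `size` characters long."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def shared_windows(a: str, b: str, size: int) -> set:
    """Windows that two titles have in common after normalization."""
    return windows(normalize(a), size) & windows(normalize(b), size)


print(sorted(shared_windows("Viagra", "V1agra", 3)))  # ['agr', 'gra']
print(sorted(shared_windows("Viagra", "V1agra", 5)))  # []
```

With a window of 3 the obfuscated title still overlaps the plain one; with a window of 5 the overlap disappears, which is the failure mode of an overly large window.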

Test the Classification System

To test the accuracy of classifying emails, I do a simple SQL lookup on windowed (see the Wikipedia reference) hashes of a title string to see if each hash exists in the database. If it does, the title gets 3 points. The process for each email title goes something like this:

  • If the sum of points squared is less than 120 then the program keeps checking the title string for more hashes.
  • If the current sum of points squared is greater than 120, then I assume it is spam and move to the next email.   The high score results because the title is very close to what’s in the corpus.  I don’t have to search the whole subject string if the score breaks 120 sooner than the end of the string.
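
The scoring loop can be sketched as follows. Several things here are assumptions: “sum of points squared” is read as the square of the running total, an in-memory set of substrings stands in for the SQLite hash lookup, and windows are tried from smallest to largest. Under this reading every score is the square of a multiple of 3, which does not account for scores such as 64 or 256, so the real program evidently weights some hits differently. The sketch only shows the early-exit idea.

```python
import re

POINTS_PER_HIT = 3              # from the article
THRESHOLD = 120                 # from the article
MIN_WINDOW, MAX_WINDOW = 3, 10  # window range used in this article


def normalize(title: str) -> str:
    """Keep only the letters a-z, after forcing lower case."""
    return re.sub(r"[^a-z]", "", title.lower())


def substrings(title: str) -> set:
    """Every window of a normalized title (stand-in for stored hashes)."""
    text = normalize(title)
    return {text[i:i + size]
            for size in range(MIN_WINDOW, MAX_WINDOW + 1)
            for i in range(len(text) - size + 1)}


def score_title(title: str, corpus: set) -> int:
    """Square of the running point total, stopping once it passes 120."""
    text = normalize(title)
    points = 0
    for size in range(MIN_WINDOW, MAX_WINDOW + 1):
        for i in range(len(text) - size + 1):
            if text[i:i + size] in corpus:
                points += POINTS_PER_HIT
                if points ** 2 > THRESHOLD:
                    return points ** 2  # early exit: treat as spam
    return points ** 2


corpus = substrings("Viagra")
print(score_title("Viagra now", corpus))    # 144, stops after four hits
print(score_title("V1agra", corpus))        # 81, below the threshold
print(score_title("Cookie Booth", corpus))  # 0
```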

Test Results

Title | Score | Corpus Info
oysterperpetual cosmographdaytona | 256 | in corpus
Fwd: Cookie Booth this Sun? | 0 | Not spam, not in corpus
Casino’s_Best PlAyer is Welcome!! | 64 | Gold Best Casino : Usa Player welcome!!!!
I’m tired of viagra ads | 144 | Not in corpus, a title from an exasperated friend
sale visit our website today and buy replica items cheaper | 9 | Not in corpus
@derall and V1codin online | 64 | _Percocet__Adderall_Cialis _Viagra_Ritalin!!!

Next Step

I need a bigger body of email titles to test and add to the corpus.   If you would like to provide some, then please contact me.  The title header just needs to be in a flat file, one per line, no opening/closing quotes needed.

Another step to sanitize the subject data is to strip “the”, most prepositions, and “of the”, “in the”, “to the”, “on the”, “for the”, “and the”, “that the”, “at the”, “to be” and “in a” out of the subject string.  I don’t know what effect this would have, but it is all noise.
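
A rough sketch of that clean-up step, untested against real mail. It removes only the phrases quoted above (“most prepositions” is left out because I have not settled on a list), and it assumes the stripping happens before the letters-only normalization, while word boundaries still exist. The names `STOP_PHRASES` and `strip_noise` are made up for the example.

```python
import re

# Only the phrases named in the text; longer phrases are tried first.
STOP_PHRASES = [
    "of the", "in the", "to the", "on the", "for the",
    "and the", "that the", "at the", "to be", "in a", "the",
]

_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(p) for p in sorted(STOP_PHRASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def strip_noise(subject: str) -> str:
    """Remove the stop phrases as whole words and collapse extra spaces."""
    cleaned = _PATTERN.sub(" ", subject)
    return " ".join(cleaned.split())


print(strip_noise("The best of the best in a box"))  # best best box
print(strip_noise("Fwd: Cookie Booth this Sun?"))    # unchanged
```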