Title-Only Spam Detection Research #1
I’ve been doing a bit of work on an adjunct to Bayesian spam filtering systems. The basic idea is to do some follow-on analysis of the email messages that Bayesian systems categorize as uncertain. I implement a hashing scheme, build a corpus, then check the mail that Bayesian filters don’t catch against that corpus.
Preamble
Let’s get some terminology and other stuff out of the way.
Spam: unsolicited email.
Bayesian: reaching a conclusion from uncertain evidence by weighing probabilities. The Wikipedia entry explains it.
Bayesian classification systems: A strict way of describing programs like SpamBayes or SpamAssassin.
Title or title header: This is the part of the email that is displayed as its title (the Subject header).
Body: This is the part of the email where most of the text of the email is kept.
Why work on this already-solved problem? As an email server admin, I don’t see the problem as ‘solved.’ Bayesian systems are very good, with some limitations. Focusing on those limitations stands the best chance of making spam filtering much better. The notion of “standing on the shoulders of giants” applies.
What are you doing differently? First, I’m analyzing only the title of the email. Introducing body analysis would recreate the work of Bayesian filters. Second, I’m applying a less well-known algorithm to the spam problem. Lastly, titles with variations on words can be reliably detected: ‘V1agra’, ‘Viagr@’ and ‘Viagra’ are detected as similar. The resulting corpus isn’t particularly large.
Reductionist dismissals of the work:
- This is a naive method, and therefore it’s non-special. I’d argue it’s a moot point. It’s less than 100% naive for a couple of reasons, all of which can easily be shouted down by someone with a higher socio-economic rank and sufficient buzz-word use.
- I’m unqualified. My socio-economic standing does not exceed the title researcher, programmer or scholar. Nor is the effort mathematically spectacular. Another person with sufficient socio-economic standing will probably implement it and garner far more attention.
- This article isn’t sufficiently filled with cryptic vocabulary, therefore the process isn’t special. It’s unfortunate that vocabulary is used to establish social rank and therefore exclude. I chose to minimize the buzz-wordiness. The next person pursuing this method won’t advance the concepts but will use far more cryptic language. They will, however, garner more attention using specialized vocabulary.
The Proposal: Detect Spam Analyzing the Email Title
Why choose to analyze only the title?
- Bayesian systems already analyze the email body.
- Bayesian systems tend to fail to detect spam with brief titles and similarly brief bodies.
- I want to be able to scale the solution well beyond Bayesian systems. As some email admins already know, Bayesian email filters are extremely resource intensive when you are hosting a large number of email accounts.
The Algorithm: Rabin-Karp
Why Rabin-Karp?
- Because it seems to be suited to the job of finding strings while exhibiting the potential to scale way beyond Bayesian classification systems.
- It is already used for plagiarism detection.
- Other string searching algorithms have problems when applied to spam detection. Most string searching algorithms are built on the assumption of discrete words. A spammer can break the filter by confounding the definition of a ‘word.’
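To make the idea concrete, here is a minimal sketch of the rolling hash at the heart of Rabin-Karp. It hashes every fixed-length window of a string and updates the hash in constant time as the window slides, so no notion of a ‘word’ is needed. The names `BASE`, `MOD` and `rolling_hashes` are assumptions made for illustration; the article does not say which hash parameters were used.

```python
BASE = 256         # assumed radix, not taken from the article
MOD = 1_000_003    # assumed prime modulus, not taken from the article


def rolling_hashes(text: str, window: int) -> list[int]:
    """Return the hash of every `window`-length substring of `text`, left to right."""
    if window <= 0 or len(text) < window:
        return []
    # Weight of the character that leaves the window on each slide.
    high = pow(BASE, window - 1, MOD)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) % MOD
    hashes = [h]
    for i in range(window, len(text)):
        h = (h - ord(text[i - window]) * high) % MOD  # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD           # append the newest character
        hashes.append(h)
    return hashes


# Windows shared by two strings produce identical hashes.
shared = set(rolling_hashes("viagra", 3)) & set(rolling_hashes("vagra", 3))
print(len(shared))
```

Because a window is just a run of characters, a disguised spelling still shares several windows with the plain one, which is what the corpus lookup relies on.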
Build a Corpus
Step 1: I grabbed 17 spam titles and stuck them in a text file. When building the corpus, I force everything into lower case and keep only the letters A through Z. The program then loops through the possible window sizes and generates a hash value for each window. I send the results into a SQLite database. There are *much* faster ways to store the data, but this worked for me in the research phase.
Some notes about the window used to compute the hashes. The window range I used was a minimum of 3 and a maximum of 10 characters. What one would do with the variability is an open question. If one uses too big a window to detect spam, then the filter will fail to detect spam. If one uses too small a window, this may be too resource intensive. I would think that randomly assigning the window would make it harder to break the filter. Maybe it’s a moot point and a small window should be the standard.
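The corpus-building step could be sketched roughly as follows. The window range of 3 to 10 and the use of SQLite come from the text above; the table layout (`corpus` with `width` and `hash` columns), the hash parameters `BASE` and `MOD`, and the function names are assumptions for illustration only.

```python
import re
import sqlite3

BASE, MOD = 256, 1_000_003      # assumed hash parameters
MIN_WINDOW, MAX_WINDOW = 3, 10  # window range described in the article


def normalize(title: str) -> str:
    """Lower-case the title and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", title.lower())


def window_hashes(text: str, window: int):
    """Yield a rolling polynomial hash for each `window`-length substring."""
    if len(text) < window:
        return
    high = pow(BASE, window - 1, MOD)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) % MOD
    yield h
    for i in range(window, len(text)):
        h = ((h - ord(text[i - window]) * high) * BASE + ord(text[i])) % MOD
        yield h


def build_corpus(conn: sqlite3.Connection, titles) -> None:
    """Store (window width, hash) pairs for every title; the schema is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS corpus "
        "(width INTEGER, hash INTEGER, PRIMARY KEY (width, hash))"
    )
    for title in titles:
        text = normalize(title)
        for width in range(MIN_WINDOW, MAX_WINDOW + 1):
            conn.executemany(
                "INSERT OR IGNORE INTO corpus (width, hash) VALUES (?, ?)",
                ((width, h) for h in window_hashes(text, width)),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_corpus(conn, ["Viagra"])
print(normalize("V1agra"), normalize("Viagr@"))
```

Note that normalization turns ‘V1agra’ into `vagra` and ‘Viagr@’ into `viagr`, so neither equals `viagra` outright, but both still share short windows with it. That is one argument for keeping the small end of the window range.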
Test the Classification System
To test the accuracy of classifying emails, I do a simple SQL lookup on windowed (see Wikipedia reference) hashes of a title string to see if each hash exists in the database. If it does, the title gets 3 points. The process for each email title goes something like this:
- If the sum of points squared is less than 120 then the program keeps checking the title string for more hashes.
- If the current sum of points squared is greater than 120, then I assume it is spam and move to the next email. The high score results because the title is very close to what’s in the corpus. I don’t have to search the whole subject string if the score breaks 120 sooner than the end of the string.
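The scoring loop above might be sketched like this. The 3 points per hit and the threshold of 120 come from the text; reading “sum of points squared” as the square of the running point total is an assumption, as are the fixed window of 3, the `corpus` table layout, the hash parameters and all function names.

```python
import re
import sqlite3

BASE, MOD = 256, 1_000_003  # assumed hash parameters
POINTS_PER_HIT = 3          # from the article
SPAM_THRESHOLD = 120        # from the article


def normalize(title: str) -> str:
    """Lower-case the title and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", title.lower())


def window_hashes(text: str, window: int):
    """Yield a rolling polynomial hash for each `window`-length substring."""
    if len(text) < window:
        return
    high = pow(BASE, window - 1, MOD)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) % MOD
    yield h
    for i in range(window, len(text)):
        h = ((h - ord(text[i - window]) * high) * BASE + ord(text[i])) % MOD
        yield h


def add_to_corpus(conn, title: str, window: int = 3) -> None:
    """Hypothetical helper: store the window hashes of one known spam title."""
    conn.execute("CREATE TABLE IF NOT EXISTS corpus (width INTEGER, hash INTEGER)")
    conn.executemany(
        "INSERT INTO corpus (width, hash) VALUES (?, ?)",
        ((window, h) for h in window_hashes(normalize(title), window)),
    )


def score_title(conn, title: str, window: int = 3):
    """Return (score, is_spam), where score is the squared point total (assumed reading)."""
    points = 0
    for h in window_hashes(normalize(title), window):
        hit = conn.execute(
            "SELECT 1 FROM corpus WHERE width = ? AND hash = ? LIMIT 1", (window, h)
        ).fetchone()
        if hit:
            points += POINTS_PER_HIT
            if points ** 2 > SPAM_THRESHOLD:
                return points ** 2, True  # stop early, no need to scan the rest
    return points ** 2, False


conn = sqlite3.connect(":memory:")
add_to_corpus(conn, "Viagra")
print(score_title(conn, "I'm tired of viagra ads"))      # (144, True)
print(score_title(conn, "Fwd: Cookie Booth this Sun?"))  # (0, False)
```

Under this reading, four hits give 12 points and a score of 144, which crosses the threshold. Scores that are not squares of a multiple of 3 would need a different scoring rule, so treat this as one plausible interpretation, not the exact program.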
Test Results
| Title | Score | Corpus Info |
| --- | --- | --- |
| oysterperpetual cosmographdaytona | 256 | in corpus |
| Fwd: Cookie Booth this Sun? | 0 | Not spam, not in corpus |
| Casino’s_Best PlAyer is Welcome!! | 64 | Gold Best Casino : Usa Player welcome!!!! |
| I’m tired of viagra ads | 144 | Not in corpus, a title from an exasperated friend |
| sale visit our website today and buy replica items cheaper | 9 | Not in corpus |
| @derall and V1codin online | 64 | _Percocet__Adderall_Cialis _Viagra_Ritalin!!! |
Next Step
I need a bigger body of email titles to test and add to the corpus. If you would like to provide some, then please contact me. The title header just needs to be in a flat file, one per line, no opening/closing quotes needed.
Another step to sanitize the subject data is to strip “the”, most prepositions, and “of the”, “in the”, “to the”, “on the”, “for the”, “and the”, “that the”, “at the”, “to be” and “in a” out of the subject string. I don’t know what effect this would have, but it is all noise.
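A rough sketch of that sanitizing step, run before the title is reduced to bare letters, could look like this. Only the phrases listed above are included; “most prepositions” is left out because the exact list isn’t specified. The names `NOISE_PHRASES` and `strip_noise` are hypothetical.

```python
import re

# Phrases named in the article; the bare "the" is listed last.
NOISE_PHRASES = [
    "of the", "in the", "to the", "on the", "for the",
    "and the", "that the", "at the", "to be", "in a", "the",
]

# Match whole words only, allowing any run of whitespace inside a phrase.
_NOISE_RE = re.compile(
    r"\b(?:"
    + "|".join(r"\s+".join(map(re.escape, p.split())) for p in NOISE_PHRASES)
    + r")\b",
    re.IGNORECASE,
)


def strip_noise(subject: str) -> str:
    """Remove the listed noise phrases and collapse leftover whitespace."""
    cleaned = _NOISE_RE.sub(" ", subject)
    return " ".join(cleaned.split())


print(strip_noise("Best deals of the week in a box"))
```

Whether this helps or hurts detection is untested here; it only shows how the stripping could be done without disturbing words such as “theory” that merely contain a noise word.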
