Spamgraph

This page describes our work on graph-based spam filtering and links to code that may help others reproduce it.

This work was published in IEEE Computer under the title Leveraging Social Networks to Fight Spam. An earlier preprint, titled Personal Email Networks: An Effective Anti-Spam Tool, is available on arXiv.

Conferences

Here are the slides from the talk.

The original paper, Personal Email Networks: An Effective Anti-Spam Tool, was presented at Spamconference. Photos from the 2005 Spamconference are online.

The second work, Let Your CyberAlter Ego Share Information and Manage Spam, will be presented by Joseph Kong at the Second Conference on Email and Anti-Spam.

Reproducing our results

(if you attempt this, please edit this page to include better instructions)

Originally I used Perl to parse the mbox files and create a graph that could be read by my C++ code (Netmodeler and its accompanying program, spamgraph). Once Perl had produced the graph, spamgraph classified each sender as a spammer, non-spammer, or unknown, and wrote out a blacklist and a whitelist. Then, again with Perl, the accuracy of the blacklist and whitelist was tested.

The Perl code can be downloaded here. A warning: while I pride myself on being a good programmer, these scripts are total hack jobs that were created to produce a paper and were never intended to be read by others, so beware.

The programs were:

  • make_graph.pl (parses the email and produces the Netmodeler graph)
  • spamgraph (C++ program; reads the Netmodeler graph and classifies nodes/edges)
  • list_check.pl (uses the blacklist/whitelist made by spamgraph to check accuracy against hand-sorted mbox files)
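For anyone rewriting the first step rather than reusing the Perl, here is a minimal Python sketch of turning mail headers into a graph. It is not a port of make_graph.pl: the function names are hypothetical, and the choices to link every pair of addresses that share a message header and to drop the mailbox owner's own address are assumptions, as is the in-memory dictionary used in place of the Netmodeler graph file format.

```python
import email
import mailbox
from email.utils import getaddresses
from itertools import combinations


def addresses_in(message):
    """Return the lowercased addresses found in the From, To and Cc headers."""
    headers = []
    for name in ("From", "To", "Cc"):
        headers.extend(message.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(headers) if addr}


def build_graph(messages, owner=None):
    """Map each address to the set of addresses it shares a header with.

    Assumption: two addresses are linked when they appear together in one
    message, and the owner's address is left out of the graph.
    """
    owner = owner.lower() if owner else None
    graph = {}
    for message in messages:
        people = addresses_in(message)
        people.discard(owner)
        for person in people:
            graph.setdefault(person, set())  # keep isolated senders as nodes
        for a, b in combinations(sorted(people), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_from_mbox(path, owner=None):
    """Build the graph from an mbox file on disk."""
    return build_graph(mailbox.mbox(path), owner)


# Tiny in-memory example; all addresses are made up.
raw_messages = [
    "From: alice@example.org\nTo: me@example.org, bob@example.org\n\nhello",
    "From: bulk@example.net\nTo: me@example.org\n\nbuy now",
]
sample = build_graph(
    (email.message_from_string(raw) for raw in raw_messages),
    owner="me@example.org",
)
```

In the example, alice and bob end up linked, while the bulk sender is an isolated node because the only other address on its message is the owner's.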

For the final paper submission to IEEE Computer, we wanted to include the results of using this algorithm to train CRM114. For this I wrote the Python code, most of which is CRM114-related.

For the Python version, I did not implement the "greylist splitting" technique using edge betweenness; this is done only in the C++ version.
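Anyone who wants to add that piece to a Python version will need an edge betweenness computation. The sketch below is a generic breadth-first (Brandes-style) calculation for an undirected, unweighted graph, not a translation of the C++ code. It assumes the graph is a dictionary mapping each node to the set of its neighbours, with every node present as a key; the function name is hypothetical, and how the scores are then used to split a greylisted component is left open.

```python
from collections import deque


def edge_betweenness(graph):
    """Return {frozenset({u, v}): score} for every edge of the graph.

    The score of an edge is the number of shortest paths between node
    pairs that cross it, with ties split evenly among equal-length paths.
    """
    score = {frozenset((u, v)): 0.0 for u in graph for v in graph[u]}
    for source in graph:
        # Breadth-first search that counts shortest paths from source.
        dist = {source: 0}
        paths = {source: 1}
        preds = {node: [] for node in graph}
        order = []
        queue = deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
                    preds[v].append(u)
        # Walk back from the farthest nodes, pushing path shares onto edges.
        delta = {node: 0.0 for node in order}
        for v in reversed(order):
            for u in preds[v]:
                share = paths[u] / paths[v] * (1.0 + delta[v])
                score[frozenset((u, v))] += share
                delta[u] += share
    # Every unordered pair was visited from both of its endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}
```

The edge with the highest score is the natural candidate to cut when a component looks like two communities joined by a thin bridge.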

The bottom line is that you will probably need to write some code yourself, or follow exactly in my steps (Perl, then C++), if you want to repeat the experiment. The Python code does not implement all of the algorithms, since it was only used for the CRM114 part.

Related Work

Related Research Projects

Individuals or research groups working on related approaches should feel free to add a link to their site in this section.