COPS and HPDC Best Paper Awards

The paper I presented at the COPS 2008 Workshop, “Social VPNs: Integrating Overlay and Social Networks for Seamless P2P Networking” won the workshop’s best paper award. This was joint work with Renato Figueiredo, Pierre St. Juste, and David Wolinsky.

The SocialVPN project uses social networking to build P2P VPNs. It makes it trivial to communicate, play games, share music, etc. with anyone in your social network with any IP-compatible software. You can find the slides here. Below, you will see a photo of me giving the talk:

Giving the SocialVPN Talk

In addition, a related work: “Improving Peer Connectivity in Wide-area Overlays of Virtual Workstations” by Arijit Ganguly, David Wolinsky, and Renato Figueiredo. and myself was presented at HPDC 2008 by Arijit Ganguly. This work also won the best paper award.

Congratulations to my co-authors!

Why Spam Filters Suck

I was asked by Brendan I. Koerner why spam filters suck. An excerpt from my answer appears in this month’s Wired magazine.

One thing they didn’t mention which I think bears repeating is why spam filtering is so hard. Here is what I said:

Spam filtering is an adversarial problem. You want your email spam free, but
spammers want you to read their messages. Most technological problems don’t
have this aspect. The spam problem is more similar to automated crime
fighting than building a better search engine, for instance. As text spam
filters have increased in accuracy we’ve seen spammers move to image (jpg and
gif) and pdf based spam.

Spam classification is hard. If you hired a human to do the job, you may not
get better results. Certainly some messages are easier for a human to
classify than a computer, but the reverse is also true. Bill Yerazunis,
author of the spam classifier CRM114, estimated his own classification
accuracy as 99.84%, which was worse than his software did:

http://crm114.sourceforge.net/wiki/doku.php?id=news

I suppose I was asked due to my research in spam filtering based on social email networks, work which has been in the news before (Slashdot, New Scientist, Nature News).

Brunet: a remarkably great P2P library

While I was a post-doc at UCLA, I started work on a library to implement P2P protocols, which we called Brunet. Unfortunately, Brunet has never been publicized as well as it should have been. This post is an attempt to bring some attention to it.

Brunet is a Free Software (GPL licensed) library for P2P networking written in C# and developed using Mono, but it also runs on Microsoft’s .Net platform. Here are the basic features of Brunet:

  1. Network transports are abstracted as Edge objects. We have TCP, UDP and TLS/SSL transports now.
  2. Completely distributed UDP NAT traversal (TCP NAT traversal is a to-do item).
  3. Implementation of Chord/Kleinberg/Symphony type ring topology for routing.
  4. DHT implementation.
  5. Both packet based and RPC based communication primitives. New protocols
    are easy to add using one or the other of these approaches.
  6. XML-RPC bridge to allow calling or serving RPC methods with XML-RPC clients or servers. We have examples of this using standard Python.
  7. Distributed tunneling system to maintain topology in the presence of some routing difficulties (untraversable NATs, BGP outages, firewalls, etc.).

We have deployed Brunet on PlanetLab for over two years. You can see information on some of the running nodes on our node statistics page.

Brunet is used as the basis for our IPOP (IP over P2P) project. IPOP builds a virtual IP network which allows nodes to communicate even if some of them are behind firewalls or NATs (as is often the case these days).

The WOW/Grid Appliance project uses IPOP and Brunet’s DHT to produce self-managing grids for large-scale, wide-area grid computing. You can download images of the grid appliance for many virtual machines/hypervisors to try them out yourself.

We encourage outside use of any of the above projects and we welcome code contributions to Brunet. If you are interested in any of the above projects, please join our ACISP2P Google Group. You can download the Brunet code using Mercurial. You can also browse the Brunet documentation online.

Publications

:

  1. P. Oscar Boykin, Jesse S. A. Bridgewater, Joseph S. Kong, Kamen M. Lozev, Behnam A. Rezaei, Vwani P. Roychowdhury, A Symphony Conducted by Brunet, arXiv:0709.4048v1
  2. A. Ganguly, A. Agrawal, P. O. Boykin, R. J. Figueiredo, Wow: Self-Organizing Wide Area Overlay Networks Of Virtual Workstations, Journal of Grid Computing, Vol. 5, No. 2, pp. 151-172 2007. [bib]
  3. Arijit Ganguly, Abhishek Agrawal, P. Oscar Boykin and Renato Figueiredo, “IP over P2P: Enabling Self-Configuring Virtual IP Networks for Grid Computing”, In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rhodes Island, Greece. [ppt talk, pdf talk]
  4. Arijit Ganguly, Abhishek Agrawal, P. Oscar Boykin, Renato Figueiredo, “WOW: Self-Organizing Wide Area Overlay Networks of Virtual Workstations”, In Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC), pages 30-41. Paris.
  5. Arijit Ganguly, David Wolinsky, P. O. Boykin, Renato Figueiredo, “Decentralized Dynamic Host Configuration in Wide-Area Overlay Networks of Virtual Workstations”. In Workshop on Large-Scale and Volatile Desktop Grids (PCGrid), 03/2007

Finding Bugs

The WOW project uses the Brunet P2P library developed in C# using Mono.

Recently, we noticed a memory leak on our nodes on planetlab. David Wolinsky identified exactly the bit of code causing the leak and reported a bug. The bug was fixed within a day and the SVN has the fix already.

Kudos to David for tracking down the bug (notice many people on the list said there was probably not really a bug). Kudos also to the Mono team for getting a fix in after the issue was clearly identified. This is an example of the power of Free Software. After finding the bug, David was able to build a version of mono that same day that had the bug fixed which we could deploy on planetlab. Try that with proprietary software.

Percolation Search on Complex Networks

The journal version of the percolation search paper, entitled Scalable Percolation Search on Complex Networks has appeared in the Journal of Theoretical Computer Science. This is joint work with Nima Sarshar and Vwani Roychowdhury.

IPOP: IP over a P2P overlay.

A paper using P2P overlays for virtual IP networks will be presented at IPDPS 06 next month. This is joint work with Professor Renato Figueiredo and students Arijit Ganguly and Abhishek Agrawal.

The main problem we are addressing is virtual networks for grid computing. We want to enable virtual nodes (compute nodes running on Xen or VMWare to migrate across the physical network and keep all their virtual network connections active. We want nodes be able to be hosted behind NAT/Firewalls without changing any existing network policies (including reconfiguring the firewalls). We also want the network to self organize so that new nodes can be added with no administration of the virtual network. IPOP is our project to meet these goals.

IPOP uses the Brunet P2P library. Each node has a Brunet address which enables all other nodes to route packets to them. Brunet automatically handles NAT traversal (without any addiitonal servers or specially designated third parties) and makes sure all pairs of nodes can communicate, regardless of any firewalls and NATs which may be between them. Secondly, since Brunet addresses are independent of the underlying network addresses (such as TCP/IP or UDP/IP which may be transporting Brunet), the node can move to a different IP address, but keep its same Brunet address. This application layer mobility allows migration of virtual machines without assigning new virtual IP addresses.

This project views P2P primarily as paradigm of self-organization to build maintainence free information systems. We are also interested in using the IPOP/Brunet infrastructure to provide effient scheduling of grid resources and self-reconfiguration of grid resources. A second IPOP paper covering our initial work on migration and reconfiguration will appear in HPDC 06.

Disaster Management in Scale-Free Networks

My co-authors and I have posted a new pre-print to the Arxiv:
Disaster Management in Scale-Free Networks: Recovery from and Protection Against Intentional Attacks.

This paper studies the effect of continuous attacks on scale-free networks (such as P2P networks), and how to recover from an attack. Since P2P systems are often billed as robust, it is important to consider not only random failures, but also attacks. We see that if the network is reactive, we can recover almost instantly from attacks.

Functionality from Network Structure

A new preprint is on the archive: Functionality Encoded In Topology? Discovering Macroscopic Regulatory Modules from Large-Scale Protein-DNA Interaction Networks. This is joint work with Riccardo Boscolo, Behnam A. Rezaei, and Vwani P. Roychowdhury. Sadly, my name is wrong on the current draft of the paper. Hopefully that will be fixed soon.

The paper basically discusses the results of applying a community finding algorithm to a Protein-DNA regulatory network. Interestingly, a hierarchical structure is revealed that corresponds to known functionality.

SpamConference Wrap-up

I am back from the SpamConference. Slashdot has a thread on the SpamConference also. I had a lot fun, and just missed the bad weather.

Of the talks, I was particularly interested in (listed in order of presentation):

Finally, I got to meet many interesting people, including Meng Wong, founder of Pobox.com (a service I have used since mid 1995). Recently he has been helping the world with his work on SPF, Bill Yerazunis author of CRM114, as well as all of the speakers.

From the SpamConference

I just finished my talk at SpamConference 2005. Look for this work to appear in IEEE Computer soon. In the mean time, find my slides here, and the older preprint on the arxiv (it does not include results about bootstrapping).

Someone asked about my figures from the slide. You can find them here.

Spam Conference

I am speaking at the MIT Spam Conference on my network based spam classifier. This work will also be appearing in an upcoming issue of Computer.

Balanced Overlay Networks

My co-author Jesse Bridgewater just submitted our new paper “Balanced Overlay Networks: Decentralized Load Balancing via Self-Organized Random Networks” to the Arxiv. This paper is an improved version of the algorithm from the paper “A Statistical Mechanical Load Balancer for the Web”. In the current paper, we use a much more effective algorithm, however it is one for which we have not yet analytically solved the dynamics.

This new algorithm works basically the same way as the previous: nodes accept jobs from other nodes, and they keep an in-degree (on a directed graph) proportional to their unused resources. In the current paper, we show that by taking a short random walk on the graph, and sending a job to the least loaded node on the walk, we get an almost perfectly balanced graph: the degree distribution is almost a delta function.

This could be very useful for web mirrors of such software as Debian. If there was a small Apache module which kept this network, Apache servers could easily redirect clients to nodes on the network which are less loaded. This would make for near optimal usage of mirror bandwidth.

Load Balancing

My coauthors (Jesse Bridgewater and Vwani Roychowdhury) and I just got a new preprint on the Arxiv: “A Statistical Mechanical Load Balancer for the Web”.

This work gives a decentralized randomized algorithm for finding nodes to which load may be transfered. In the process, we show how the algorithm increases entropy of the graph, and results in the formation of an Erdos-Renyi random graph. The paper describes a version of the algorithm that we can analyze: namely that nodes are discovered by taking the last node on a random walk. If we instead select the least loaded node on a random walk the results improve quite a lot. This will be the subject of a future work.

Percolation Search on Slashdot

Slashdot has found the percolation search result.

It is amazing how little the people who comment bother to read. This work is really about showing that global P2P searches can be scalable and giving such a scalable algorithm.

I like the conspiracy theory comment. If this guy is correct, Google must have failed to get in touch with me. Where are my Google stock options?

Percolation Search getting attention

It seems the percolation search is getting some attention. It won the best paper award at IEEE P2P2004. TRN Magazine is running a story on our percolation search algorithm.

The algorithm itself is really simple and could greatly improve systems like Gnutella and Kazaa immediately, but it also has more applications. Percolation theory is a really fun topic. In this paper, it is exciting to see this theory harnessed to greatly improve scaling of distributed search.

Percolation Search

A work with co-authors Nima Sarshar and Vwani Roychowdhury has finally reached the Arxiv: Scalable Percolation Search in Power Law Networks. It has also been accepted to P2P2004, the IEEE conference on P2P.

This is an exciting paper. It shows that flooding on unstructured P2P networks is not neccesary to find all content. In fact, one can send queries to each neighbor with a probability which approaches zero as the network grows, and still find all the content. Total communications complexity can be made to scale like O(log(N)) or O(sqrt(N)) for a network of size N (depending on the setting). This compares to O(N) in existing systems.

I hope this work gets some attention from the practical P2P community. It could really decrease the bandwidth required for P2P systems.

Free Dendrograms

I have been interested in making Dendrograms, but I have not seen any Free Software which can do this. So, like any hacker, I scratched an itch.

I wrote Dendro.py, and added it to the Netmodeler CVS. It takes a simple text format of the dendrogram and spits out code which will make Octave draw the dendrogram.

As an example, here is a picture of a graph, here are the nm and dot versions of it. Here is the text represenation of the dendrogram. Finally, behold the dendrogram given by a clustering algorithm:

Fun stuff.