Anti-spam voting system

Bennett Haselton, 9/1/2003

Criteria that an anti-spam system should meet

This is my proposal for a very simple algorithm that I think would effectively stop all spam for large providers like HotMail, while allowing legitimate (opt-in) newsletters to be delivered to users. Its one limitation is that it works only for large providers with many user accounts; it would not be an effective anti-spam algorithm for a small, local ISP. (The reason is that the algorithm assumes that if one user on the system gets a particular piece of spam, then a significant number of other users on the system will receive the same spam.)

Despite this limitation, it is the only algorithm I know of that meets all of the following criteria:

Here is how the algorithm works

All incoming messages arriving at, say, HotMail, are categorized into "first-tier" and "second-tier" groups. Messages are grouped into first-tier groups based on the IP address that connected to HotMail's server to deliver the message; all messages delivered from the same IP address are considered part of the same first-tier group. Second-tier groups are determined by examining the mail headers and looking at the second-to-last IP address, that is, the machine that relayed the mail to the machine that finally delivered the message to HotMail. So if machine A delivers 20 messages to HotMail, and the headers indicate that 10 of those messages were relayed from machine B and 10 of those messages were relayed from machine C, then the messages would form two second-tier groups of 10 messages each, and all 20 messages would be in the same first-tier group.
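
A minimal sketch of this grouping step, in Python, may make the two tiers concrete. It assumes the header parsing has already been done, so each message carries the delivering IP address and the previous-hop IP address claimed in its headers. The names Message, group_messages, delivering_ip and previous_hop_ip are hypothetical, and keying second-tier groups by the previous-hop address alone is one possible reading of the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Message:
    msg_id: int
    delivering_ip: str              # IP that connected to the provider's mail server
    previous_hop_ip: Optional[str]  # second-to-last relay, as claimed in the headers


def group_messages(
    messages: List[Message],
) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
    """Return (first_tier, second_tier) maps from an IP address to message ids."""
    first_tier: Dict[str, List[int]] = defaultdict(list)
    second_tier: Dict[str, List[int]] = defaultdict(list)
    for m in messages:
        # First tier: keyed by the machine that actually delivered the message.
        first_tier[m.delivering_ip].append(m.msg_id)
        # Second tier: keyed by the relay named in the headers (assumed reading).
        if m.previous_hop_ip is not None:
            second_tier[m.previous_hop_ip].append(m.msg_id)
    return dict(first_tier), dict(second_tier)


# The example from the text: machine A delivers 20 messages,
# 10 relayed from machine B and 10 relayed from machine C.
A, B, C = "192.0.2.1", "198.51.100.2", "203.0.113.3"  # documentation-range IPs
messages = [Message(i, A, B if i < 10 else C) for i in range(20)]
first_tier, second_tier = group_messages(messages)
print(len(first_tier[A]), len(second_tier[B]), len(second_tier[C]))  # 20 10 10
```

All 20 messages land in one first-tier group, while the second-tier map holds two groups of 10.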

Of course, if machine A is an untrustworthy machine operated by a spammer, then in each message it delivers, it could forge the header naming the IP address from which the mail was supposedly relayed to machine A, and it could make up a different IP address for every message that it sends, so that they would all be in different second-tier groups. But they would still all be in the same first-tier group, because they were all delivered directly by machine A.

Now, for each group of messages (including both first-tier and second-tier groups), HotMail keeps track of: (a) how many messages in that group have been viewed by HotMail users, and (b) of those messages, how many have been reported as "spam", by users viewing the message and clicking the "Report as spam" button.
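
The bookkeeping for counts (a) and (b) could be sketched as follows. This is only an illustration under stated assumptions: GroupStats, GroupTracker and record_view are hypothetical names, group keys are tagged with "first" or "second" so the two tiers never collide, and a report is assumed to arrive together with the view that produced it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class GroupStats:
    viewed: int = 0    # (a) messages in the group that users have viewed
    reported: int = 0  # (b) of those, how many were reported as spam


class GroupTracker:
    def __init__(self) -> None:
        # Key is (tier, ip), e.g. ("first", "192.0.2.1").
        self.stats: Dict[Tuple[str, str], GroupStats] = defaultdict(GroupStats)

    def record_view(
        self,
        first_tier_ip: str,
        second_tier_ip: Optional[str],
        reported_as_spam: bool,
    ) -> None:
        """Update both groups a message belongs to when a user views it."""
        keys = [("first", first_tier_ip)]
        if second_tier_ip is not None:
            keys.append(("second", second_tier_ip))
        for key in keys:
            group = self.stats[key]
            group.viewed += 1
            if reported_as_spam:
                group.reported += 1


tracker = GroupTracker()
tracker.record_view("192.0.2.1", "198.51.100.2", reported_as_spam=True)
tracker.record_view("192.0.2.1", "203.0.113.3", reported_as_spam=False)
tracker.record_view("192.0.2.1", None, reported_as_spam=False)
```

After these three views, the first-tier group for 192.0.2.1 has 3 views and 1 report, while each second-tier group has a single view.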

Suppose, using example numbers, that if a newsletter is spam, then 10% of the HotMail users who view it will report it as spam. But if a newsletter is not spam, then only 1% of the users who view it will (accidentally or maliciously) report it as spam.

Once HotMail detects that, say, 100 of the messages in a given group have been read by users, it looks at the proportion that have been flagged as spam. If the proportion is around 1%, then HotMail can assume that the newsletter probably isn't spam. If the proportion is around 10%, then HotMail can assume that the newsletter is spam, and for all remaining users who have received a message in that group, the email is marked as spam and moved into their "Junk Mail" folder. Thus if a spammer sends a message to 10,000 HotMail addresses, only 100 of those HotMail users will have to see the message before HotMail determines that all messages in that group are spam and prevents the other 9,900 users from seeing it.
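
The decision rule can be sketched in a few lines. The text gives only the example rates of "around 1%" and "around 10%", so the 5% cutoff used here is an assumption chosen to sit between them, not a figure from the proposal; classify_group, min_views and spam_threshold are hypothetical names.

```python
def classify_group(
    viewed: int,
    reported: int,
    min_views: int = 100,           # sample size from the example in the text
    spam_threshold: float = 0.05,   # assumed cutoff between the 1% and 10% rates
) -> str:
    """Return "undecided", "spam" or "not spam" for one group of messages."""
    if viewed == 0 or viewed < min_views:
        return "undecided"          # not enough users have read the message yet
    rate = reported / viewed
    return "spam" if rate >= spam_threshold else "not spam"


print(classify_group(viewed=40, reported=4))    # undecided
print(classify_group(viewed=100, reported=10))  # spam
print(classify_group(viewed=100, reported=1))   # not spam

# The arithmetic from the text: 10,000 recipients, decision after 100 views.
recipients = 10_000
views_before_decision = 100
spared = recipients - views_before_decision
print(spared)  # 9900
```

A real deployment would presumably want a cutoff tuned on observed report rates, and might keep re-evaluating a group as more views come in rather than deciding once.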

Effects of the algorithm

Here is how the algorithm would play out in specific scenarios:

Feedback

Send any comments to bennett@peacefire.org -- thanks!