Blog them all

Fabien Penso’s blog

How to filter spam on Twitter efficiently: TwitteRBL?


I’m working on a service that uses Twitter’s streaming API, a great feature because it gives you instant access to tweets. The downside is that you quickly get overloaded, and because I’m looking for tweets that talk about money, I get lots of noise.

After looking at TwitBlock, I filtered out a lot of it with the following rules:

  1. Ignore tweets from recent users (account created less than 24 hours ago)
  2. Ignore tweets from users with the default profile image
  3. Ignore if the user has fewer than 10 followers
  4. Ignore if the user description and name are both blank
  5. Ignore if the followers count is under 100 and the friends count is > (2*followers_count)
  6. Ignore if the followers count is over 100 and the friends count is > (5*followers_count)
  7. Ignore if the user has sent more than 20 tweets per day on average since the account was created
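The rules above can be sketched as a single predicate. This is only a minimal illustration, not the code I run: it assumes the user is a plain dict whose keys mirror the field names of the Twitter user object (`followers_count`, `friends_count`, `statuses_count`, `default_profile_image`, `description`, `name`), and that `created_at` has already been parsed into a timezone-aware `datetime`. The function name `looks_like_spam` is made up for the example.

```python
from datetime import datetime, timedelta, timezone


def looks_like_spam(user: dict, now: datetime) -> bool:
    """Return True if the account trips any of the seven rules above.

    Assumption: `user` uses the Twitter user object field names and
    `user["created_at"]` is already a timezone-aware datetime.
    """
    followers = user["followers_count"]
    friends = user["friends_count"]
    age = now - user["created_at"]

    if age < timedelta(hours=24):                       # rule 1
        return True
    if user["default_profile_image"]:                   # rule 2
        return True
    if followers < 10:                                  # rule 3
        return True
    if not user["description"] and not user["name"]:    # rule 4
        return True
    if followers < 100 and friends > 2 * followers:     # rule 5
        return True
    if followers > 100 and friends > 5 * followers:     # rule 6
        return True

    # Rule 7: average tweets per day since the account was created.
    # The account is at least 24 hours old here, so days >= 1.
    days = age / timedelta(days=1)
    return user["statuses_count"] / days > 20


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {
        "created_at": now - timedelta(days=100),
        "default_profile_image": False,
        "followers_count": 150,
        "friends_count": 200,
        "statuses_count": 500,
        "description": "hi",
        "name": "Bob",
    }
    print(looks_like_spam(example, now))
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to tune the thresholds against a saved sample of tweets.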

Some of these work for me but might not work for you at all. However, I think there could be a better approach for mass use. I ran mail servers for years, and a very reliable way to handle spam there is a DNSBL (also called an RBL). You could have different RBLs for different uses, and any Twitter client could implement this very easily. Please note that this would probably not work for Direct Messages unless Twitter grants the service specific access, which it probably never will.
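To make the idea concrete, here is a rough sketch of what a client-side lookup could look like if it followed the usual mail DNSBL convention: the client resolves a name under the list's zone, an answer in 127.0.0.0/8 means "listed", and a failed lookup means "not listed". Everything specific here is an assumption for illustration: the zone `spam.twitterbl.example` is hypothetical, and so is the choice of the numeric user ID as the DNS label (picked because screen names can contain underscores, which are not valid in host names).

```python
import socket
from typing import Callable


def dnsbl_query_name(user_id: int, zone: str = "spam.twitterbl.example") -> str:
    """Build the DNS name to look up for a given Twitter user ID.

    Both the zone and the "<user_id>.<zone>" layout are hypothetical.
    """
    return f"{user_id}.{zone}"


def is_listed(
    user_id: int,
    zone: str = "spam.twitterbl.example",
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if the list answers with an address in 127.0.0.0/8.

    `resolve` can be swapped out, so the logic can be tried without
    touching the network.
    """
    try:
        address = resolve(dnsbl_query_name(user_id, zone))
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the
        # name does not resolve, which here means "not listed".
        return False
    return address.startswith("127.")


if __name__ == "__main__":
    # Stand-in resolver: pretend that only user 42 is on the list.
    def fake_resolve(name: str) -> str:
        if name == "42.spam.twitterbl.example":
            return "127.0.0.2"
        raise socket.gaierror("name not found")

    print(is_listed(42, resolve=fake_resolve))
    print(is_listed(7, resolve=fake_resolve))
```

Running a separate zone per kind of list would give the "different RBLs for different uses" mentioned above: the client only has to change the `zone` argument.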


Written by Fabien Penso

February 28th, 2010 at 2:25 am

Posted in computer


3 Responses to 'How to filter spam on Twitter efficiently: TwitteRBL?'


  1. Hi Fabien,

    I think that would be a very clever way of doing it. I don’t think it would be very resource friendly, though perhaps that could be ironed out. I’m now interested in linkspam and whether or not the entries often show a high level of spam.

    Scott.

    Scott Wilcox

    28 Feb 10 at 2:39 am

  2. Scott,

    If the service became very successful, it could be faster to just fetch all tweets through the streaming API and generate a spam score for every tweet, but that would of course be *very* CPU intensive. It only makes sense if such a service is a huge success.

    You could imagine using a very short timeout for the connection from twitterbl.com -> twitter.com, giving the entry a short TTL (I don’t know if DNS allows this), and returning an empty JSON TXT record, meaning the Twitter client has to request it again later (within seconds).

    With an async DNS server based on node.js, maybe it could be fast without consuming too many resources.

    There are probably ideas to borrow from Akismet, SpamAssassin, and others?

    Fabien Penso

    28 Feb 10 at 4:31 am

  3. We’re working on that actually. If you would like to join us

    Nicolas

    28 Feb 10 at 12:22 pm
