I’m working on a service that uses Twitter's streaming API, a great feature because it gives you instant access to Tweets. The downside is that you quickly get overloaded, and because I’m looking for Tweets that talk about money, I get a lot of noise.
Taking inspiration from TwitBlock, I filtered out a lot of it:
- Ignore tweets from new accounts (created less than 24 hours ago)
- Ignore tweets from users with default profile image
- Ignore if fewer than 10 followers
- Ignore if user description and name are blank
- Ignore if the user has fewer than 100 followers and friends_count > (2*followers_count)
- Ignore if the user has more than 100 followers and friends_count > (5*followers_count)
- Ignore if the user has sent more than 20 tweets per day on average since the account was created
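
The rules above can be sketched as a single predicate. This is only an illustration, not the code my service runs: the function name `is_noise` is made up, the field names follow the user object Twitter attaches to each tweet, and I assume `created_at` has already been parsed into a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def is_noise(user, now=None):
    """Return True if tweets from this user should be ignored.

    `user` is assumed to be a dict shaped like Twitter's user object,
    with `created_at` already parsed into a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    age = now - user["created_at"]
    followers = user["followers_count"]
    friends = user["friends_count"]

    # Account created less than 24 hours ago
    if age < timedelta(hours=24):
        return True
    # Default profile image
    if user.get("default_profile_image"):
        return True
    # Fewer than 10 followers
    if followers < 10:
        return True
    # Description and name are both blank
    description = (user.get("description") or "").strip()
    name = (user.get("name") or "").strip()
    if not description and not name:
        return True
    # Follows far more people than follow back
    if followers < 100 and friends > 2 * followers:
        return True
    if followers > 100 and friends > 5 * followers:
        return True
    # More than 20 tweets per day on average since creation
    # (age is at least 24 hours here, so days >= 1)
    days = age.total_seconds() / 86400
    if user["statuses_count"] / days > 20:
        return True
    return False
```

The thresholds are the ones listed above; tune them for your own stream.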
Some of these work for me but might not work for you at all. However, I think there could be a better approach for mass use. I ran mail servers for years, and a very reliable way to handle spam there is a DNSBL (also called an RBL). You could have different RBLs for different uses, and any Twitter client could implement this very easily. Note that this probably would not work for Direct Messages unless Twitter granted the service specific access, which they probably never will.
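
To give an idea of how little a client would need, here is a rough sketch of a lookup. Everything about the scheme is an assumption: no such service exists yet, the zone `twitterbl.example` is a placeholder, and I simply borrow the mail DNSBL convention where a listed entry resolves to an address in 127.0.0.0/8 and an unlisted one does not resolve at all.

```python
import socket


def is_listed(screen_name, zone="twitterbl.example", resolve=socket.gethostbyname):
    """Check a screen name against a hypothetical Twitter DNSBL zone.

    Assumed convention (borrowed from mail DNSBLs): a listed name
    resolves to an address in 127.0.0.0/8, an unlisted one does not resolve.
    """
    # Screen names are case-insensitive, so normalise before querying
    query = f"{screen_name.lower()}.{zone}"
    try:
        answer = resolve(query)
    except socket.gaierror:
        # No record (or the lookup failed): treat as not listed
        return False
    return answer.startswith("127.")
```

A client could then skip a tweet when `is_listed(tweet_author)` is true, and point `zone` at whichever list fits its use.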

Hi Fabien,
I think that would be a very clever way of doing it. I don’t think it would be very resource friendly, though (perhaps that could be ironed out). I’m now interested in link spam and whether or not the entries often show a high level of spam.
Scott.
Scott Wilcox
28 Feb 10 at 2:39 am
Scott,
If the service became very successful, it could be faster to just fetch all tweets through the streaming API and generate a spam score for every tweet, but that would of course be *very* CPU-intensive. It only makes sense if the service is hugely successful.
You could imagine using a very short timeout for the connection from twitterbl.com -> twitter.com and, if it times out, returning an empty JSON TXT record with a short TTL for that entry (I don’t know if DNS allows this), meaning the Twitter client has to request it again later (within seconds).
With an async DNS server based on node.js, maybe it could be fast and not too resource consuming.
There are probably ideas to copy from Akismet, SpamAssassin, and others.
Fabien Penso
28 Feb 10 at 4:31 am
We’re working on that actually. If you would like to join us
Nicolas
28 Feb 10 at 12:22 pm