Protect your site from spam bots

If you run your own website, you will soon come to realise what a nuisance spam bots are. They trawl your site collecting email addresses, consume your bandwidth and can cause other problems. There are, however, a number of ways to block many of them. The process I use is outlined below:

A) Directly block certain bots using regular expressions in the .htaccess file
B) Block certain IP addresses in the .htaccess file
C) Create traps by luring bad bots into sections of the website that are explicitly disallowed in the robots.txt file. Once a bot is caught, its IP address is added to the .htaccess file

Essentially, this system blocks known bots that cause problems, blocks known abusive IP addresses and is proactive in identifying new bad bots or IP addresses.
To implement a similar system, follow the steps below.

1. Creating our .htaccess file.
Add the content of the following file to the .htaccess file in the top-level/home directory of your site:

htaccess_example.txt
The .htaccess file uses regular expressions to block known malicious bots and bots that consume your bandwidth for their own commercial purposes. It also specifies a PHP script as the error page for 403 errors; we will use this script to keep a log of all blocked requests.
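If you don’t have the example file to hand, a minimal .htaccess along these lines should work. The user-agent list is purely illustrative, so extend it with the bots you actually want to block, and the access directives assume Apache 2.2-style syntax (on Apache 2.4 you need mod_access_compat or the equivalent Require directives):

  # Send all 403 errors to our logging script
  ErrorDocument 403 /403.php

  # Block known bad bots by user agent (illustrative list - extend as needed)
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (EmailCollector|EmailSiphon|WebBandit|HTTrack) [NC]
  RewriteRule .* - [F,L]

  # Block known abusive IP addresses; the trap script in step 5 appends
  # "Deny from x.x.x.x" lines below this block
  Order Allow,Deny
  Allow from all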

2. Creating our forbidden request error page.
Create a file called 403.php in your top-level/home directory. This script will make a log of all forbidden requests, such as those from blocked spam bots or IP addresses.

Add the code from the file below to the 403.php file you created:

403.txt
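If the download isn’t available, a minimal sketch of what 403.php could look like is below. It simply appends one entry per blocked request to forbidden.html (created in the next step) and then shows a short message:

  <?php
  // 403.php - log every forbidden request, then show a simple message.
  $entry = sprintf(
      "<li>%s | %s | %s | %s</li>\n",
      date('Y-m-d H:i:s'),
      $_SERVER['REMOTE_ADDR'],
      htmlspecialchars(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown'),
      htmlspecialchars($_SERVER['REQUEST_URI'])
  );

  // Append the entry to the log file created in step 3.
  file_put_contents(__DIR__ . '/forbidden.html', $entry, FILE_APPEND | LOCK_EX);

  header('HTTP/1.1 403 Forbidden');
  echo '<h1>403 Forbidden</h1><p>You do not have permission to access this server.</p>';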

3. Creating the 403 error log file.
At this point we have our .htaccess file blocking bad bots and our 403.php logging errors… but logging to what? We first need to create the file specified in the 403.php script: forbidden.html. Create this file in your top-level/home directory.

The file below contains the code that you should use for forbidden.html. Of course you may wish to change the styles etc.

forbidden.txt
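If you don’t have that file, a bare-bones skeleton along these lines will do. The closing tags are deliberately left off so that the entries appended by 403.php and the trap script simply grow the list at the end of the file:

  <html>
  <head>
    <title>Blocked requests</title>
    <style>
      body { font-family: sans-serif; }
      li { margin-bottom: 4px; }
    </style>
  </head>
  <body>
  <h1>Blocked requests</h1>
  <ul>
  <!-- 403.php and the trap script append one entry per blocked request below -->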

4. Setting up the robots.txt file.
Now we can block known bots and log their access attempts. However, we also need to detect new ones. We can do this by setting up a trap for them to fall into. First, create a directory in your top level and call it something interesting; this is where we will place the trap script. To prevent good bots from visiting it, we disallow it in the robots.txt file:

  User-agent: *
  Disallow: /your_trap_directory_here

Bad bots tend to ignore the rules in the robots.txt file, so you can be reasonably sure that only bad bots will end up in the trap.

Your robots.txt file should be placed in your top level/home directory.

5. Creating the trap script.
The trap will be executed whenever a bad bot visits the trap directory. Create a new file called index.php in your trap directory. The script below will send you an email, add a line to the .htaccess file to prevent future access, and log the hit to our forbidden.html file as in step 3. Change the “to” and “from” email addresses to your own.

The code you can use for index.php is in this file: index.txt
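If you can’t get hold of that file, the sketch below captures the idea. It assumes the trap directory sits directly under the site root, that your .htaccess accepts Apache 2.2-style “Deny from” lines, and that both email addresses are placeholders you will replace with your own:

  <?php
  // Trap script sketch - runs whenever something ignores robots.txt and visits this directory.
  $ip   = $_SERVER['REMOTE_ADDR'];
  $ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
  $root = dirname(__DIR__); // top-level/home directory, one level above the trap

  // 1. Deny all future requests from this IP address.
  //    Make sure .htaccess already ends with a newline, or this append can corrupt it.
  file_put_contents($root . '/.htaccess', "Deny from $ip\n", FILE_APPEND | LOCK_EX);

  // 2. Record the catch in blacklist.txt (one line per blocked IP - see step 6).
  file_put_contents($root . '/blacklist.txt', date('c') . " $ip $ua\n", FILE_APPEND | LOCK_EX);

  // 3. Log the hit to forbidden.html, just as the 403 page does.
  file_put_contents(
      $root . '/forbidden.html',
      '<li>' . date('Y-m-d H:i:s') . " | $ip | trap | " . htmlspecialchars($ua) . "</li>\n",
      FILE_APPEND | LOCK_EX
  );

  // 4. Email yourself about the catch - change both addresses to your own.
  mail(
      'you@yoursite.com',
      'Bad bot trapped',
      "IP: $ip\nUser agent: $ua\nTime: " . date('c'),
      'From: trap@yoursite.com'
  );

  header('HTTP/1.1 403 Forbidden');
  echo 'Forbidden.';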

6. Creating the blacklist log file.
The trap script logs every IP address it blocks to blacklist.txt, in addition to logging each hit to forbidden.html and appending the deny line to the .htaccess file. This way we can quickly see the full list of blocked IP addresses, rather than digging through every denied request in forbidden.html. Simply create an empty file in your top-level/home directory and call it blacklist.txt

7. Setting up links to lure bots.
Now that we have created our trap, we need to lure bots into it. To do this we create links to it in your pages. Obviously you don’t want users to follow these links, so they need to be hidden. There are a number of methods for doing this, and you should alternate between them. In the examples below, make the necessary changes, i.e. change http://www.yoursite.com/yourtrap/ to your own trap directory and write a little interesting blurb where I have placed “something random here” for the bots to look at.

Link Examples.txt
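If the examples file isn’t available, hidden links along the lines of the following work. Swap in your own trap directory, and vary the text and technique from page to page:

  <!-- Empty anchor: nothing visible for a human to click on -->
  <a href="http://www.yoursite.com/yourtrap/"></a>

  <!-- Hidden via an inline style -->
  <a href="http://www.yoursite.com/yourtrap/" style="display: none;">something random here</a>

  <!-- Hidden inside an invisible div -->
  <div style="display: none;">
    <a href="http://www.yoursite.com/yourtrap/">something random here</a>
  </div>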

8. Permissions and final things.
Set the trap’s index.php file to 644 permissions, and do the same for forbidden.html if you wish to view it online.

You can now test whether your site is blocking bad bots. Use Wannabrowser and select a user agent that is listed in the .htaccess file. Enter your site’s address and press submit; the result should be the 403 page, and if you browse to forbidden.html on your site you should see a new entry.
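If you prefer the command line, a spoofed user-agent request achieves the same test. “EmailCollector” here is just an example, so use a string that actually appears in your .htaccess:

  curl -I -A "EmailCollector" http://www.yoursite.com/

The response headers should show “403 Forbidden” rather than “200 OK”.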
If the test works, your site should be ready to handle bad bots and will hopefully save you from spam and lost bandwidth.

9. Possible workarounds to the system
There are of course ways a bot could get through this system. One known method is to spoof the user agent (which the .htaccess file relies on to block the bot) and then obey the rules in the robots.txt file. This means the bot gets through the first level of protection and then avoids the trap. I’m currently looking into how to catch bots that do this, but it’s difficult when they mimic a human. Possibly a good way is to make use of a real-time blackhole list (RBL) such as Spamhaus. One way you could do this is with a script like the one below:

  function isBlacklisted($ip) {
      // DNSBL zones to query; add further services to this array if you like.
      $services = array('xbl.spamhaus.org', '...');
      foreach ($services as $service) {
          // Reverse the IP's octets and append the zone, e.g. 1.2.3.4 -> 4.3.2.1.xbl.spamhaus.org.
          $check = join('.', array_reverse(explode('.', $ip))) . ".$service.";
          // A positive A-record lookup means the IP is listed.
          if (dns_check_record($check, 'A')) return true;
      }
      return false;
  }

In the script above you can add more RBL services to the array if you want to check against more than just Spamhaus.
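For example, you might call it near the top of a form handler, before doing anything else. This is just a sketch; adjust the response to suit your site:

  // Hypothetical usage: reject blacklisted visitors before processing a form.
  if (isBlacklisted($_SERVER['REMOTE_ADDR'])) {
      header('HTTP/1.1 403 Forbidden');
      exit('Forbidden.');
  }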

10. Acknowledgements
I used various sources when creating this system. The sites below are certainly worth checking if you want to make any tweaks or get some ideas:

http://www.kloth.net/internet/bottrap.php

http://www.ahref.com/guides/technology/200009/0922piou.html

http://perishablepress.com/press/2006/01/10/stupid-htaccess-tricks/#sec9

http://www.botsense.com

http://www.aaronlogan.com/downloads/htaccess.php

http://keithdevens.com/weblog/archive/2005/Nov/14/IP.blacklist

http://www.email-policy.com/Spam-black-lists.htm

18 thoughts on “Protect your site from spam bots”

  1. I suspect it would cause a performance hit to some degree. Perhaps it would be best to only protect certain pages rather than run the script for all requests. I’m going to try to implement it on my server and see if there are any noticeable differences in performance. If I get around to doing it, I’ll post my steps here.

  2. I don’t know what’s going on. I set everything up right (I think), but when an IP that isn’t on the list of known bots gets blocked, no matter who tries to visit the site, they get a “500 internal server” error, and it won’t go away until I delete all of the anti-spam-bot stuff :/

  3. Ryo,

    This is most likely caused by a problem when the trap script writes the IP address to the .htaccess file. Check your .htaccess file after you get the “500 internal server” error to see what’s going on. Make sure the .htaccess file ends with a blank line: the trap script appends at the end of the file, so if the last line already has data on it and no trailing newline, the appended “Deny from” line will run into it, making the .htaccess file invalid and causing your problem.

    Cheers,

    Michael

  4. Hi Kate,

    You shouldn’t need to repeat the process for each subdomain as long as you have done it for the main one, since the .htaccess file at the top level is the one being written to and it will then apply to any subdomain. The only other thing you might want to try is putting more of the hidden links from step 7 into your subdomain pages, as this will catch bots visiting your subdomains.

    Cheers,

    Michael

  5. Michael,

    Our site has been getting spammed (a form on our site) for the last few weeks. I have tried numerous solutions to stop it so far unsuccessfully.

    I just implemented this system and from what I can see it looks excellent but I am no expert.

    I realize this is an old post so if there are any updates or new info I would greatly appreciate it.

    regards,

    – Chris

  6. I implemented the above solution, but it is not working. The spam bot that has been attacking the site is still not blocked. In fact, it never went to the trap link, but keeps accessing the same URL and putting junk in the form.

    Is there any way to trap a bot that repeatedly accesses the same URL?

    Thank you

  7. Nice post on getting rid of unknown robots.
    Those robots carry no identification while they crawl;
    the only notification you get is “unknown robot crawling”.
    Thank you for getting me out of that issue.

  8. Thanks for the script. I get a lot of automatic requests by bots trying to game the website’s ranking system (wget requests every freakin second), and that keeps them off 🙂

  9. Great information! I especially liked the spamhaus info. It only took me a couple minutes to integrate. Ever considered creating a plugin for WordPress blogs that effectively automates the steps?
    I’ll probably use the steps here along with some of my own coding to create a PHP blacklist package using a central database and/or blacklist file that’s contributed to and shared between all my hosted sites.
    I’ll let you know if I get it up and running.


