If you run your own website, you will soon come to realise the nuisance that spam bots present. They trawl your site, collect email addresses, consume your bandwidth and can cause other problems. There are, however, a number of ways to block some of them. The process I use is outlined below:

A) Directly block certain bots using regular expressions in the .htaccess file
B) Block certain IP addresses in the .htaccess file
C) Create traps by luring bad bots into sections of the website that are explicitly disallowed in the robots.txt file. Once the bots are caught, add their IP addresses to the .htaccess file

Essentially, this system blocks known bots that cause problems, blocks known abusive IP addresses and is proactive in identifying new bad bots or IP addresses.
To implement a similar system follow the steps below.

1. Creating our .htaccess file.
Add the following content to the .htaccess file in the top-level/home directory of your site:

htaccess_example.txt
The .htaccess file makes use of regular expressions to block known malicious bots, or bots that consume your bandwidth for their own commercial purposes. It also specifies a PHP script as the error page for 403 errors. We will use this PHP script to keep a log of all blocked requests.
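As a rough illustration, a minimal .htaccess of this kind might look like the sketch below. The user agents shown are only a small sample of commonly blocked bots, and the placeholder IP is just an example; substitute the full rules from htaccess_example.txt for your own site:

  # Block requests whose user agent matches a known bad bot.
  # 403.php itself is excluded so the error page can still be served.
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
  RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
  RewriteCond %{HTTP_USER_AGENT} ^WebBandit [NC,OR]
  RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC]
  RewriteRule !^403\.php$ - [F,L]

  # Block abusive IP addresses; the trap script in step 5 appends to this list
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.1

  # Use our PHP script (step 2) as the error page for all 403 responses
  ErrorDocument 403 /403.php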

2. Creating our forbidden request error page.
Create a file called 403.php in your top-level/home directory. We will use this script to log all forbidden requests, such as those from blocked spam bots or blocked IP addresses.

Add the code from the file below to the 403.php file you created:

403.txt
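The original 403.txt is not reproduced inline here, so as a rough guide the sketch below shows the sort of script it contains: it appends a row for each blocked request to forbidden.html (created in step 3). The exact fields logged and the table layout are my assumptions; adapt them as you like.

  <?php
  // Collect details of the blocked request
  $ip      = $_SERVER['REMOTE_ADDR'];
  $agent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
  $request = isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '';
  $when    = date('Y-m-d H:i:s');

  // Append one table row per blocked request to forbidden.html (step 3)
  $entry = "<tr><td>$when</td><td>" . htmlspecialchars($ip) . "</td><td>"
         . htmlspecialchars($agent) . "</td><td>" . htmlspecialchars($request) . "</td></tr>\n";
  file_put_contents($_SERVER['DOCUMENT_ROOT'] . '/forbidden.html', $entry, FILE_APPEND);

  // Tell the client it is forbidden
  header('HTTP/1.1 403 Forbidden');
  echo '<h1>403 Forbidden</h1><p>You do not have permission to access this server.</p>';
  ?>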

3. Creating the 403 error log file.
At this point we have our .htaccess file blocking bad bots and our 403.php logging errors… but logging them to what? We first need to create the file specified in the 403.php script: forbidden.html. Create this file in your top-level/home directory.

The file below contains the code that you should use for forbidden.html. Of course, you may wish to change the styles and so on.

forbidden.txt
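Since forbidden.txt is not shown inline, a minimal version of forbidden.html could look like the sketch below: a bare page containing a table that 403.php (step 2) and the trap script (step 5) append rows to. The column layout here matches my 403.php sketch above and is only an assumption; restyle it as you wish. The closing tags are deliberately left out so that appended rows still land inside the table.

  <html>
  <head><title>Forbidden requests</title></head>
  <body>
  <h1>Blocked requests</h1>
  <table border="1">
  <tr><th>Date</th><th>IP address</th><th>User agent</th><th>Request</th></tr>
  <!-- log entries are appended below this line -->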

4. Setting up the robots.txt file.
Now we can block known bots and log their access attempts. However, we also need to detect new ones. We can do this by setting up a trap for them to fall into. The first thing we need to do is create a special directory. Create a directory in your top level and call it something interesting; this is where we will place the trap script. To prevent good bots from visiting it, we need to disallow it in the robots.txt file:

  User-agent: *
  Disallow: /your_trap_directory_here

Bad bots tend to ignore the rules in the robots.txt file, so you can be confident that only bad bots will end up being blocked this way.

Your robots.txt file should be placed in your top level/home directory.

5. Creating the trap script.
The trap will be executed whenever a bad bot visits the trap directory. Create a new file called index.php in your trap directory. The script below will send you an email, add a line to the .htaccess file to prevent future access, and add an entry to our forbidden.html file as in step 3. Change the 'to' and 'from' email fields to your own addresses.

The code you can use for index.php is in this file: index.txt
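Since index.txt is not reproduced inline, here is a minimal sketch of what such a trap script can look like. The file paths, email addresses and log formats below are assumptions that match the earlier sketches; in particular it assumes your .htaccess already contains an Order Allow,Deny / Allow from all block, so that appended Deny lines take effect.

  <?php
  $ip    = $_SERVER['REMOTE_ADDR'];
  $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
  $when  = date('Y-m-d H:i:s');
  $root  = $_SERVER['DOCUMENT_ROOT'];

  // 1. Deny the IP in the top-level .htaccess so future requests are blocked
  file_put_contents("$root/.htaccess", "\nDeny from $ip\n", FILE_APPEND);

  // 2. Record the IP once in the blacklist log (step 6)
  file_put_contents("$root/blacklist.txt", "$when $ip $agent\n", FILE_APPEND);

  // 3. Add an entry to forbidden.html, just as 403.php does (step 3)
  $entry = "<tr><td>$when</td><td>" . htmlspecialchars($ip) . "</td><td>"
         . htmlspecialchars($agent) . "</td><td>trap</td></tr>\n";
  file_put_contents("$root/forbidden.html", $entry, FILE_APPEND);

  // 4. Send yourself a notification -- change both addresses
  mail('you@yoursite.com', 'Bad bot trapped',
       "A bot fell into the trap on $when\nIP: $ip\nUser agent: $agent",
       'From: trap@yoursite.com');

  // Finally, answer the bot with a 403
  header('HTTP/1.1 403 Forbidden');
  echo '<h1>403 Forbidden</h1>';
  ?>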

6. Create the blacklist log file.
The trap script records every IP address it blocks in a blacklist, in addition to logging each hit to forbidden.html and adding a deny line to the .htaccess file. The blacklist gives us a single place to look up all blocked IP addresses, rather than the many individual denials recorded in forbidden.html. Simply create a file in your top-level/home directory and call it blacklist.txt

7. Setting up links to lure bots
Now that we have created our trap, we need to lure bots into it. To do this we need to add links to it in your pages. Obviously you don't want users to follow these links, so they need to be hidden. There are a number of methods to do this, and you should alternate between them. In the examples below, make the necessary changes, i.e. change http://www.yoursite.com/yourtrap/ to your own trap directory and write a little interesting blurb where I have placed “something random here” for the bots to look at.

Link Examples.txt
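Link Examples.txt is not reproduced here, so below are a few illustrative variations on the idea, roughly in the spirit of the original. The pixel.gif image name is an assumption; any tiny transparent image will do.

  <!-- An anchor with no visible link text -->
  <a href="http://www.yoursite.com/yourtrap/"><!-- something random here --></a>

  <!-- A link hidden with CSS -->
  <a href="http://www.yoursite.com/yourtrap/" style="display:none">something random here</a>

  <!-- A one-pixel transparent image used as the link body -->
  <a href="http://www.yoursite.com/yourtrap/"><img src="pixel.gif" alt="" width="1" height="1" border="0"></a>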

8. Permissions and Final things.
You will need to set the trap index.php file to 644 permissions, and do the same for forbidden.html should you wish to view it online.

You can now test whether your site is blocking bad bots. Make use of wannabrowser and select a user agent that is listed in the .htaccess file. Enter your site's address and press submit; the result should be the 403 page, and if you browse to forbidden.html on your site you should see a new entry.
If the test works, your site should be ready to handle bad bots and hopefully save you from spam and lost bandwidth.
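If you prefer the command line, you can run a similar test with curl by spoofing a blocked user agent (substitute one that actually appears in your .htaccess):

  curl -I -A "EmailCollector" http://www.yoursite.com/

The -I flag shows only the response headers, which should report a 403 Forbidden status.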

9. Possible workarounds to the system
There are of course ways a bot could get through this system. One known method is to spoof the user agent (which the .htaccess file uses to block the bot) and then obey the rules in the robots.txt file. This way the bot gets through the first level of protection and then avoids the trap. I'm currently looking into how to catch bots that do this, but it's difficult when they mimic a human. Possibly a good approach is to make use of a real-time blacklist (RBL) such as Spamhaus. One way you could do this is with a script like the one below:

  function isBlacklisted($ip){
      // RBL services to query; add more entries to check additional lists
      $services = array('xbl.spamhaus.org', '...');
      foreach($services as $service){
          // Reverse the octets of the IP and append the RBL zone,
          // e.g. 1.2.3.4 becomes 4.3.2.1.xbl.spamhaus.org.
          $check = join('.', array_reverse(explode('.', $ip))) . ".$service.";
          // An A record exists only if the IP is listed by that service
          if(dns_check_record($check, 'A')) return true;
      }
      return false;
  }

In the script above, you can add more RBL services to the array if you want to check against more than just Spamhaus.
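For example, you could call the function from a common include at the top of your normal pages, so a visitor whose IP is on a blacklist is refused even if it spoofs its user agent and avoids the trap. The include file name below is just a placeholder for wherever you keep the function:

  <?php
  require_once 'rbl_check.php'; // the file containing isBlacklisted()

  // Refuse the request immediately if the visitor's IP is on a real-time blacklist
  if (isBlacklisted($_SERVER['REMOTE_ADDR'])) {
      header('HTTP/1.1 403 Forbidden');
      exit('<h1>403 Forbidden</h1>');
  }
  ?>

Bear in mind that this performs a DNS lookup on every request, so you may want to cache the result per IP.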

10. Acknowledgements
I used various sources when creating this system. These are the sites I drew on, and they are certainly worth checking if you want to make any tweaks or get some ideas:

http://www.kloth.net/internet/bottrap.php

http://www.ahref.com/guides/technology/200009/0922piou.html

http://perishablepress.com/press/2006/01/10/stupid-htaccess-tricks/#sec9

http://www.botsense.com

http://www.aaronlogan.com/downloads/htaccess.php

http://keithdevens.com/weblog/archive/2005/Nov/14/IP.blacklist

http://www.email-policy.com/Spam-black-lists.htm