This package will enable your web site to automatically ban bad web robots (aka web spiders) that ignore the robots.txt file. This does not include Googlebot and other well-behaved robots. The software requirements are Apache and PHP, but the principles would work with any web server setup.
Three of the most common bad robots are:
This package will protect against email-harvesting robots whether they follow robots.txt or not:
Update, December 2006:
So many people have started using bot-trap and other tools that ban bad robots that many email-harvesting robots now appear to follow robots.txt! This means that simply listing your contact page in robots.txt, as I do, will drastically cut the number of spammers that get your email address, even if you don't run bot-trap, and even if you use a direct mailto: link. A partial victory!
If you speak German, there is a nice repackaged version of bot-trap with a few additions at www.spider-trap.de.
To see bot-trap in action, go to the page on my site where bad robots go: http://danielwebb.us/bot-trap/index.php. You'll be banned for going where you weren't supposed to go (didn't you read the robots.txt file!?) Then go back to this page with the back button or type in the main page URL in the URL bar of your browser. Reload the page or you'll probably get the old page cached by your browser. The 403 page will allow you to unban yourself. Bad robots shouldn't be going to that link, because my robots.txt forbids it. You were a very bad robot indeed.
It is possible that someone is banned who shouldn't be. Perhaps a previous user of an IP address in a DHCP pool was a naughty user and ran a bad bot, but now the new user is banned. Not to worry, the custom "403 Forbidden" page allows any user to unban themselves by typing a requested word into a form box. Real people can easily do this, but bots can't!
<!-- Bad robot trap: Don't go here or your IP will be banned! --> <a href="/bot-trap/index.php"><img src="bot-trap/pixel.gif" border="0" alt=" " width="1" height="1"/></a>
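The hidden link only catches robots that ignore robots.txt, so it may help to see the check a well-behaved robot performs before following a link. The following is a minimal sketch in Python using the standard library's robots.txt parser; it is not part of bot-trap (which is PHP), and the robots.txt content and the `ExampleBot` user-agent name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The /bot-trap/ path matches the directory
# used in this document; the rest is assumed for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""


def may_fetch(url_path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if robots.txt allows user_agent to fetch url_path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    # A polite robot skips the trap link and crawls everything else.
    print(may_fetch("/bot-trap/index.php"))
    print(may_fetch("/index.html"))
```

A robot that runs this kind of check never requests the trap URL, which is why only robots that skip the check end up banned.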
Warning! Don't mess with this if you don't have the ability to fix things if this breaks them! If you mess up /.htaccess, your whole site could go down.
The directory for bot-trap is hard-coded as /bot-trap. To change this, you have to change all the instances of '/bot-trap' to your new directory.
I used the file locking mechanism flock(), and according to the PHP page comments on flock, you're bound to get a race condition eventually when two processes set the bot trap at the same time or two processes unban at the same time. Unfortunately, if the comments there are to be believed (and I don't know one way or the other), there is no way around this. I guess I figure if you have so many bots tripping the trap that you get race collisions, you've got bigger problems than race collisions.
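As an illustration of the locking pattern described above, here is a minimal sketch in Python rather than bot-trap's PHP. It appends a ban line to a file while holding an exclusive advisory lock. The `append_ban` function name, the `Deny from` line format, and the file path are assumptions for the example, and `fcntl` is only available on Unix-like systems. The sketch does not settle whether the race reported in the PHP comments can occur; it only shows the intended usage.

```python
import fcntl
import os
import tempfile


def append_ban(htaccess_path: str, ip: str) -> None:
    """Append a ban line for ip while holding an exclusive advisory lock."""
    with open(htaccess_path, "a") as f:
        # Blocks until no other process holds a lock on this file.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(f"Deny from {ip}\n")  # assumed line format
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


if __name__ == "__main__":
    # Demonstrate on a temporary file instead of a real .htaccess.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    append_ban(path, "192.0.2.1")
    append_ban(path, "198.51.100.7")
    with open(path) as f:
        print(f.read(), end="")
    os.remove(path)
```

Locks taken this way are advisory: they protect the file only if every process that writes to it takes the same lock first.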
bot-trap is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
bot-trap is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
I can be contacted at danielwebb.us.
This package was inspired by http://www.kloth.net/internet/bottrap.php.
Bot-trap looks old. Is it being updated?
Even good bots such as Googlebot are sometimes temporarily broken and ignore robots.txt. How do I keep them from getting banned?
Add IP addresses of known good bots to the top of the .htaccess file with the "Allow" keyword:
Allow from 184.108.40.206/18
Allow from 220.127.116.11/19
These lines whitelist some of the Google IP address space.
WARNING! Google has a service called Google Wireless Transcoder (GWT) that may allow spammers to bypass bot-trap via GWT if you whitelist Google! See this post at webmasterworld for more info. Bot-trap now shows the hostname in the alert email, so you can choose not to whitelist Google and instead check your email alerts occasionally to make sure someone important like Google or Yahoo hasn't been banned.
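To make the effect of an address range in an "Allow from" line more concrete, here is a minimal Python sketch of the membership test such a whitelist implies. It is not how Apache or bot-trap implements the check. The `WHITELIST` entries use documentation-only address ranges and, like the `is_whitelisted` name, are assumptions for illustration.

```python
import ipaddress

# Hypothetical whitelist: one documentation-only range plus localhost.
# strict=False accepts ranges written with host bits set, which the
# ipaddress module would otherwise reject.
WHITELIST = [
    ipaddress.ip_network(entry, strict=False)
    for entry in ("192.0.2.0/24", "127.0.0.1")
]


def is_whitelisted(ip: str) -> bool:
    """Return True if ip falls inside any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELIST)


if __name__ == "__main__":
    print(is_whitelisted("192.0.2.55"))
    print(is_whitelisted("198.51.100.1"))
```

A check like this can be handy for confirming which range an address from an alert email falls into before deciding whether to whitelist it.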
Why does my local machine keep getting banned when I use link-checkers?
Link-checkers may ignore robots.txt since they want to check the whole site. See the previous question and add:
Allow from 127.0.0.1
to whitelist your local machine.
Can't a bad bot get around bot-trap by just following robots.txt?
Yes. Bot-trap was originally designed to stop email harvesters. For this task bot-trap is extremely good. Another kind of bad bot is the scraper, where the bot copies all your pages and content to their own site to get ad revenue. If these scum follow robots.txt, bot-trap is powerless against them, and you'll need a different method to defend yourself.
What's to stop a bot on a rotating IP from filling your .htaccess with thousands or millions of "banned" IP addresses just to get back at you?
I'm not sure. I think bot-trap is probably underpowered for this kind of determined scumbag. Another bot trapping implementation that would be more immune to this type of attack is this mod_perl implementation. Instead of banning IP addresses by altering .htaccess, it has the Apache access routine call a Perl handler which holds the banned IP addresses in memory. Each time the Perl handler is restarted, the ban list is cleared.
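The in-memory approach can be sketched roughly as follows. This is a Python illustration of the idea, not the mod_perl code referred to above. The `BanList` class and its size cap are assumptions: the cap is one possible way to keep a flood of rotating IPs from growing the list without bound, and it is not something the source describes.

```python
from collections import OrderedDict


class BanList:
    """In-memory ban list; it starts empty whenever the process restarts."""

    def __init__(self, max_entries: int = 10000):
        self.max_entries = max_entries  # assumed cap, not from the source
        self._banned = OrderedDict()

    def ban(self, ip: str) -> None:
        self._banned[ip] = True
        self._banned.move_to_end(ip)
        # Evict the oldest bans once the cap is exceeded.
        while len(self._banned) > self.max_entries:
            self._banned.popitem(last=False)

    def unban(self, ip: str) -> None:
        self._banned.pop(ip, None)

    def is_banned(self, ip: str) -> bool:
        return ip in self._banned


if __name__ == "__main__":
    bans = BanList(max_entries=2)
    for ip in ("192.0.2.1", "192.0.2.2", "192.0.2.3"):
        bans.ban(ip)
    # The oldest entry was evicted to respect the cap.
    print(bans.is_banned("192.0.2.1"), bans.is_banned("192.0.2.3"))
```

The trade-off is that bans are forgotten on restart or eviction, so a persistent offender may have to trip the trap again.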
Bot-trap uses a hidden link behind a 1x1 pixel gif. Won't this incur some hidden link penalty with search engines?
Probably. Bot-trap uses 1x1 image links, which could theoretically hurt your site's rank in search engines because link-farmers often use this technique. A workaround is to create a 20x10 image with "Bot trap do not click", or create a text link that indicates it should not be clicked. Curious users will just get banned and then unban themselves. I have been using bot-trap since 2004 and I currently (December 2006) have the 4th result for the 3 terms "bad", "web", and "bot" in Google, so make of that what you will. All of my pages have two hidden links per page and most also have a PageRank of 4, which I think is respectable for a very low-traffic site. Search engine ranking algorithms are secret for obvious reasons, so I can't fully answer this question.