Protection against the spiders

Recently I've been using Apache::MiniWiki for, well, a Wiki. With a wiki, there is an [Edit] link on every page. This takes one to a textarea where one can edit the page's text using a really simple non-HTML wiki markup syntax. Somehow a bot (written in Java, of all things) managed to follow the Edit href and submit the empty textarea. In a matter of minutes it spidered the entire site, leaving all the pages blank. Fortunately all the pages are stored in RCS (via Rcs.pm), so rolling back was simple. I then added code to force text to be entered in the textarea before a page can be saved.
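
A rough sketch of that check, in the usual mod_perl handler style (the 'text' parameter name and the two helper subs are placeholders, not the actual Apache::MiniWiki internals):

    use strict;
    use Apache::Constants qw(OK);

    sub save_page {
        my ($r, $page) = @_;

        # Parse the form the same way for GET and POST submissions.
        my %form = ($r->method eq 'POST') ? $r->content : $r->args;

        # Refuse to commit a revision that is empty or whitespace-only,
        # which is exactly what the Java bot submitted.
        unless (defined $form{text} && $form{text} =~ /\S/) {
            return show_error($r, 'Refusing to save an empty page.');
        }

        checkin_revision($page, $form{text});   # check the new text into RCS
        return OK;
    }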

Because of this, I added the ability to view any revision of a page, and to revert back to any version. Now the robots (Google, Scooter, Openfind, etc.) were retrieving all the old revisions and following the links to revert to them. Reverting (merging) to an old version generates a new revision number... so they got trapped, and spent all day busily reverting pages, looking at the new pages, and so on. As soon as I noticed this behaviour, I added the save/revert/log URLs to robots.txt, but most spiders don't reload robots.txt very often.
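
The robots.txt additions amount to something like the following; the paths here are placeholders, since the real ones depend on how the wiki's URLs are laid out:

    User-agent: *
    Disallow: /wiki/save
    Disallow: /wiki/revert
    Disallow: /wiki/log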

I've been considering writing an Access handler that will attempt to detect whether the browser is being driven by a human or not. It would only be used for URLs blocked in robots.txt; a rough sketch of such a handler is given below. The first time somebody comes to a 'protected' URL, they get presented with a simple test to see whether they are human or not. Possibilities for a test include:

Or possibly incorporate some of the tactics from:

http://www.neilgunton.com/spambot_trap/
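
Here is a rough sketch of the kind of Access handler I have in mind, for mod_perl 1.x. The package name, cookie name and /human-test URL are all made up for illustration: the idea is that the test page hands out a cookie once the visitor passes, and the handler only lets cookie-carrying clients through to the protected URLs.

    package My::HumanGate;

    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    sub handler {
        my $r = shift;

        # A client that has already passed the test carries the cookie
        # that the test page set; robots generally don't keep cookies.
        my $cookies = $r->header_in('Cookie') || '';
        return OK if $cookies =~ /\bpassed_human_test=1\b/;

        # Anyone else gets the test page instead of the protected URL.
        $r->custom_response(FORBIDDEN, '/human-test');
        return FORBIDDEN;
    }

    1;

It would be attached only to the URLs that are already blocked in robots.txt, e.g.:

    <LocationMatch "/wiki/(save|revert|log)">
        PerlAccessHandler My::HumanGate
    </LocationMatch>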


See also MiniWiki