Opened 17 years ago

Closed 17 years ago

#140 closed task (fixed)

Problem Spiders Pulling Big Trac Changesets

Reported by: warmerdam Owned by: warmerdam
Priority: normal Milestone:
Component: SysAdmin Keywords:
Cc:

Description

Some spiders are still walking trac.osgeo.org, ignoring robots.txt, and end up pulling huge changesets, bringing www.osgeo.org to its knees.

Change History (2)

comment:1 by warmerdam, 17 years ago

I have applied the "Lay Spider Traps" pattern from:

http://www.leekillough.com/robots.html

I have added a spider trap link at the bottom of http://trac.osgeo.org/index.html pointing to bad.html (actual URL deliberately avoided!).
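Per the "Lay Spider Traps" pattern, such a trap is typically an anchor that is invisible to human visitors but still followed by crawlers that ignore robots.txt. A minimal sketch (the actual markup is not shown in this ticket, and "bad.html" is the placeholder name used here, not the real URL):

```
<!-- Hypothetical trap link: hidden from users, but harvested by
     misbehaving crawlers that ignore robots.txt. -->
<a href="/bad.html" style="display: none;">&nbsp;</a>
```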

This is redirected to /cgi-bin/bad.pl, a Perl script that adds the offending IP to /var/www/trac/forbidden_ips.txt along with a comment recording the user agent. All IPs in this file are then redirected to a 403 error by additional rewrite rules that use the file as a RewriteMap. The Apache configuration magic is in /etc/httpd/conf.d/hosts/trac.conf:

   ########################################
   # Spider Trap Magic: See http://trac.osgeo.org/osgeo/ticket/140
   RewriteEngine on
   RewriteMap  bad txt:/var/www/trac/forbidden_ips.txt
   RewriteCond ${bad:%{REMOTE_ADDR}|NOT-FOUND} !=NOT-FOUND
   RewriteRule .* - [F,L]

   RewriteEngine on
   RewriteRule ^/bad\.html$ /cgi-bin/bad.pl [L,T=application/x-httpd-cgi]

   <Directory "/var/www/trac/cgi-bin">
    AllowOverride None
    Options ExecCGI
    Order allow,deny
    Allow from all
   </Directory>

   # End of Spider Trap Magic.
   ########################################
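The `txt:` RewriteMap expects one "key value" pair per line, with `#` starting a comment line; the lookup `${bad:%{REMOTE_ADDR}|NOT-FOUND}` returns the value for the client IP, or NOT-FOUND if the IP is absent. A plausible forbidden_ips.txt might look like this (addresses and user agents are illustrative, not the actual banned entries):

```
# forbidden_ips.txt -- text RewriteMap: one "key value" pair per line.
# Keys are client IPs; the value just needs to differ from NOT-FOUND.
# UA: BadBot/1.0 (example)
192.0.2.10 banned
# UA: Grabby/2.3 (example)
198.51.100.7 banned
```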

The CGI script is /var/www/trac/cgi-bin/bad.pl.
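The ticket does not include the script body. The following Python sketch mirrors only what the description above says bad.pl does (the real script is Perl; the file path is taken from the ticket, everything else is assumed): record REMOTE_ADDR plus a user-agent comment in the RewriteMap file, so the client's next request hits the 403 rule.

```python
#!/usr/bin/env python3
"""Hypothetical re-creation of the logic described for bad.pl."""
import os

MAP_FILE = "/var/www/trac/forbidden_ips.txt"  # path from the ticket


def record_offender(ip, user_agent, map_file=MAP_FILE):
    """Append the IP as a RewriteMap entry, with the UA as a comment."""
    with open(map_file, "a") as f:
        f.write("# UA: %s\n" % user_agent)
        f.write("%s banned\n" % ip)


if __name__ == "__main__" and "REQUEST_METHOD" in os.environ:
    # Running as a CGI script: the server provides these variables.
    record_offender(os.environ.get("REMOTE_ADDR", "unknown"),
                    os.environ.get("HTTP_USER_AGENT", "unknown"))
    # Minimal response for the trapped client.
    print("Content-Type: text/plain\r\n\r\nForbidden.")
```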

Note that bad.html was added to /var/www/trac/robots.txt, so well-behaved crawlers never see the trap.
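The corresponding robots.txt entry would look like this (again using the placeholder name from this ticket), ensuring only robots that ignore robots.txt ever fetch the trap:

```
User-agent: *
Disallow: /bad.html
```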

comment:2 by warmerdam, 17 years ago

Resolution: fixed
Status: new → closed

Closing under the optimistic assumption that this will take care of the problem. I've primed forbidden_ips.txt with two known spider IPs.

Note: See TracTickets for help on using tickets.