Description of Problem:
When recursively descending large web sites (mirroring), wget will
sometimes get into a state where it seems to be stuck in a loop (I
usually use -nc, so this consists of "xxx already there, will not
retrieve; yyy already there, will not retrieve; xxx already...").
I can't provide a meaningful test case that is of usable size. I'm going
to go on a spelunking mission in the source to see if I can track it down
using just my understanding of the source code (I've patched this sucker
a number of times, so I'm already familiar).
Right now, I suggest you either (1) wait a few days until I either come up
with something or concede defeat, or (2) try the same course.
TODO: see the most recent issue of MSDN. There's an article about a
program for IIS (which would translate to a simple module or CGI for a
real web server) that uses statistical analysis to kick mirroring
programs off the site. I'm going to patch wget to select a random number
from a range of "time-to-wait-between-connect" values to defeat it. Also,
note the irresponsible and brain-dead suggestion that it would be best to
only use the class C net address of the client to identify visitors,
since DHCP defeats it otherwise. E.g.: I have a 4-bit netblock on a
registered domain (geeksrus.net). Someone else I've never heard of on the
same class C gets me banned from a website. Or you're on DHCP, so one guy
gets 252 other people banned. Nice.
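The random-wait patch I have in mind would amount to something like the following sketch (not an actual patch against wget's source; the function and parameter names are made up for illustration):

```c
#include <stdlib.h>

/* Sketch: pick a delay uniformly from [min_wait, max_wait] seconds to
   use between connects, so the request intervals look less regular to
   a server doing statistical analysis on visitor timing. */
unsigned random_wait(unsigned min_wait, unsigned max_wait)
{
    return min_wait + (unsigned)(rand() % (max_wait - min_wait + 1));
}
```

Seed once with srand(time(NULL)) at startup, then sleep(random_wait(min, max)) before each connect; nonuniform gaps should defeat naive interval analysis of the kind the article describes.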
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Just a test to see if I can submit a followup w/o logging in again.
I need a meaningful test case to look into it...
Still need a test case... I've never seen it happen.
Closing due to lack of feedback and reproducibility.