Description of Problem:
When recursively descending large web sites (mirroring), wget will sometimes get into a state where it seems to be stuck in a loop (I usually use -nc, so this consists of "xxx already there, will not retrieve; yyy already there, will not retrieve; xxx already..."). I can't provide a meaningful test case that is of usable size. I'm going to go on a spelunking mission in the source to see if I can track it down using just my understanding of the source code (I've patched this sucker a number of times, so I'm already familiar). Right now, I suggest you either (1) wait a few days until I either come up with something or concede defeat, or (2) try the same course.

TODO: see the most recent issue of MSDN. There's an article about a program for IIS (which would translate to a simple module or CGI for a real web server) that uses statistical analysis to kick mirroring programs off the site. I'm going to patch wget to select a random number from a range of "time-to-wait-between-connects" values to defeat it. Also, note the irresponsible and brain-dead suggestion that it would be best to use only the class C network address of the client to identify visitors, since DHCP defeats it otherwise. E.g.: I have a 4-bit netblock on a registered domain (geeksrus.net). Someone else I've never heard of on the same class C gets me banned from a website. Or you're on DHCP, so one guy bans 252 other people. Nice.

Version-Release number of selected component (if applicable):

How Reproducible:

Steps to Reproduce:
1.
2.
3.

Actual Results:

Expected Results:

Additional Information:
Just a test to see if I can submit a followup w/o logging in again.
I need a meaningful test case to look into it...
Still needing a test case... I've never seen it happen.
Closing due to lack of feedback and reproducibility.