Description of problem:
pulp-admin repo discovery generates rogue requests (10 Mbit/sec) to mirrors.

Version-Release number of selected component (if applicable):
pulp-1.1.11-1.el6.noarch

How reproducible:
[root@pulp pulp]# pulp-admin repo discovery --url http://centos.mirror.triple-it.nl/6.3/os/x86_64/ --type yum
Discovering urls with yum metadata, This could take some time...
2012-07-11 23:47:04,071 4070:140200049100544: pulp.server.webservices.controllers.services:INFO: services:457 Discovering compatible repo urls @ [http://centos.mirror.triple-it.nl/6.3/os/x86_64/]
Number of Urls Discovered (|): 97

It keeps going after 97, and even ctrl-c does not stop it. The only way to stop it is to stop pulp-server and delete the task from mongo. (All tips on other ways to handle this are welcome.)

Also tested with http://centos.mirror.triple-it.nl/6.3/os/x86_64/
It does not happen, however, with http://be.mirror.eurid.eu/centos/6.3/os/x86_64/

Steps to Reproduce:
1. pulp-admin repo discovery --url http://centos.cu.be/6.3/os/x86_64/ --type yum
2. Watch the counter go up
3. ctrl-c

Actual results:
a) A continuously growing number of found repos (100+)
b) A constant flow of requests to $url/6.3/os/x86_64/isolinux/../isolinux...
c) The repo admin complaining that I was sending rogue requests at 10 Mbit/sec

Expected results:
Detection of a couple of repos and the menu to select them.
I am able to reproduce this issue on a personal mirror at a slightly lower rate, but it still causes significant network traffic issues on AWS.
Basically, discovery grabs the HTML, extracts the anchor tags, and looks for specific metadata. In this case, it looks like the discovered page contains a '../' link, which turns out to be a valid URL to discover, so discovery now traverses the tree from there. This will obviously generate a whole bunch of requests as the tree is traversed. The URLs I tried without a '../' link, which is usually the case, work fine because the root is the URL we start with. I'll see if there is an elegant way to detect this, but discovery is doing what is expected based on how the tree is presented at the given URL.
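To illustrate the behavior described above, here is a minimal sketch of anchor extraction and link resolution. It is not Pulp's actual discovery code: the names `AnchorCollector` and `resolve_links` and the sample listing are hypothetical, and only the Python standard library is used. It shows that a '../' anchor resolves to a perfectly valid URL one level above the starting point, so a crawler that follows every anchor leaves the tree it was asked to discover.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def resolve_links(page_url, html):
    """Return the absolute URL for each anchor found in html."""
    collector = AnchorCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


# Hypothetical autoindex listing, modeled on a mirror that emits a
# relative parent-directory link.
SAMPLE_INDEX = (
    '<a href="../">Parent Directory</a> '
    '<a href="isolinux/">isolinux/</a> '
    '<a href="repodata/">repodata/</a>'
)

links = resolve_links("http://centos.cu.be/6.3/os/x86_64/", SAMPLE_INDEX)
for link in links:
    print(link)
```

The first resolved link points at the parent directory, outside the URL that discovery started from.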
I'll just add a touch more context: Many repositories are set up as a web server that just serves static files in a directory structure, with the web server generating indexes dynamically. The default behavior of common web servers such as Apache is to generate a link for the parent directory: http://httpd.apache.org/docs/2.2/mod/mod_autoindex.html One problem is that this link comes in different formats. If you look at the HTML source for the page below, you will see that the parent link is absolute, not relative: http://be.mirror.eurid.eu/centos/6.3/os/x86_64/ For this next repo, however, the link is relative ("../"): http://centos.cu.be/6.3/os/x86_64/ Since this is the default behavior of popular web servers, we should be able to identify and ignore parent directory links. And because of the differences shown above, we'll need to watch for both relative and absolute paths that would take us up the tree.
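One possible way to handle both link formats, sketched under assumptions rather than taken from Pulp's code: resolve every href to an absolute URL first, then keep it only if it stays strictly below the URL discovery started from. The function name `is_below_root` is hypothetical, and the absolute href in the usage lines is an illustrative guess at what such a parent link looks like.

```python
from urllib.parse import urljoin, urlsplit


def is_below_root(root_url, page_url, href):
    """Return True if href, resolved against page_url, stays strictly under root_url."""
    root = urlsplit(root_url)
    # Resolving first puts relative ("../") and absolute ("/centos/6.3/os/")
    # parent links into the same form before they are compared.
    target = urlsplit(urljoin(page_url, href))
    if (target.scheme, target.netloc) != (root.scheme, root.netloc):
        return False  # link leaves the mirror entirely
    root_path = root.path if root.path.endswith("/") else root.path + "/"
    # Reject the root itself as well, so a "../" found one level down
    # does not send the crawl back to a page it has already seen.
    return target.path.startswith(root_path) and target.path != root_path


root = "http://centos.cu.be/6.3/os/x86_64/"
print(is_below_root(root, root, "../"))        # relative parent link
print(is_below_root(root, root, "repodata/"))  # child directory

eurid = "http://be.mirror.eurid.eu/centos/6.3/os/x86_64/"
print(is_below_root(eurid, eurid, "/centos/6.3/os/"))  # absolute parent link (assumed form)
```

Comparing resolved paths rather than matching the literal string "../" is what lets a single check cover both mirrors.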
Pulp 2.X no longer has the repo discovery feature, so I am closing this bug.