Description of problem:
Currently in Pulp, packages are re-downloaded every time they need to be pulled down, per repository. If a package already exists on the file system, the existing file is not reused (after being validated to have the correct checksum). We are under the impression this used to work in a previous version of Pulp.

It is very advantageous to have this behavior, as within Katello:
* Multiple orgs may sync the same repo, resulting in multiple repositories within Pulp. A new org syncing the same repo should be much faster.
* During testing we can leave the contents of /var/lib/pulp/content alone and sync repositories much quicker after the initial sync.

Version-Release number of selected component (if applicable):
2.3

How reproducible:
Always

Steps to Reproduce:
1. Create a repo pointing to a very large upstream repo
2. Sync the repo
3. Create another repo pointing to the same very large upstream repo
4. Sync the 2nd repo

Actual results:
All packages are re-downloaded.

Expected results:
No packages should be re-downloaded, since they already exist on the file system.
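For illustration only, the requested behavior could look roughly like the sketch below: before fetching a package, check whether a file is already present at the target path and whether its checksum matches the one expected from the repo metadata. The function names `file_checksum` and `needs_download` are hypothetical and are not taken from the Pulp code base.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_checksum, algorithm="sha256"):
    """Return True unless a file with the expected checksum already exists.

    Hypothetical helper: a sync could call this per package and skip the
    HTTP request whenever it returns False.
    """
    if not os.path.isfile(path):
        return True
    return file_checksum(path, algorithm) != expected_checksum.lower()
```

With a check like this, a second repo with the same content would only pay the cost of reading and hashing the local file instead of the network transfer.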
I am not able to reproduce this bug. On my latest 2.4 setup, I synced 2 repos having some common packages and another repo with the same feed but a different relative URL. After 3 syncs:

$ locate pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/content/rpm/pulp-test-package/0.2.1/1.fc11/x86_64/4dbde07b4a8eab57e42ed0c9203083f1d61e0b13935d1a569193ed8efc9ecfd7/pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/published/yum/master/repo_resync_a/1397198983.61/pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/published/yum/master/repo_resync_aa/1397201069.65/pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/published/yum/master/repo_resync_b/1397200634.46/pulp-test-package-0.2.1-1.fc11.x86_64.rpm

The one with the path name starting with /var/lib/pulp/content/ is the downloaded package, and the other 3 are links pointing to the downloaded package for each repo.
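As a rough sketch of the layout seen in that locate output, the content path appears to be built from name, version, release, arch and checksum. The layout here is inferred from this single example path and is an assumption, and both `content_path` and `points_to` are hypothetical helpers, not Pulp functions.

```python
import os
import posixpath

# Root directory taken from the locate output above.
CONTENT_ROOT = "/var/lib/pulp/content/rpm"


def content_path(name, version, release, arch, checksum, filename):
    """Build the shared storage path for a package (layout is an assumption)."""
    return posixpath.join(CONTENT_ROOT, name, version, release, arch,
                          checksum, filename)


def points_to(link_path, target_path):
    """Return True if `link_path` is a symlink that resolves to `target_path`."""
    return (os.path.islink(link_path)
            and os.path.realpath(link_path) == os.path.realpath(target_path))
```

Because the path depends only on package attributes and not on the repo, two repos with the same package resolve to the same file, which is why the published entries can all be links to it.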
It isn't an issue of file system location; it appears to just re-download the package and write it to the same location. If you watch the httpd traffic logs you can see this. Here I set up 2 repos with the same content in different locations:

http://example.com/pub/repo-1
http://example.com/pub/repo-2

I synced the 1st repo and it generated the following in /var/log/httpd/access_log on the server hosting the yum repo:

172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/repomd.xml HTTP/1.1" 200 3413 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/b9e2410f78f3898e6ea509df05cd7e5f2d422839d22d69b87755c222240e1692-filelists.xml.gz HTTP/1.1" 200 32816 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/0178e97f6fa2a9e0bbc8bd7cab406917bd8038cd7f10d65659a56b2b30e0c82a-updateinfo.xml.gz HTTP/1.1" 200 389 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/6f11e67d7583b7ec9ee296d4398a87d22e26b0b79c909efbec0bd0b5cb5321a4-primary.xml.gz HTTP/1.1" 200 57004 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/6eeb54891225aa57e5c463cd83860589a3e61e0b8ff2acbce65731998e220287-other.xml.gz HTTP/1.1" 200 23192 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:24 -0700] "GET /pub/repo-1/euphonism-7.3.0-1.elfake.noarch.rpm HTTP/1.1" 200 81744 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:24 -0700] "GET /pub/repo-1/knavishly-8.5.7-1.elfake.noarch.rpm HTTP/1.1" 200 91984 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
....
172.31.1.95 - - [13/Apr/2014:12:43:32 -0700] "GET /pub/repo-1/Anguillaria-8.8.6-1.elfake.noarch.rpm HTTP/1.1" 200 4104 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:32 -0700] "GET /pub/repo-1/plotted-0.0.10-1.elfake.noarch.rpm HTTP/1.1" 200 102168 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:32 -0700] "GET /pub/repo-1/rampageous-4.3.1-1.elfake.noarch.rpm HTTP/1.1" 200 57056 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/campholide-4.2.4-1.elfake.noarch.rpm HTTP/1.1" 200 55000 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/phlebotome-1.0.7-1.elfake.noarch.rpm HTTP/1.1" 200 71528 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/.treeinfo HTTP/1.1" 404 297 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/treeinfo HTTP/1.1" 404 296 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"

Now syncing repo-2 after the 1st is finished, you can see it still downloads all the packages:

172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/repomd.xml HTTP/1.1" 200 3413 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/b9e2410f78f3898e6ea509df05cd7e5f2d422839d22d69b87755c222240e1692-filelists.xml.gz HTTP/1.1" 200 32816 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/0178e97f6fa2a9e0bbc8bd7cab406917bd8038cd7f10d65659a56b2b30e0c82a-updateinfo.xml.gz HTTP/1.1" 200 389 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/6f11e67d7583b7ec9ee296d4398a87d22e26b0b79c909efbec0bd0b5cb5321a4-primary.xml.gz HTTP/1.1" 200 57004 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/6eeb54891225aa57e5c463cd83860589a3e61e0b8ff2acbce65731998e220287-other.xml.gz HTTP/1.1" 200 23192 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:34 -0700] "GET /pub/repo-2/knavishly-8.5.7-1.elfake.noarch.rpm HTTP/1.1" 200 91984 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
...
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/Anguillaria-8.8.6-1.elfake.noarch.rpm HTTP/1.1" 200 4104 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/rampageous-4.3.1-1.elfake.noarch.rpm HTTP/1.1" 200 57056 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/campholide-4.2.4-1.elfake.noarch.rpm HTTP/1.1" 200 55000 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/phlebotome-1.0.7-1.elfake.noarch.rpm HTTP/1.1" 200 71528 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/.treeinfo HTTP/1.1" 404 297 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/treeinfo HTTP/1.1" 404 296 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"

So you still pay the network bandwidth cost of re-downloading all the same content even though the source feeds are different URLs.
This is the main complaint in the bug: it can add *significant* time when syncing large repositories that have the same content.
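To check this kind of access_log without reading it by eye, something like the following sketch could list the packages that were fetched from both feeds. It assumes the log format shown in this comment; the regular expression and the function names `rpm_downloads_by_repo` and `redownloaded` are my own and hypothetical.

```python
import re
from collections import defaultdict

# Matches the request path and status code of a GET entry in the access log.
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')


def rpm_downloads_by_repo(log_lines):
    """Map each repo directory to the set of .rpm names served with status 200."""
    downloads = defaultdict(set)
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("status") != "200":
            continue
        directory, _, filename = match.group("path").rpartition("/")
        if filename.endswith(".rpm"):
            downloads[directory].add(filename)
    return dict(downloads)


def redownloaded(log_lines, first_repo, second_repo):
    """Return the package file names that were fetched from both repos."""
    downloads = rpm_downloads_by_repo(log_lines)
    return downloads.get(first_repo, set()) & downloads.get(second_repo, set())
```

For example, `redownloaded(open("/var/log/httpd/access_log"), "/pub/repo-1", "/pub/repo-2")` would return the names that were transferred twice; an empty set would mean the second sync reused the existing content. Note that it matches on file name only, so it cannot tell apart two different builds that share a name.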
Agreed. I was able to see this as well on Friday and this is quite bad. Working on fixing it.
Not sure if this is the correct bug or whether I need a new BZ, but I see a probably related issue: pulp is downloading the same RPM twice and storing it in different paths, one with a sha1 checksum, the other with sha256. E.g.:

[nstrug@nstrug x86_64]$ ls -lR
.:
total 8
drwxr-xr-x. 3 nstrug nstrug 4096 Mar 17 23:53 2d03e1483a40fdfe7bb25a3aeaa3aaf02a798ec8d7ff42c74e2ec9f60d4c84a4
drwxr-xr-x. 3 nstrug nstrug 4096 Mar 18 11:52 48afb15cf7c5dc5f060a4947ad175903c504fc0e

./2d03e1483a40fdfe7bb25a3aeaa3aaf02a798ec8d7ff42c74e2ec9f60d4c84a4:
total 4
drwxr-xr-x. 2 nstrug nstrug 4096 Mar 18 16:19 Packages

./2d03e1483a40fdfe7bb25a3aeaa3aaf02a798ec8d7ff42c74e2ec9f60d4c84a4/Packages:
total 1128
-rw-r--r--. 1 nstrug nstrug 1152568 Mar 18 16:19 eclipse-oprofile-0.6.1-1.el6.x86_64.rpm

./48afb15cf7c5dc5f060a4947ad175903c504fc0e:
total 4
drwxr-xr-x. 2 nstrug nstrug 4096 Mar 19 13:08 Packages

./48afb15cf7c5dc5f060a4947ad175903c504fc0e/Packages:
total 1128
-rw-r--r--. 1 nstrug nstrug 1152568 Mar 19 13:08 eclipse-oprofile-0.6.1-1.el6.x86_64.rpm

Possibly to do with different checksumming schemes for the Server and Kickstart repos, but I'm not sure at the moment.
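One way to confirm that the two copies really are the same file is to hash both with the same algorithm and compare, as in this sketch. The 40-character directory name is the length of a sha1 hex digest and the 64-character one the length of a sha256 hex digest; the helper names `digests` and `same_content` are hypothetical.

```python
import hashlib


def digests(path, algorithms=("sha1", "sha256"), chunk_size=1024 * 1024):
    """Compute several hex digests of one file in a single pass."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def same_content(path_a, path_b):
    """Return True if both files have the same sha256 digest."""
    return digests(path_a, ("sha256",)) == digests(path_b, ("sha256",))
```

If `same_content` is true for the two eclipse-oprofile files, the duplication would come only from the repos advertising different checksum types for one package, not from different package contents.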
Nick, I've seen the same, but I do think it is a different issue. I'm also not sure whether the CDN should resolve it (by switching to sha256) or Pulp should resolve it by identifying that the packages are the same.
https://github.com/pulp/pulp_rpm/pull/472
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.
So far we are blocked by bug 1098195 and can't check this with a large repo, so we did it with a small repo.

On Fedora 20 (Heisenbug):

>>rpm -qa | grep pulp-admin
pulp-admin-client-2.4.0-0.14.beta.fc20.noarch

>>pulp-admin -u * -p * rpm repo create --repo-id zoo --feed http://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/
Successfully created repository [zoo]

>>pulp-admin -u * -p * rpm repo sync run --repo-id zoo
Task Succeeded

>>cat /var/log/httpd/access_log
109.68.191.26 - - [16/May/2014:11:22:17 +0000] "GET / HTTP/1.0" 403 4609 "-" "masscan/1.0 (https://github.com/robertdavidgraham/masscan)"
202.53.8.82 - - [16/May/2014:13:07:52 +0000] "GET /index.php?option=com_community HTTP/1.1" 404 207 "-" "-"

>>pulp-admin -u * -p * rpm repo create --repo-id newzoo --feed http://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/
Successfully created repository [zoo]
The server indicated one or more values were incorrect. The server provided the following error message:
Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]
More information can be found in the client log file ~/.pulp/admin.log.

>>cat ~/.pulp/admin.log
ERROR - Exception occurred:
href: /pulp/api/v2/repositories/
method: POST
status: 400
error: Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]
traceback: None
data: {u'args': [u'Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]'], u'error': {u'code': u'PLP0000', u'data': {}, u'description': u'Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]', u'sub_errors': []}}

>>pulp-admin -u admin -p admin rpm repo create --repo-id other-zoo --feed http://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/ --relative-url /my_custom_url/
Successfully created repository [newzoo]

>>pulp-admin -u admin -p admin rpm repo sync run --repo-id newzoo
Task Succeeded

>>cat /var/log/httpd/access_log
109.68.191.26 - - [16/May/2014:11:22:17 +0000] "GET / HTTP/1.0" 403 4609 "-" "masscan/1.0 (https://github.com/robertdavidgraham/masscan)"
202.53.8.82 - - [16/May/2014:13:07:52 +0000] "GET /index.php?option=com_community HTTP/1.1" 404 207 "-" "-"

So the logs are the same => it wasn't downloaded the second time. In fact, it wasn't downloaded the first time either, meaning we already had that package. Verified.
Tested in 2.4.0-0.14.beta:

1) First, repo1 was created and synced
2) Second, repo2 was created and synced, same feed but different relative URL

Result:

1) The timestamp on the file itself did not change.

After repo1 sync:
# ll -l /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rpm
-rw-r--r--. 1 apache apache 2438 May 19 09:25 /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rpm

After repo2 sync:
# ll -l /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rpm
-rw-r--r--. 1 apache apache 2438 May 19 09:25 /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rp

2) From /var/log/httpd/access_log: when syncing repo2, only repomd.xml and the repodata were checked. No packages were re-downloaded.

Attaching the logs.
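The timestamp comparison above could be automated for a whole content directory with a small sketch like this one. The helpers `snapshot` and `rewritten_files` are hypothetical names; an unchanged size and mtime only shows the file was not rewritten, so the access_log check is still needed to rule out a download that was discarded.

```python
import os


def snapshot(paths):
    """Record (size, mtime) for every path, e.g. before a sync."""
    result = {}
    for path in paths:
        info = os.stat(path)
        result[path] = (info.st_size, info.st_mtime)
    return result


def rewritten_files(before, after):
    """Return the paths whose size or mtime differs between two snapshots."""
    return sorted(path for path in before if after.get(path) != before[path])
```

Taking one snapshot after the repo1 sync and another after the repo2 sync, an empty list from `rewritten_files` corresponds to result 1) above.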
Created attachment 897075
/var/log/httpd/access_log
This has been fixed in Pulp 2.4.0-1.