Bug 1085087 - Packages are re-downloaded for every repository
Summary: Packages are re-downloaded for every repository
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Pulp
Classification: Retired
Component: rpm-support
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 2.4.0
Assignee: Sayli Karmarkar
QA Contact: Ina Panova
URL:
Whiteboard:
Depends On:
Blocks: 950743 1085089
Reported: 2014-04-07 19:09 UTC by Justin Sherrill
Modified: 2015-03-23 01:12 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1085089 (view as bug list)
Environment:
Last Closed: 2014-08-09 06:54:45 UTC
Embargoed:


Attachments
/var/log/httpd/access_log (5.20 KB, text/plain)
2014-05-19 09:46 UTC, Ina Panova

Description Justin Sherrill 2014-04-07 19:09:07 UTC
Description of problem:

Currently in Pulp, packages are re-downloaded for every repository that needs them. If a package already exists on the file system, the existing file is not reused (after being validated to have the correct checksum).

We are under the impression this used to work in a previous version of pulp.

It is very advantageous to have this behavior as within katello:

* multiple orgs may sync the same repo resulting in multiple repositories within pulp.  A new org syncing the same repo should be much faster.
* During testing we can leave the contents of /var/lib/pulp/content alone and be able to sync repositories much much quicker after the initial sync.


Version-Release number of selected component (if applicable):

2.3


How reproducible:
Always

Steps to Reproduce:
1. Create a Repo pointing to a very large upstream repo
2. Sync the repo
3. Create another repo pointing to the same very large upstream repo
4. Sync the 2nd repo

Actual results:
All packages are re-downloaded

Expected results:
Packages should not be re-downloaded, since they already exist on the file system.
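
The expected behaviour could be sketched roughly as follows. This is only an illustration of the idea, not Pulp's actual code: the helper names `file_checksum` and `needs_download` are hypothetical, and SHA-256 is assumed as the default checksum type.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path, expected_checksum, algorithm="sha256"):
    """Hypothetical helper: True only if no valid local copy exists."""
    if not os.path.isfile(local_path):
        return True
    # A file is reused only when its checksum matches the one in the metadata.
    return file_checksum(local_path, algorithm) != expected_checksum
```

With a check like this, a second sync of the same content would only need to fetch the repository metadata, because every package already on disk with a matching checksum would be skipped.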

Comment 1 Sayli Karmarkar 2014-04-11 07:32:17 UTC
I am not able to reproduce this bug. On my latest 2.4 setup, I synced 2 repos that have some packages in common, plus another repo with the same feed but a different relative URL. After the 3 syncs:



$ locate pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/content/rpm/pulp-test-package/0.2.1/1.fc11/x86_64/4dbde07b4a8eab57e42ed0c9203083f1d61e0b13935d1a569193ed8efc9ecfd7/pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/published/yum/master/repo_resync_a/1397198983.61/pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/published/yum/master/repo_resync_aa/1397201069.65/pulp-test-package-0.2.1-1.fc11.x86_64.rpm
/var/lib/pulp/published/yum/master/repo_resync_b/1397200634.46/pulp-test-package-0.2.1-1.fc11.x86_64.rpm


The one with the path name starting with /var/lib/pulp/content/ is the downloaded package and the other 3 are links pointing to the downloaded package for each repo.
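
One way to confirm that a published path is a link to the single stored copy is sketched below. The helper name `links_to` is hypothetical and this is not part of Pulp; it only uses the standard library.

```python
import os


def links_to(published_path, content_path):
    """True if `published_path` is a symlink that resolves to `content_path`."""
    return (
        os.path.islink(published_path)
        and os.path.realpath(published_path) == os.path.realpath(content_path)
    )
```

Calling it with one of the /var/lib/pulp/published/ paths and the /var/lib/pulp/content/ path from the output above should return True if the layout is as described. Note that this only shows where the file is stored, not whether it was fetched over the network again.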

Comment 2 Mike McCune 2014-04-13 19:52:01 UTC
It isn't an issue of file system location; it appears to just re-download the package and write it to the same location. If you watch the httpd traffic logs you can see this. Here I set up 2 repos with the same content in different locations:

http://example.com/pub/repo-1
http://example.com/pub/repo-2

I synced the 1st repo and it generated the following in the /var/log/httpd/access_log on the server hosting the yum repo:


172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/repomd.xml HTTP/1.1" 200 3413 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/b9e2410f78f3898e6ea509df05cd7e5f2d422839d22d69b87755c222240e1692-filelists.xml.gz HTTP/1.1" 200 32816 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/0178e97f6fa2a9e0bbc8bd7cab406917bd8038cd7f10d65659a56b2b30e0c82a-updateinfo.xml.gz HTTP/1.1" 200 389 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/6f11e67d7583b7ec9ee296d4398a87d22e26b0b79c909efbec0bd0b5cb5321a4-primary.xml.gz HTTP/1.1" 200 57004 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:22 -0700] "GET /pub/repo-1/repodata/6eeb54891225aa57e5c463cd83860589a3e61e0b8ff2acbce65731998e220287-other.xml.gz HTTP/1.1" 200 23192 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:24 -0700] "GET /pub/repo-1/euphonism-7.3.0-1.elfake.noarch.rpm HTTP/1.1" 200 81744 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:24 -0700] "GET /pub/repo-1/knavishly-8.5.7-1.elfake.noarch.rpm HTTP/1.1" 200 91984 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
....
172.31.1.95 - - [13/Apr/2014:12:43:32 -0700] "GET /pub/repo-1/Anguillaria-8.8.6-1.elfake.noarch.rpm HTTP/1.1" 200 4104 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:32 -0700] "GET /pub/repo-1/plotted-0.0.10-1.elfake.noarch.rpm HTTP/1.1" 200 102168 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:32 -0700] "GET /pub/repo-1/rampageous-4.3.1-1.elfake.noarch.rpm HTTP/1.1" 200 57056 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/campholide-4.2.4-1.elfake.noarch.rpm HTTP/1.1" 200 55000 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/phlebotome-1.0.7-1.elfake.noarch.rpm HTTP/1.1" 200 71528 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/.treeinfo HTTP/1.1" 404 297 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:43:33 -0700] "GET /pub/repo-1/treeinfo HTTP/1.1" 404 296 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"



Now, syncing repo-2 after the 1st has finished, you can see it still downloads all the packages:

172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/repomd.xml HTTP/1.1" 200 3413 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/b9e2410f78f3898e6ea509df05cd7e5f2d422839d22d69b87755c222240e1692-filelists.xml.gz HTTP/1.1" 200 32816 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/0178e97f6fa2a9e0bbc8bd7cab406917bd8038cd7f10d65659a56b2b30e0c82a-updateinfo.xml.gz HTTP/1.1" 200 389 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/6f11e67d7583b7ec9ee296d4398a87d22e26b0b79c909efbec0bd0b5cb5321a4-primary.xml.gz HTTP/1.1" 200 57004 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:31 -0700] "GET /pub/repo-2/repodata/6eeb54891225aa57e5c463cd83860589a3e61e0b8ff2acbce65731998e220287-other.xml.gz HTTP/1.1" 200 23192 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:34 -0700] "GET /pub/repo-2/knavishly-8.5.7-1.elfake.noarch.rpm HTTP/1.1" 200 91984 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
...
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/Anguillaria-8.8.6-1.elfake.noarch.rpm HTTP/1.1" 200 4104 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/rampageous-4.3.1-1.elfake.noarch.rpm HTTP/1.1" 200 57056 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/campholide-4.2.4-1.elfake.noarch.rpm HTTP/1.1" 200 55000 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/phlebotome-1.0.7-1.elfake.noarch.rpm HTTP/1.1" 200 71528 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/.treeinfo HTTP/1.1" 404 297 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"
172.31.1.95 - - [13/Apr/2014:12:45:42 -0700] "GET /pub/repo-2/treeinfo HTTP/1.1" 404 296 "-" "python-requests/2.0.0 CPython/2.6.6 Linux/2.6.32-431.el6.x86_64"


So you still pay the network bandwidth cost of re-downloading all the same content even though the source feeds are different URLs. This is the main complaint in the bug: it can add *significant* time when syncing large repositories that have the same content.
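
For longer logs, the comparison above could be automated with a small script along these lines. It is a sketch that assumes the combined log format shown in this comment; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the request part of an access log line, e.g.
# "GET /pub/repo-1/foo-1.0-1.noarch.rpm HTTP/1.1" 200
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')


def rpm_downloads_by_repo(log_lines):
    """Map each repo directory to the .rpm file names fetched with status 200."""
    downloads = defaultdict(set)
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("status") != "200":
            continue
        path = match.group("path")
        if not path.endswith(".rpm"):
            continue
        repo, _, filename = path.rpartition("/")
        downloads[repo].add(filename)
    return dict(downloads)


def downloaded_more_than_once(log_lines):
    """Return file names that were fetched under more than one repo directory."""
    seen = defaultdict(set)
    for repo, filenames in rpm_downloads_by_repo(log_lines).items():
        for filename in filenames:
            seen[filename].add(repo)
    return {name for name, repos in seen.items() if len(repos) > 1}
```

Feeding it the lines of the upstream server's access_log after both syncs would list every package that was fetched for both repo-1 and repo-2. It matches on file name only, so it assumes the two repos hold identical files under the same names.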

Comment 3 Sayli Karmarkar 2014-04-14 15:52:50 UTC
Agreed. I was able to see this as well on Friday and this is quite bad. Working on fixing it.

Comment 4 Nick Strugnell 2014-04-16 11:58:47 UTC
Not sure if this is the correct bug or whether I need a new BZ, but I see a probably related issue: pulp is downloading the same RPM twice and storing it in different paths, one with a sha1 checksum, the other with sha256.

e.g:

[nstrug@nstrug x86_64]$ ls -lR
.:
total 8
drwxr-xr-x. 3 nstrug nstrug 4096 Mar 17 23:53 2d03e1483a40fdfe7bb25a3aeaa3aaf02a798ec8d7ff42c74e2ec9f60d4c84a4
drwxr-xr-x. 3 nstrug nstrug 4096 Mar 18 11:52 48afb15cf7c5dc5f060a4947ad175903c504fc0e

./2d03e1483a40fdfe7bb25a3aeaa3aaf02a798ec8d7ff42c74e2ec9f60d4c84a4:
total 4
drwxr-xr-x. 2 nstrug nstrug 4096 Mar 18 16:19 Packages

./2d03e1483a40fdfe7bb25a3aeaa3aaf02a798ec8d7ff42c74e2ec9f60d4c84a4/Packages:
total 1128
-rw-r--r--. 1 nstrug nstrug 1152568 Mar 18 16:19 eclipse-oprofile-0.6.1-1.el6.x86_64.rpm

./48afb15cf7c5dc5f060a4947ad175903c504fc0e:
total 4
drwxr-xr-x. 2 nstrug nstrug 4096 Mar 19 13:08 Packages

./48afb15cf7c5dc5f060a4947ad175903c504fc0e/Packages:
total 1128
-rw-r--r--. 1 nstrug nstrug 1152568 Mar 19 13:08 eclipse-oprofile-0.6.1-1.el6.x86_64.rpm


Possibly to do with different checksumming schemes for the Server and Kickstart repos, but I'm not sure at the moment.
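
The two directory names in the listing above differ in length (40 and 64 hex characters), which is consistent with SHA-1 and SHA-256 digests. A sketch of how one might check that both copies are the same file under either checksum type follows. The function names are hypothetical, and guessing the type from the digest length is only a heuristic.

```python
import hashlib

# Hex digest lengths of the two checksum types seen in the listing above.
DIGEST_LENGTHS = {40: "sha1", 64: "sha256"}


def checksum_type(hex_digest):
    """Guess the checksum type from the digest length, or None if unknown."""
    return DIGEST_LENGTHS.get(len(hex_digest))


def all_checksums(path, chunk_size=1024 * 1024):
    """Return the sha1 and sha256 hex digests of one file, reading it once."""
    digests = {name: hashlib.new(name) for name in DIGEST_LENGTHS.values()}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

If `all_checksums` returns the same pair of digests for both copies of the RPM, the files are identical and only the checksum type used in the storage path differs.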

Comment 5 Justin Sherrill 2014-04-16 13:03:55 UTC
Nick, I've seen the same too but I do think it is a different issue.  I'm also not sure if the cdn should resolve it (by switching to sha256) or pulp should resolve it by identifying that the packages are the same.

Comment 6 Sayli Karmarkar 2014-04-22 09:04:25 UTC
https://github.com/pulp/pulp_rpm/pull/472

Comment 7 Randy Barlow 2014-04-24 20:28:38 UTC
The fix for this bug is included in the 2.4.0-0.10.beta build that was just published to the Pulp fedorapeople.org repository.

Comment 8 Irina Gulina 2014-05-16 15:13:10 UTC
So far we are blocked by bug 1098195 and can't check this with a large repo, so we did it with a small repo:

On Fedora 20 (Heisenbug):

>>rpm -qa | grep pulp-admin
pulp-admin-client-2.4.0-0.14.beta.fc20.noarch

>>pulp-admin -u * -p * rpm repo create --repo-id zoo --feed http://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/
Successfully created repository [zoo]

>>pulp-admin -u * -p * rpm repo sync run --repo-id zoo
Task Succeeded

>>cat /var/log/httpd/access_log
109.68.191.26 - - [16/May/2014:11:22:17 +0000] "GET / HTTP/1.0" 403 4609 "-" "masscan/1.0 (https://github.com/robertdavidgraham/masscan)"
202.53.8.82 - - [16/May/2014:13:07:52 +0000] "GET /index.php?option=com_community HTTP/1.1" 404 207 "-" "-"

>>pulp-admin -u * -p * rpm repo create --repo-id newzoo --feed http://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/
Successfully created repository [zoo]

The server indicated one or more values were incorrect. The server provided the
following error message:

   Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo]
conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for
repository [zoo]

More information can be found in the client log file ~/.pulp/admin.log.
>>cat ~/.pulp/admin.log
ERROR - Exception occurred:
        href:      /pulp/api/v2/repositories/
        method:    POST
        status:    400
        error:     Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]
        traceback: None
        data:      {u'args': [u'Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]'], u'error': {u'code': u'PLP0000', u'data': {}, u'description': u'Relative URL [repos/pulp/pulp/demo_repos/zoo/] for repository [newzoo] conflicts with existing relative URL [/repos/pulp/pulp/demo_repos/zoo/] for repository [zoo]', u'sub_errors': []}}

>>pulp-admin -u admin -p admin rpm repo create --repo-id other-zoo --feed http://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/ --relative-url /my_custom_url/
Successfully created repository [newzoo]

>>pulp-admin -u admin -p admin rpm repo sync run --repo-id newzoo
Task Succeeded

>>cat /var/log/httpd/access_log
109.68.191.26 - - [16/May/2014:11:22:17 +0000] "GET / HTTP/1.0" 403 4609 "-" "masscan/1.0 (https://github.com/robertdavidgraham/masscan)"
202.53.8.82 - - [16/May/2014:13:07:52 +0000] "GET /index.php?option=com_community HTTP/1.1" 404 207 "-" "-"

The logs are the same, so it wasn't downloaded the second time. Actually, it wasn't downloaded the first time either, meaning we already had that package.

Verified.

Comment 9 Ina Panova 2014-05-19 09:42:14 UTC
Tested in 2.4.0-0.14.beta:
1) First, repo1 was created and synced
2) Second, repo2 was created and synced, with the same feed but a different relative URL

Result:
1) timestamp on the file itself did not change
After repo1 sync
# ll -l /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rpm 
-rw-r--r--. 1 apache apache 2438 May 19 09:25 /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rpm

After repo2 sync
# ll -l /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rpm
-rw-r--r--. 1 apache apache 2438 May 19 09:25 /var/lib/pulp/content/rpm/bear/4.1/1/noarch/7a831f9f90bf4d21027572cb503d20b702de8e8785b02c0397445c2e481d81b3/bear-4.1-1.noarch.rp


2) From /var/log/httpd/access_log: when syncing repo2, only repomd.xml and the repodata files were checked. No packages were re-downloaded. Attaching the logs.
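
The timestamp comparison in 1) could be scripted for a whole content directory along these lines. This is a sketch with hypothetical function names. An unchanged modification time is evidence that a file was not rewritten, not proof that no network request was made, so it complements the access log check rather than replacing it.

```python
import os


def snapshot_mtimes(paths):
    """Record the modification time of each existing path."""
    return {path: os.stat(path).st_mtime for path in paths if os.path.exists(path)}


def rewritten_files(before, after):
    """Return paths whose modification time differs between two snapshots."""
    return sorted(
        path for path, mtime in after.items()
        if path in before and before[path] != mtime
    )
```

Taking one snapshot after the repo1 sync and another after the repo2 sync, an empty result from `rewritten_files` would match the observation above.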

Comment 10 Ina Panova 2014-05-19 09:46:44 UTC
Created attachment 897075
/var/log/httpd/access_log

Comment 11 Randy Barlow 2014-08-09 06:54:45 UTC
This has been fixed in Pulp 2.4.0-1.

