Bug 681866 - adding packages to a repo is very slow
Summary: adding packages to a repo is very slow
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Pulp
Classification: Retired
Component: z_other
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: Sprint 21
Assignee: John Matthews
QA Contact: Preethi Thomas
URL:
Whiteboard:
Depends On:
Blocks: 563609
Reported: 2011-03-03 13:14 UTC by Daniel Mach
Modified: 2012-02-24 20:13 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-02-24 20:13:23 UTC


Attachments
performance patch (7.83 KB, patch)
2011-03-08 15:39 UTC, Daniel Mach

Description Daniel Mach 2011-03-03 13:14:17 UTC
Adding packages to a repo works this way:

pids = []
for pkg in pkglist:
    # one search call per package to look up its id
    src_pkgobj = self.service_api.search_packages(filename=pkg)
    pids.append(src_pkgobj['id'])
# repo_id: id of the target repo
repository_api.add_package(repo_id, pids)

This is *very* slow when adding a lot of packages.

A new call is required:
repository_api.add_package(repo_id, [(fn, sha256sum), (fn, sha256sum)])

Even better would be to provide multiple repos at once and do all work in a single transaction on server. Arguments would look like:
{
  (fn, sha256sum): [repos],
  (fn, sha256sum): [repos],
  ...
}


Not sure about JSON limitations, but this wouldn't work with XML-RPC.
In that case, you'll need to use the following args:
[
 [fn, sha256sum, [repos]],
 [fn, sha256sum, [repos]],
]

or

[
  {"filename": fn, "sha256sum": sha256sum, "repos": repos},

]
(this one is more expensive in terms of bandwidth and serialization due to the dictionary keys)
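
A minimal sketch of the conversion between the formats above, assuming the caller starts from the tuple-keyed dictionary. Python's json module rejects tuple keys, so the dictionary has to be flattened to the list form before sending. The helper name to_wire_format and the sample filenames and checksums are hypothetical.

```python
import json


def to_wire_format(assoc):
    """Turn {(filename, sha256sum): [repo_ids]} into
    [[filename, sha256sum, [repo_ids]], ...], sorted by key for stable output."""
    return [[fn, checksum, list(repos)]
            for (fn, checksum), repos in sorted(assoc.items())]


# Hypothetical sample data; real checksums would be full sha256 hex digests.
assoc = {
    ("bar-2.0-1.noarch.rpm", "bbb"): ["repo_a"],
    ("foo-1.0-1.noarch.rpm", "aaa"): ["repo_a", "repo_b"],
}

# json.dumps raises TypeError for dictionaries with tuple keys.
try:
    json.dumps(assoc)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# The flattened list form serializes without problems.
payload = to_wire_format(assoc)
encoded = json.dumps(payload)
```

The list form also avoids repeating the "filename", "sha256sum" and "repos" keys for every entry, which is the bandwidth concern with the dictionary-per-package variant.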

Comment 1 John Matthews 2011-03-05 02:35:15 UTC
Added a new web service call

/services/associate/packages

It is under the client API
services.py associate_packages()

There is an example script using this call in git under:
playpen/api/package_associate.py

ServiceAPI::associate_packages()
takes a list as input which has the format of:
[((filename,checksum),[repo_ids])]
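
A rough sketch of building that input list, assuming the package list comes from a CSV file with one "filename,checksum" row per package. The CSV layout and the helper name build_payload are assumptions for illustration; the real playpen/api/package_associate.py may read its input differently.

```python
import csv
import io


def build_payload(csv_text, repo_ids):
    """Build [((filename, checksum), [repo_ids]), ...] from
    'filename,checksum' CSV rows (assumed layout)."""
    payload = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # csv.reader yields [] for blank lines
        filename, checksum = row[0].strip(), row[1].strip()
        payload.append(((filename, checksum), list(repo_ids)))
    return payload


# Hypothetical sample rows; real checksums would be full hex digests.
sample = "foo-1.0-1.noarch.rpm,aaa\nbar-2.0-1.noarch.rpm,bbb\n"
payload = build_payload(sample, ["time_test_g"])
```

The resulting list is what would be passed to ServiceAPI::associate_packages(); the call itself is left out here because it needs a running server.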

This commit introduced this call:
http://git.fedorahosted.org/git/?p=pulp.git;a=commit;h=892986f4e5e3d3e981fa0a86c4103dac3e041212

Comment 2 Daniel Mach 2011-03-07 14:16:48 UTC
The call works, but it's still slow (it's just a server-side wrapper doing the same thing).

Not sure where the problem is exactly,
but my guess is the self.add_package() call in associate_packages().

Comment 3 John Matthews 2011-03-07 14:45:19 UTC
Thanks for the feedback Daniel, I'm looking at the performance now.

Comment 4 John Matthews 2011-03-07 19:31:35 UTC
Daniel,

I made some changes today and this is what I'm seeing:

Working with rhel-i386-server-5, roughly 7000 packages.
Using test script under: playpen/api/package_associate.py
Finished in 17 minutes 23 seconds.

Breakdown:
 ~ 5 minutes is spent in mongo queries translating the filename,checksum pairs to package ids
 ~ 2 minutes is spent on add_package logic
 ~ 10 minutes in createrepo running for first time on repo


5 minutes to translate 7000 filename,checksums to package ids can likely be improved.

This still leaves us with roughly 10 minutes for calling createrepo to generate the package metadata.
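
One general way to speed up the filename,checksum translation is to replace one lookup per package with a single pass that builds an index. The sketch below uses an in-memory list as a stand-in for the package collection; it is not Pulp's code, and the function names and the "filename"/"checksum"/"id" field names are assumptions.

```python
def translate_one_by_one(pairs, packages):
    """Naive: scan the collection once per (filename, checksum) pair."""
    ids = []
    for fn, checksum in pairs:
        for pkg in packages:
            if pkg["filename"] == fn and pkg["checksum"] == checksum:
                ids.append(pkg["id"])
                break
    return ids


def translate_batched(pairs, packages):
    """Batched: build one index, then each lookup is a dict access.
    Pairs with no matching package are skipped, as in the naive version."""
    index = {(p["filename"], p["checksum"]): p["id"] for p in packages}
    return [index[key] for key in pairs if key in index]


# Hypothetical stand-in for the package collection.
packages = [
    {"id": "p1", "filename": "foo-1.0-1.noarch.rpm", "checksum": "aaa"},
    {"id": "p2", "filename": "bar-2.0-1.noarch.rpm", "checksum": "bbb"},
]
pairs = [("bar-2.0-1.noarch.rpm", "bbb"), ("foo-1.0-1.noarch.rpm", "aaa"),
         ("missing.rpm", "zzz")]

slow_ids = translate_one_by_one(pairs, packages)
fast_ids = translate_batched(pairs, packages)
```

Against a real database the same idea would mean fetching the candidate packages in one query rather than issuing roughly 7000 separate ones.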


Commit:
http://git.fedorahosted.org/git/?p=pulp.git;a=commitdiff;h=5e99f59f55f5f791675c9215330ae3459dd7e7ac

Comment 5 Daniel Mach 2011-03-08 15:39:58 UTC
Created attachment 482926
performance patch

This should boost performance a bit.

Could you review it and test it properly?
I didn't do much testing except some file uploads...

Comment 6 John Matthews 2011-03-10 05:27:26 UTC
Daniel,

I incorporated parts of your patch into the changes I made tonight.  Performance has improved a lot: the previous run of 17 minutes is now down to 11.5 minutes, of which createrepo takes up about 11 minutes.  Roughly 30 seconds is spent on Pulp's own logic.


2011-03-10 00:01:53,810 [INFO][Dummy-2] associate_packages() @ repo.py:1834 - Translated 7184 filename,checksums in 1.62668609619 seconds
2011-03-10 00:01:55,861 [INFO][Dummy-2] add_package() @ repo.py:801 - Finished created pkg_object in 2.04557085037 seconds
2011-03-10 00:01:56,013 [INFO][Dummy-2] add_package() @ repo.py:826 - Finished check of NEVRA/filename in argument data by 2.1964328289 seconds
2011-03-10 00:02:02,068 [INFO][Dummy-2] add_package() @ repo.py:848 - Finished check of existing NEVRA by 8.25247097015 seconds
2011-03-10 00:02:02,109 [INFO][Dummy-2] add_package() @ repo.py:861 - Finished check of get_packages_by_filename() by 8.29352998734 seconds
2011-03-10 00:02:09,768 [INFO][Dummy-2] add_package() @ repo.py:878 - inside of repo.add_packages() adding packages took 15.9525318146 seconds
2011-03-10 00:13:21,777 [INFO][Dummy-2] create_repo() @ util.py:353 - [createrepo --checksum sha256 --update /var/lib/pulp//repos/time_test_g] on /var/lib/pulp//repos/time_test_g finished
2011-03-10 00:13:21,913 [ERROR][Dummy-2] associate_packages() @ repo.py:1848 - repo.add_package(time_test_g) for 7184 packages took 688.102610111 seconds
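
As a quick check of where the time goes, the createrepo share can be estimated from the log timestamps above. This sketch assumes createrepo starts right after the "adding packages took" line at 00:02:09,768, so the figure is an upper bound that includes anything else done between the two log lines.

```python
from datetime import datetime

# Timestamp layout used by the log lines (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Last add_package() line to the "createrepo ... finished" line.
createrepo_secs = seconds_between("2011-03-10 00:02:09,768",
                                  "2011-03-10 00:13:21,777")

# Total reported by associate_packages() for the whole add_package call.
total_secs = 688.102610111
share = createrepo_secs / total_secs
```

Under that assumption createrepo accounts for about 672 of the 688 seconds, which matches the rough figure of 11 minutes out of 11.5.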


$ time ./package_associate.py rhel-i386-5.csv time_test_g
Success, no errors occurred

real	11m30.231s
user	0m0.263s
sys	0m0.045s


http://git.fedorahosted.org/git/?p=pulp.git;a=commitdiff;h=b2c85da3e12da255e30def97547a9e0e881d9627


It's a bit late, so I'm stopping for tonight; I'll test this further tomorrow.

Comment 7 Jeff Ortel 2011-03-10 16:43:09 UTC
Build: 0.147

Comment 8 Preethi Thomas 2012-02-24 16:45:22 UTC
moving to verified.

Comment 9 Preethi Thomas 2012-02-24 20:13:23 UTC
Pulp v1.0 is released
Closed Current Release.

