| Summary: | adding packages to a repo is very slow | | |
|---|---|---|---|
| Product: | [Retired] Pulp | Reporter: | Daniel Mach <dmach> |
| Component: | z_other | Assignee: | John Matthews <jmatthew> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Preethi Thomas <pthomas> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | jortel, tsanders |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | Sprint 21 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-02-24 20:13:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 563609 | | |
| Attachments: | | | |

Added a new web service call /services/associate/packages
It is under the client API services.py associate_packages()
There is an example script using this call in git under: playpen/api/package_associate.py
ServiceAPI::associate_packages() takes a list as input which has the format of:
[((filename,checksum),[repo_ids])]
This commit introduced this call:
http://git.fedorahosted.org/git/?p=pulp.git;a=commit;h=892986f4e5e3d3e981fa0a86c4103dac3e041212

The call works, but it's still slow (it's just a server-side wrapper doing the same thing).
Not sure where the problem is exactly, but my guess is the self.add_package() call in associate_packages().

Thanks for the feedback Daniel, I'm looking at the performance now.

Daniel, I made some changes today and this is what I'm seeing:
Working with rhel-i386-server-5, roughly 7000 packages.
Using test script under: playpen/api/package_associate.py
Finished in 17 minutes 23 seconds.
Breakdown:
~ 5 minutes is spent in mongo queries translating the filename,checksum pairs to package ids
~ 2 minutes is spent on add_package logic
~ 10 minutes in createrepo running for first time on repo
5 minutes to translate 7000 filename,checksums to package ids can likely be improved.
This still leaves us with roughly 10 minutes for calling createrepo to generate the package metadata.
Commit: http://git.fedorahosted.org/git/?p=pulp.git;a=commitdiff;h=5e99f59f55f5f791675c9215330ae3459dd7e7ac

Created attachment 482926 [details]
performance patch
This should boost performance a bit.
Could you review it and test it properly?
I didn't do much testing except some file uploads...

Daniel, I incorporated parts of your patch into changes I made tonight.
Performance has improved a lot, the previous runs of 17 minutes is now down to 11.5 minutes, of that create repo is taking up 11 minutes.
Looks like roughly 30 seconds is spent on processing pulp logic.

2011-03-10 00:01:53,810 [INFO][Dummy-2] associate_packages() @ repo.py:1834 - Translated 7184 filename,checksums in 1.62668609619 seconds
2011-03-10 00:01:55,861 [INFO][Dummy-2] add_package() @ repo.py:801 - Finished created pkg_object in 2.04557085037 seconds
2011-03-10 00:01:56,013 [INFO][Dummy-2] add_package() @ repo.py:826 - Finished check of NEVRA/filename in argument data by 2.1964328289 seconds
2011-03-10 00:02:02,068 [INFO][Dummy-2] add_package() @ repo.py:848 - Finished check of existing NEVRA by 8.25247097015 seconds
2011-03-10 00:02:02,109 [INFO][Dummy-2] add_package() @ repo.py:861 - Finished check of get_packages_by_filename() by 8.29352998734 seconds
2011-03-10 00:02:09,768 [INFO][Dummy-2] add_package() @ repo.py:878 - inside of repo.add_packages() adding packages took 15.9525318146 seconds
2011-03-10 00:13:21,777 [INFO][Dummy-2] create_repo() @ util.py:353 - [createrepo --checksum sha256 --update /var/lib/pulp//repos/time_test_g] on /var/lib/pulp//repos/time_test_g finished
2011-03-10 00:13:21,913 [ERROR][Dummy-2] associate_packages() @ repo.py:1848 - repo.add_package(time_test_g) for 7184 packages took 688.102610111 seconds

$ time ./package_associate.py rhel-i386-5.csv time_test_g
Success, no errors occurred
real 11m30.231s
user 0m0.263s
sys 0m0.045s

http://git.fedorahosted.org/git/?p=pulp.git;a=commitdiff;h=b2c85da3e12da255e30def97547a9e0e881d9627
It's a bit late and will be ending for tonight, tomorrow I'll test this further.

Build: 0.147
moving to verified.

Pulp v1.0 is released

Closed Current Release.

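The timestamps in the log excerpt allow a rough cross-check of how much of the run went to createrepo. The sketch below is only an estimate: it assumes createrepo started right after the 00:02:09,768 line and ended at the 00:13:21,777 line, which is inferred from the order of the log entries rather than stated in them. The helper name `elapsed_seconds` is made up for this example.

```python
from datetime import datetime

# Timestamp layout used by the log lines quoted in this report.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# Assumption: createrepo began immediately after the last add_package() line.
createrepo_seconds = elapsed_seconds(
    "2011-03-10 00:02:09,768",  # "adding packages took ..." line
    "2011-03-10 00:13:21,777",  # "createrepo ... finished" line
)

# Total reported by associate_packages() in the final log line.
total_seconds = 688.102610111

pulp_logic_seconds = total_seconds - createrepo_seconds
createrepo_share = createrepo_seconds / total_seconds

print("createrepo: %.1f s" % createrepo_seconds)
print("pulp logic: %.1f s" % pulp_logic_seconds)
print("createrepo share: %.1f%%" % (createrepo_share * 100))
```

Under that assumption, createrepo accounts for about 672 of the 688 seconds measured inside associate_packages(), which agrees with the "about 11 minutes" figure given in the comment.
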
Adding packages to a repo works this way:

pids = []
for pkginfo in pkglist:
    src_pkgobj = self.service_api.search_packages(filename=pkg)
    pids.append(src_pkgobj['id'])
repository_api.add_package(id, pids)

This is *very* slow when adding a lot of packages.
A new call is required:
repository_api.add_package(repo_id, [(fn, sha256sum), (fn, sha256sum)])

Even better would be to provide multiple repos at once and do all the work in a single transaction on the server.
Arguments would look like:
{
    (fn, sha256sum): [repos],
    (fn, sha256sum): [repos],
    ...
}

Not sure about JSON limitations, but this wouldn't work with XML-RPC.
In that case, you'll need to use the following args:
[
    [fn, sha256sum, [repos]],
    [fn, sha256sum, [repos]],
]
or
[
    {"filename": fn, "sha256sum": sha256sum, "repos": repos},
]
(this one is more expensive in terms of bandwidth and serialization due to dictionary keys)
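
On the JSON question: JSON object keys must be strings, so the tuple-keyed mapping cannot be serialized as written, and Python's `json` module raises `TypeError` for it. The sketch below shows one way to convert that mapping into the list-of-lists form before sending it. The function name `to_wire_format` and the sample file names, checksums, and repo ids are hypothetical and exist only for illustration; this is not code from Pulp.

```python
import json


def to_wire_format(assoc):
    """Convert {(filename, checksum): [repo_ids]} into
    [[filename, checksum, [repo_ids]], ...], sorted by key for stable output."""
    return [
        [filename, checksum, list(repo_ids)]
        for (filename, checksum), repo_ids in sorted(assoc.items())
    ]


# Hypothetical sample data; real checksums would be full sha256 digests.
assoc = {
    ("foo-1.0-1.noarch.rpm", "aaa111"): ["repo_a", "repo_b"],
    ("bar-2.0-1.noarch.rpm", "bbb222"): ["repo_a"],
}

# The tuple-keyed form is rejected by the JSON encoder.
try:
    json.dumps(assoc)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# The list form uses only JSON arrays and strings, so it encodes cleanly.
payload = to_wire_format(assoc)
encoded = json.dumps(payload)

print("tuple keys accepted:", tuple_keys_ok)
print(encoded)
```

The list form also carries no repeated dictionary keys, which is the bandwidth advantage mentioned above over the `{"filename": ..., "sha256sum": ..., "repos": ...}` variant.
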