Bug 673535

Summary: Performance issue with repo.py::_get_existing_repo()
Product: [Retired] Pulp Reporter: John Matthews <jmatthew>
Component: z_otherAssignee: John Matthews <jmatthew>
Status: CLOSED CURRENTRELEASE QA Contact: Preethi Thomas <pthomas>
Severity: high Docs Contact:
Priority: unspecified    
Version: unspecifiedCC: pkilambi, skarmark
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-16 12:10:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 647488    

Description John Matthews 2011-01-28 15:50:03 UTC
While debugging some performance issues I saw that refreshing a repo object takes a significant amount of time, the more packages in the repo document the longer it takes.

For some rough numbers look at:
pulp/playpen/perf/time_repos.py

This will run "_get_existing_repo()" for each repo id available in mongo.

Below are some examples, notice that for a rhel-i386-server-5 repo it takes 12 seconds to run this:

Packages: 0 Length: 707 _get_existing_repo(clone_repo500) took 0.000874042510986 seconds
Packages: 500 Length: 456182 _get_existing_repo(repo500) took 0.266913175583 seconds
Packages: 1118 Length: 3195775 _get_existing_repo(f14_testing) took 0.954846858978 seconds
Packages: 5736 Length: 8749039 _get_existing_repo(f14_updates) took 8.28494310379 seconds
Packages: 7184 Length: 10369577 _get_existing_repo(local1) took 12.5831038952 seconds

When we fetch the repo document we are fetching all the fields.  The packages field is the largest, contributing almost the entire size to the repo.  

One possible fix could be we explore running without the SON Manipulator for AutoReference.  The repo document is store in mongo with a series of package document id's, when we fetch the Repo object pymongo is automatically dereferencing each package id to the package object, so we get back a full repo document with package objects populated.

Comment 1 John Matthews 2011-01-28 15:53:55 UTC
Another note, this call is relied on heavily through out pulp.  It's often called in loops that are processing packages.

Comment 2 John Matthews 2011-02-01 20:31:49 UTC
The change chosen was to convert repo["packages"] to a list of package ids, it is no longer a dictionary of package objects.

http://git.fedorahosted.org/git/?p=pulp.git;a=commitdiff;h=f2e1506eff24394206a191d39eb7f6edd859d45d

QE: There is a script under pulp's git checkout, "playpen/perf/time_repos.py" which can be run to generate the timings in comment #0.   I would recommend syncing a large repo and running it.  You should see the _get_available_repos() call takes under a second to complete.

Comment 3 Jay Dobies 2011-02-02 20:07:09 UTC
Fixed in 0.134.

Comment 4 Preethi Thomas 2011-02-03 15:12:02 UTC
verified

[root@preethi perf]# ./time_repos.py 
time_get_available_repoids() @ time_repos.py:71 - 2 ids query for repositories took 0.00137710571289 seconds
time_get_available_repoids() @ time_repos.py:76 - 2 ids query for repositories custom took 0.0001380443573 seconds
time_get_existing_repo() @ time_repos.py:85 - All fields.  Packages: 0 Length: 554 _get_existing_repo(foo) took 0.000941038131714 seconds
time_get_existing_repo() @ time_repos.py:85 - All fields.  Packages: 5932 Length: 2239867 _get_existing_repo(f14-update) took 0.328789949417 seconds
time_get_existing_repo() @ time_repos.py:94 - Limited fields: Shortened fetch<no packages>: Length: 71 _get_existing_repo(foo) took 0.00431704521179 seconds
time_get_existing_repo() @ time_repos.py:94 - Limited fields: Shortened fetch<no packages>: Length: 120 _get_existing_repo(f14-update) took 0.000724792480469 seconds
time_package_lookups_batched() @ time_repos.py:119 - Full retrieval <batched> of Repo<554> foo and 0 packages<2> took 0.0267741680145 seconds.
time_package_lookups_batched() @ time_repos.py:119 - Full retrieval <batched> of Repo<2239867> f14-update and 5932 packages<8252543> took 2.4161260128 seconds.
[root@preethi perf]# 

[root@preethi perf]# rpm -q pulp
pulp-0.0.134-1.fc14.noarch

Comment 5 Preethi Thomas 2011-08-16 12:10:51 UTC
Closing with Community Release 15

pulp-0.0.223-4.

Comment 6 Preethi Thomas 2011-08-16 12:22:37 UTC
Closing with Community Release 15

pulp-0.0.223-4.