Bug 995528 - mongo AutoReference SON manipulator reduces performance on large result sets
Status: CLOSED CURRENTRELEASE
Product: Pulp
Classification: Community
Component: z_other
Version: 2.2 Beta
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.3.0
Assigned To: Barnaby Court
QA Contact: Preethi Thomas
Keywords: Triaged
Depends On:
Blocks:
Reported: 2013-08-09 11:47 EDT by Jeff Ortel
Modified: 2013-12-09 09:31 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-12-09 09:31:50 EST
Type: Bug


Attachments
benchmarking script (342 bytes, text/x-python)
2013-08-09 11:47 EDT, Jeff Ortel
Description Jeff Ortel 2013-08-09 11:47:18 EDT
Created attachment 784902
benchmarking script

Description of problem:

While benchmarking the retrieval of all content units associated with a repository, I discovered that cursor iteration was very slow.  Simply fetching ALL of the units from the units_rpm collection took 9.2 seconds just to iterate the cursor, which had a result set of 3178 documents.  Using a hand-created connection (not using pulp.server.db.connection), the same thing took ~1.6 seconds.  And using the mongo CLI, ~900 ms.

The performance decrease seems to be related to the AutoReference mongo SON manipulator.  When commenting out:

_DATABASE.add_son_manipulator(AutoReference(_DATABASE))

With that line commented out, the same iteration took 2.2 seconds.
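
For illustration only, here is a rough sketch of why a manipulator like this can be expensive. My understanding is that AutoReference walks every outgoing document recursively, looking for DBRef values to dereference, so the cost scales with the number of documents and how deeply they nest. This is a pure-Python stand-in, not pymongo's code: FakeDBRef, walk_outgoing, make_unit, and the document shape are all made up for the example.

```python
import time


class FakeDBRef(object):
    """Hypothetical stand-in for a database reference; not bson.dbref.DBRef."""

    def __init__(self, collection, id_):
        self.collection = collection
        self.id = id_


def walk_outgoing(value, dereference):
    """Recursively copy a document, replacing FakeDBRef values via dereference()."""
    if isinstance(value, FakeDBRef):
        return dereference(value)
    if isinstance(value, dict):
        return dict((k, walk_outgoing(v, dereference)) for k, v in value.items())
    if isinstance(value, list):
        return [walk_outgoing(v, dereference) for v in value]
    return value


def make_unit(i):
    # Made-up document shape with nested lists, loosely like a package unit.
    return {
        "_id": i,
        "name": "pkg-%d" % i,
        "requires": [{"name": "dep-%d" % n, "flags": "GE"} for n in range(50)],
        "provides": [{"name": "cap-%d" % n} for n in range(50)],
    }


def timed(func):
    """Run func() and return (result, elapsed_seconds)."""
    start = time.time()
    result = func()
    return result, time.time() - start


units = [make_unit(i) for i in range(3178)]

# Baseline: just iterate the documents.
plain, plain_seconds = timed(lambda: [u for u in units])

# Same iteration, but every document is walked even though it holds no references.
walked, walked_seconds = timed(
    lambda: [walk_outgoing(u, lambda ref: None) for u in units])

print("plain iteration: %.3f s" % plain_seconds)
print("with recursive walk: %.3f s" % walked_seconds)
```

The point is that the walk is paid per document whether or not the document contains any references at all.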

I don't know what Pulp uses AutoReference for, though I suspect it has to do with the REST API.  But for plugins fetching large result sets, this is a significant performance hit.  For example, for node operations such as publishing and syncing, it means the difference between 30 seconds and 4 seconds to iterate the cursor(s) when fetching the content units associated with a repository.

I attached the script used to do the benchmark.
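
For readers without access to the attachment, the measurement boils down to timing a full pass over a cursor. The following is a minimal sketch of that idea and not the attached script; time_cursor is a hypothetical helper, and it uses an in-memory stand-in so it runs without a database.

```python
import time


def time_cursor(cursor):
    """Iterate any iterable of documents and return (count, elapsed_seconds)."""
    start = time.time()
    count = 0
    for _ in cursor:
        count += 1
    return count, time.time() - start


# Stand-in for a real cursor so the sketch is runnable anywhere.
# Against a live database you would pass something like collection.find()
# for the units_rpm collection instead (connection setup is not shown here
# and would be an assumption about your deployment).
fake_cursor = iter([{"_id": n} for n in range(3178)])

count, seconds = time_cursor(fake_cursor)
print("documents: %d" % count)
print("duration: %.3f (seconds)" % seconds)
```

Running the same helper once with the Pulp connection and once with a hand-created one is enough to reproduce the comparison described above.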
Comment 1 Jeff Ortel 2013-09-10 10:32:43 EDT
Just learned that Katello folks routinely work with repositories containing 30k+ packages (yes, the CDN has repositories that big).  To put this in perspective, just querying the units in a repository that big will take ~92 seconds vs ~16 seconds (using really lazy math: the 3178-document timings times ten).  This affects every distributor and heavily impacts node sync.
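
The lazy math can be tightened slightly. Assuming iteration time scales linearly with document count (an assumption, not something measured here), the 3178-document timings extrapolate to 30,000 documents as below. The helper name extrapolate is made up for this sketch.

```python
def extrapolate(measured_seconds, measured_docs, target_docs):
    """Scale a measured duration linearly to a different document count."""
    return measured_seconds * float(target_docs) / measured_docs


MEASURED_DOCS = 3178   # result set size from the original benchmark
TARGET_DOCS = 30000    # rough size of a large CDN repository

timings = [
    ("Pulp connection with AutoReference", 9.2),
    ("hand-created connection", 1.6),
]

for label, seconds in timings:
    estimate = extrapolate(seconds, MEASURED_DOCS, TARGET_DOCS)
    print("%s: ~%.1f s" % (label, estimate))
```

That comes out to roughly 87 seconds vs 15 seconds, in line with the rounded figures quoted above.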
Comment 2 Barnaby Court 2013-09-17 16:52:55 EDT
Pull request: https://github.com/pulp/pulp/pull/619
Comment 3 Jeff Ortel 2013-09-18 19:58:53 EDT
build: 2.3.0-0.14.alpha
Comment 4 Preethi Thomas 2013-10-09 17:24:50 EDT
verified

[root@hp-sl2x160zg6-01 ~]# pulp-admin repo list
+----------------------------------------------------------------------+
                              Repositories
+----------------------------------------------------------------------+

Id:                  rhel6-4
Display Name:        rhel6-4
Description:         None
Content Unit Counts: 
  Distribution:           1
  Erratum:                1956
  Package Category:       10
  Package Group:          202
  Rpm:                    11003
  Yum Repo Metadata File: 1


[root@hp-sl2x160zg6-01 ~]# python benchmark.py 
duration: 9.683 (seconds)
[root@hp-sl2x160zg6-01 ~]# 



<jortel> I only had 3178 and it took my little VM 2.2 seconds.  on my same VM, it would have taken 7.616 seconds to fetch 11,003 rpms.  seems about right.
<jortel> without the fix, it would have taken ~90 seconds.
Comment 5 Preethi Thomas 2013-12-09 09:31:50 EST
Pulp 2.3 released.
