Bug 995528 - mongo AutoReference SON manipulator reduces performance on large result sets
Summary: mongo AutoReference SON manipulator reduces performance on large result sets
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Pulp
Classification: Retired
Component: z_other
Version: 2.2 Beta
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.3.0
Assignee: Barnaby Court
QA Contact: Preethi Thomas
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-08-09 15:47 UTC by Jeff Ortel
Modified: 2013-12-09 14:31 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-12-09 14:31:50 UTC
Embargoed:


Attachments
benchmarking script (342 bytes, text/x-python)
2013-08-09 15:47 UTC, Jeff Ortel

Description Jeff Ortel 2013-08-09 15:47:18 UTC
Created attachment 784902 [details]
benchmarking script

Description of problem:

While benchmarking the retrieval of all content units associated with a repository, I discovered that cursor iteration was very slow. Simply fetching ALL of the units from the units_rpm collection took 9.2 seconds just to iterate the cursor, which had a result set of 3178 documents. Using a hand-created connection (not using pulp.server.db.connection), the same thing took ~1.6 seconds. Using the mongo CLI, it took ~900 ms.
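
The attached script is not reproduced in this report; the snippet below is only a minimal sketch of that kind of measurement, assuming pymongo 2.x and the default Pulp database and collection names (pulp_database, units_rpm), both of which are assumptions on my part:

# Hypothetical stand-in for the attached benchmark: time how long it takes
# to iterate every document in the units_rpm collection over a plain
# (manipulator-free) connection.
import time

from pymongo import MongoClient

db = MongoClient('localhost', 27017)['pulp_database']

start = time.time()
count = sum(1 for _ in db.units_rpm.find())
elapsed = time.time() - start

print('iterated %d documents in %.3f seconds' % (count, elapsed))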

The performance decrease seems to be related to the AutoReference mongo SON manipulator.  When commenting out:

_DATABASE.add_son_manipulator(AutoReference(_DATABASE))

The iteration time dropped to 2.2 seconds.
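
For context, the manipulator is registered on the pymongo Database object at connection setup. A quick way to see its per-document cost is to repeat the iteration timing above on a fresh connection before and after registering it (a sketch, assuming pymongo 2.x, where pymongo.son_manipulator.AutoReference and Database.add_son_manipulator are available):

# pymongo 2.x sketch: register AutoReference the way pulp.server.db.connection
# does, then re-run the timing loop above.  AutoReference applies
# transform_outgoing() to every document the cursor yields, which is where
# the per-document overhead comes from.
from pymongo.son_manipulator import AutoReference

db.add_son_manipulator(AutoReference(db))  # comment this out to compare timings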

I don't know what Pulp uses AutoReference for, though I suspect it has to do with the REST API. But for plugins fetching large result sets, this represents a pretty significant performance impact. For example, for node operations such as publishing and syncing, it means the difference between ~30 seconds and ~4 seconds to iterate the cursor(s) when fetching the content units associated with a repository.

I attached the script used to do the benchmark.

Comment 1 Jeff Ortel 2013-09-10 14:32:43 UTC
I just learned that the Katello folks routinely work with repositories containing 30k+ packages (yes, the CDN has repositories that big). To put this in perspective, just querying the units in a repository that size will take ~92 seconds vs. ~16 seconds (using really lazy math). This affects every distributor and heavily impacts node sync.
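
For reference, the quoted ~92 s / ~16 s looks like a straight linear extrapolation of the numbers in the description (9.2 s with the manipulator, 1.6 s without, for 3178 documents), rounded to roughly a factor of ten:

# Rough linear extrapolation of the original measurements (3178 docs)
# to a 30k-package repository.
docs_measured, docs_large = 3178, 30000
with_autoref, without_autoref = 9.2, 1.6            # seconds, from the description
scale = float(docs_large) / docs_measured

print('with AutoReference:    ~%.0f s' % (with_autoref * scale))     # ~87 s
print('without AutoReference: ~%.0f s' % (without_autoref * scale))  # ~15 s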

Comment 2 Barnaby Court 2013-09-17 20:52:55 UTC
Pull request: https://github.com/pulp/pulp/pull/619

Comment 3 Jeff Ortel 2013-09-18 23:58:53 UTC
build: 2.3.0-0.14.alpha

Comment 4 Preethi Thomas 2013-10-09 21:24:50 UTC
verified

[root@hp-sl2x160zg6-01 ~]# pulp-admin repo list
+----------------------------------------------------------------------+
                              Repositories
+----------------------------------------------------------------------+

Id:                  rhel6-4
Display Name:        rhel6-4
Description:         None
Content Unit Counts: 
  Distribution:           1
  Erratum:                1956
  Package Category:       10
  Package Group:          202
  Rpm:                    11003
  Yum Repo Metadata File: 1


[root@hp-sl2x160zg6-01 ~]# python benchmark.py 
duration: 9.683 (seconds)
[root@hp-sl2x160zg6-01 ~]# 



<jortel> I only had 3178 units and it took my little VM 2.2 seconds.  On the same VM, it would have taken 7.616 seconds to fetch 11,003 rpms.  Seems about right.
<jortel> without the fix, it would have taken ~90 seconds.

Comment 5 Preethi Thomas 2013-12-09 14:31:50 UTC
Pulp 2.3 released.

