Created attachment 784902: benchmarking script

Description of problem:

While benchmarking the retrieval of all content units associated with a repository, I discovered that cursor iteration was very slow. Simply fetching ALL of the units from the units_rpm collection took 9.2 seconds just to iterate the cursor, which had a result set of 3178 documents. Using a hand-created connection (not using pulp.server.db.connection), the same thing took ~1.6 seconds. And using the mongo CLI, ~900 ms.

The performance decrease seems to be related to the AutoReference mongo SON manipulator. When commenting out:

_DATABASE.add_son_manipulator(AutoReference(_DATABASE))

the time dropped to 2.2 seconds.

I don't know what pulp uses AutoReference for, though I suspect it has to do with the REST API. But for plugins fetching large result sets, this represents a pretty significant performance impact. For example, for node operations such as publishing and syncing, it means a difference of 30 seconds vs. 4 seconds to iterate the cursor(s) when fetching the content units associated with a repository.

I attached the script used to do the benchmark.
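For readers without access to the attachment, the measurement described above can be approximated with a small timing helper. This is only a sketch and not the attached script: the name time_cursor is hypothetical, and a generator stands in for the real cursor so that the example runs without MongoDB. Against a live database, one would presumably pass the result of a find() call on the units_rpm collection instead.

```python
import time


def time_cursor(cursor):
    """Iterate any iterable of documents; return (count, elapsed_seconds)."""
    start = time.perf_counter()
    # Draining the iterable is the only work being timed.
    count = sum(1 for _ in cursor)
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Stand-in for a real cursor (assumption for illustration only).
    # With pymongo, something like db["units_rpm"].find() would go here,
    # where db is a Database object obtained from the connection under test.
    fake_cursor = ({"_id": i} for i in range(3178))
    count, elapsed = time_cursor(fake_cursor)
    print("documents: %d  duration: %.3f (seconds)" % (count, elapsed))
```

Running the same helper once over a cursor from the pulp connection and once over a cursor from a hand-created connection would give the two figures being compared.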
Just learned that Katello folks routinely work with repositories containing 30k+ packages (Yes, the CDN has repositories that big). So, to put this in perspective, just to query the units in a repository that big will take ~92 seconds vs ~16 seconds (using really lazy math). This affects every distributor and heavily impacts node sync.
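The scaling estimate can be written out explicitly. The sketch below assumes iteration time grows linearly with the number of documents, which the thread implies but does not measure; the function name extrapolate is hypothetical. The inputs are the figures reported earlier (9.2 seconds with the pulp connection and 1.6 seconds with a hand-created connection, both for 3178 documents).

```python
def extrapolate(measured_seconds, measured_docs, target_docs):
    """Scale a measured iteration time linearly to a different result-set size."""
    return measured_seconds * target_docs / measured_docs


if __name__ == "__main__":
    target = 30000  # roughly the size of the large repositories mentioned
    slow = extrapolate(9.2, 3178, target)   # pulp connection
    fast = extrapolate(1.6, 3178, target)   # hand-created connection
    print("pulp connection:         ~%.0f seconds" % slow)
    print("hand-created connection: ~%.0f seconds" % fast)
```

Exact linear scaling gives somewhat smaller numbers than rounding the ratio up to 10x, but the gap between the two cases stays the same order of magnitude.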
Pull request: https://github.com/pulp/pulp/pull/619
build: 2.3.0-0.14.alpha
verified

[root@hp-sl2x160zg6-01 ~]# pulp-admin repo list
+----------------------------------------------------------------------+
                              Repositories
+----------------------------------------------------------------------+

Id:                  rhel6-4
Display Name:        rhel6-4
Description:         None
Content Unit Counts:
  Distribution:           1
  Erratum:                1956
  Package Category:       10
  Package Group:          202
  Rpm:                    11003
  Yum Repo Metadata File: 1

[root@hp-sl2x160zg6-01 ~]# python benchmark.py
duration: 9.683 (seconds)
[root@hp-sl2x160zg6-01 ~]#

<jortel> I only had 3178 and it took my little VM 2.2 seconds. on my same VM, it would have taken 7.616 seconds to fetch 11,003 rpms. seems about right.
<jortel> without the fix, it would have taken ~90 seconds.
Pulp 2.3 released.