Description of problem:
Incremental update of the content view takes much longer to complete than the full content view publish task.

Version-Release number of selected component (if applicable):
Satellite 6.7

How reproducible:
100%

Steps to Reproduce:
1. Create a content view of multiple RHEL repositories and publish it.
2. Run an incremental update by adding one erratum to the newly published content view.
3. Publish a newer version of the content view (full publish of the CV).
4. Compare the time taken for both tasks.

Actual results:
The incremental update of the content view takes much longer to complete than the full CV publish task.

Expected results:

Additional info:
Created redmine issue http://projects.theforeman.org/issues/30219 from this bug
Based on my investigation, this delay is caused by a change where we now do dependency solving using sibling repos in the content view. Katello makes a call to pulp to copy contents from the source repo to the target repo and passes all sibling repos in the content view as additional_repos. As per the task exports, this call spawns a long-running task in pulp. On the katello side, this is intended functionality. Moving to pulp to possibly optimize the task.
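To make the shape of these calls concrete, here is a minimal sketch of how one copy request per repo pair could be built, with every sibling pair passed as additional_repos. The function name and repo IDs are illustrative, not the actual Katello code; only the additional_repos key reflects what is described above.

```python
# Hypothetical sketch: one Pulp 2 copy request per (source, target) repo pair,
# carrying every *other* pair in the content view as additional_repos.
# build_copy_requests and the repo IDs are made up for illustration.

def build_copy_requests(cv_repos):
    requests = []
    for source, target in cv_repos:
        additional = {s: t for s, t in cv_repos if (s, t) != (source, target)}
        requests.append({
            "source_repo_id": source,
            "dest_repo_id": target,
            "override_config": {"additional_repos": additional},
        })
    return requests

pairs = [("rhel7-lib", "cv-rhel7"), ("rhscl-lib", "cv-rhscl"),
         ("tools-lib", "cv-tools")]
reqs = build_copy_requests(pairs)
# One copy task is spawned per repo pair; each task sees the other pairs
# only as additional_repos for dependency solving.
```

This illustrates why the number of spawned Pulp tasks grows with the number of repositories in the content view.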
(In reply to Samir Jha from comment #13)
> Based on my investigation, this delay is caused by a change where we now do
> dependency solving using sibling repos in the content view. Katello makes a
> call to pulp to copy contents from the source repo to the target repo and
> passes all sibling repos in the content view as additional_repos. As per the
> task exports, this call spawns a long-running task in pulp. On the katello
> side, this is intended functionality. Moving to pulp to possibly optimize
> the task.

I confirm this behaviour. I noticed it on a bigger CV with multiple repos, where an incremental update triggered many Actions::Pulp::Repository::CopyUnits tasks with additional_repos siblings, where:

- each task ran for 30 minutes (that sounds like a *lot* for simply copying units among pulp repos; underneath, mongo was not the bottleneck here)
- just a very few pulp workers were contributing to the concurrent pulp.server.managers.repo.unit_association.associate_from_repo tasks (i.e. just 3 out of 8 workers; the others were idle) - this sounds like an independent issue in resource_manager

If the pulp devels would find it useful, I can prepare an environment with a deterministic (repeatedly applicable) reproducer - just let me know.
I have two possible suggested actionables, one for Pulp and one for Katello. The Pulp one is likely easier and can be implemented quickly.

First, some background on what happens when you do a depsolving-enabled copy in Pulp 2, simplified slightly:

* First the "source" repository is searched for content units that match the content that is specified for the copy (in this case, an erratum).
* Then the contents of the source and destination repositories, and of all the repositories specified as "additional repos", are read from the database into the dependency solver. This is a very expensive and I/O-intensive operation. The depsolver needs to be loaded with the name, epoch-version-release, arch, and dependency information (requires, obsoletes, recommends, provides) as well as the filelists of each and every package in all of the involved repositories.
* Then we tell the depsolver which packages we want to copy and start depsolving. This is actually quite fast compared to the process of loading the depsolver.
* Then we get an answer back on which content units are needed to make the transaction successful.
* Then we copy those content units into the repositories they need to go to.

By a large margin the most expensive operation here is the database I/O necessary to load the dependency solver with all of the relationships. This is already about as fast as we can make it, at least in MongoDB. As an aside, my experiments have shown that this is faster in Pulp 3, using PostgreSQL, than it is in Pulp 2 using MongoDB.

One architectural limitation of the way that Pulp 2 does "copy" is that this feature was designed from the beginning to have one source repo and one destination repo, and the criteria for selecting content to be copied apply only to the source repository.
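The steps above can be sketched as a toy model. This is not Pulp's actual depsolver (which works on full NEVRA and filelist data); it assumes repos are simple dicts mapping package names to their requires lists, purely to show where the expensive "load everything" step sits relative to the cheap solving step.

```python
# Toy model of the depsolving-enabled copy flow described above.
# Repos are modeled as {package_name: [required_package_names]} dicts;
# all names and the simplified closure walk are illustrative only.

def depsolve_copy(source, dest, additional, wanted):
    # Step 1: search the source repo for the requested units.
    selected = {name for name in wanted if name in source}
    # Step 2: load every involved repo into the "solver".
    # In real Pulp 2 this is the expensive, I/O-heavy database read.
    universe = {}
    for repo in [source, dest, *additional]:
        universe.update(repo)
    # Steps 3-4: solve, i.e. walk the dependency closure of the selection.
    needed, stack = set(), list(selected)
    while stack:
        pkg = stack.pop()
        if pkg in needed:
            continue
        needed.add(pkg)
        stack.extend(dep for dep in universe.get(pkg, []) if dep in universe)
    # Step 5: copy the resolved units into the destination repo.
    for pkg in needed:
        dest[pkg] = universe[pkg]
    return needed

src = {"dmidecode": ["glibc"]}   # the package the erratum brings in
extra = [{"glibc": []}]          # a sibling repo holding its dependency
dst = {}
needed = depsolve_copy(src, dst, extra, ["dmidecode"])
```

Note that step 2 touches every package in every involved repo even though only a handful of units end up in `needed`, which is the cost the rest of this discussion is about.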
Now that there can be multiple "source repositories" for content to be copied into multiple "destination repositories", this is a problem for some use cases, as it prevents a user from asking Pulp to copy content from multiple source repositories at once (referring to the search criteria, not indirect dependencies of the initially selected content). In order to accomplish this, a user would need to make two separate copy requests.

My understanding (and I may be wrong; Justin or Partha should confirm this) is that when multi-repo copy was introduced, Katello made some changes to accommodate this, namely by doing exactly what I just described. I think that it round-robins pairs of repositories as the source and destination, providing the same set of copy search criteria, and providing all the other pairs of repos as additional repos. Looking at the Dynflow output, it looks like Katello is running copy operations in parallel for 12 different pairs of repositories, which is evidence that this is true.

This is extremely I/O-intensive and extremely inefficient. Basically, we are loading the exact same set of 12 repositories into the depsolver (loading all that information about every single package) 12 different times in parallel. And it scales with the number of repos in both directions: more repos involved means each task takes longer, and it also means more tasks.

Pulp 3 avoids this by having a different copy API, one that allows arbitrary copies between arbitrary sets of source and destination repos without designating a "special" source and destination repo. It can all be done in one task. However, for Pulp 2, I think there are still things we can do to mitigate this.

The Katello actionable would be to find a way to mitigate this: if you know beforehand which repository the erratum being copied is coming from, then submit only a single copy task to Pulp with the appropriate source and destination (and maybe additional repos) set.
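The Katello-side mitigation can be sketched as: look up which repo pair actually contains the erratum, and issue one targeted request instead of one per pair. The function, parameters, and `repo_contains` predicate below are hypothetical stand-ins, not real Katello code.

```python
# Hedged sketch of the proposed Katello-side mitigation: find the one
# (source, target) pair whose source repo holds the erratum and build a
# single copy request for it. All names here are illustrative.

def targeted_copy_request(cv_repos, erratum_id, repo_contains):
    """Return one copy request for the repo pair that contains the
    erratum, or None if no content-view repo contains it."""
    for source, target in cv_repos:
        if repo_contains(source, erratum_id):
            return {"source_repo_id": source, "dest_repo_id": target}
    return None

pairs = [("rhel7-lib", "cv-rhel7"), ("rhscl-lib", "cv-rhscl")]
# Pretend only rhscl-lib carries the erratum:
req = targeted_copy_request(pairs, "RHBA-2021:0854",
                            lambda repo, e: repo == "rhscl-lib")
```

With N repos in the content view this turns N parallel depsolver loads into one, at the cost of one containment lookup up front.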
We can also work around this from the Pulp side. The erratum is likely present in only one of those repositories, which means that when you tell Pulp to copy it, it will only be found in the source repo in one out of N tasks / repos. What Pulp should do, and isn't currently doing, is make sure we exit early if the set of content discovered in the "source" repo to be copied is empty. Right now, even if it's empty and there's nothing to copy, we load the solver anyway. Regardless of any optimizations found on the Katello side, we should probably do this anyway, as it will be simple and low-risk and have a real impact.
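A minimal sketch of that early exit, assuming a simplified model of the association call (the function and parameter names are illustrative, not Pulp's real internals):

```python
# Sketch of the early-exit mitigation: if the search of the source repo
# matches nothing, return before the expensive solver-loading step.
# associate_from_repo here is a toy stand-in, not the real Pulp 2 function.

def associate_from_repo(source_units, criteria, load_solver):
    matched = [u for u in source_units if criteria(u)]
    if not matched:
        # Nothing to copy from this source repo: skip loading every
        # involved repo into the depsolver entirely.
        return []
    solver = load_solver()
    return solver(matched)

# Track whether the expensive loader ever runs.
loader_calls = []
def fake_loader():
    loader_calls.append(1)
    return lambda units: units

empty = associate_from_repo(["pkg-a"], lambda u: False, fake_loader)
found = associate_from_repo(["pkg-a"], lambda u: u == "pkg-a", fake_loader)
```

In the N-task scenario above, N-1 of the tasks would hit the empty-match branch and finish almost immediately instead of each loading the full depsolver.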
Note that there are still limits to how fast we can make this given how much data needs to be transferred around. But reducing the # of tasks trying to do it at the same time will likely help a good deal. And the way this is done in Pulp 3 will definitely help further.
Comment #16 makes a lot of sense to me: all my observations (like the number of "duplicated" dynflow steps, or the very few pulp resource workers assigned to all the tasks) match that explanation. Thanks for the comment! Katello can definitely know which repo contains the erratum, so that should be an easy optimisation to request on the Katello side, I think.
If that information is very easy to know, with very little extra complication or work, then go for it. If you have to do extra searching and calculation / API calls to know it, it may be easier to stick to handling it in Pulp. We can discuss further next week.
My initial thought is that we should just hold off one more release (6.10) to get pulp3 support and all the speed-ups that come with it. But I think this is a small change that we could get in fairly easily. We can target it for 6.9.
Grant got good results testing the pulp patch; we can just stick with that if you'd like to avoid changes on the Katello side. It should be a significant improvement.
I've got no concerns over targeting just the Pulp change for 6.9 and then addressing any further performance improvements for 6.10. Sounds like a good plan.
The katello-side change is very small, so I'm okay including that in place of, or alongside, the pulp change.
Connecting redmine issue https://projects.theforeman.org/issues/31427 from this bug
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
Moving back to NEW for the katello change.
Oops, the katello change is now upstream; moving back to POST.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
Steps to test:
- Deploy a Satellite 6.9, Snap 17.
- Sync the rhel-7-server-rpms, rhel-server-rhscl-7-rpms, and rhel-server-satellite-tools-6.8-rpms repositories.
- Create a content view containing the above 3 repositories.
- Create a Yum Content Filter on the content view with Content Type set to "Erratum - By Date and Type," Inclusion Type set to "Include," Date Type set to "Updated On," and End Date set to 1 October 2020.
- Publish the content view.
- Add a single erratum to the content view using hammer.
- Publish a new version of the content view.

In testing this, I found that running an incremental update for a large content view such as the one described above still takes longer than publishing a full version of the content view. Publishing a full version took about 14 minutes, while an incremental update took about 43 minutes:

~~~
# time hammer content-view version incremental-update --content-view-version-id 15 --errata-ids 9766
[...................................................................................................................................] [100%]
Content View: rhel7 version 1.1
Added Content:
  Errata:
    RHBA-2021:0854
  Packages:
    dmidecode-3.2-5.el7_9.1.x86_64

real    43m27.142s
user    0m11.305s
sys     0m1.675s
~~~

However, running the same operation on Satellite 6.7.5 took 18 minutes longer than on Satellite 6.9. The following output is from a test of this on a content view containing the same repos as above on Satellite 6.7.5:

~~~
# time hammer content-view version incremental-update --content-view-version-id 14 --errata-ids 11169
[...................................................................................................................................] [100%]
Content View: rhel7 version 1.1
Added Content:
  Errata:
    RHBA-2021:0854
  Packages:
    dmidecode-3.2-5.el7_9.1.x86_64

real    61m28.148s
user    0m9.508s
sys     0m1.270s
~~~

I spoke with Partha about this, and he explained that the amount of time taken for an incremental update is an overall limitation of how this functionality is implemented in Pulp 2. The improvement between Satellite 6.7 and 6.9 is the result of limiting the number of calls to the computationally expensive Actions::Pulp::Repository::CopyUnits subtask. On earlier Satellite versions, this subtask is called once for each repository in the content view, regardless of whether any errata being added by the incremental update are present in those repositories. On Satellite 6.9 and 6.8.4, this subtask is only called for repositories that contain errata being added by the incremental update. My 6.9 test Satellite shows only one CopyUnits subtask in the incremental update, while my 6.7 test Satellite shows three of these subtasks. Since this change results in an incremental update that takes ~30% less time on Satellite 6.9 compared to Satellite 6.7, I am marking this BZ as VERIFIED on Satellite 6.9, snap 17.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1313