Description of problem:
Incremental update of the content view takes much longer to complete than the full content view publish task.

Version-Release number of selected component (if applicable):
Satellite 6.7

How reproducible:
100%

Steps to Reproduce:
1. Create a content view of multiple RHEL repositories and publish it.
2. Run an incremental update by adding one erratum to the newly published content view.
3. Publish a newer version of the content view (full publish of the CV).
4. Compare the time taken for both tasks.

Actual results:
The incremental update of the content view takes much longer to complete than the full CV publish task.

Expected results:

Additional info:
Created redmine issue http://projects.theforeman.org/issues/30219 from this bug
Based on my investigation, this delay is caused by a change where we now do dependency solving using sibling repos in the content view. Katello makes a call to pulp to copy contents from the source repo to the target repo and passes all sibling repos in the content view as additional_repos. As per the task exports, this call spawns a long-running task in pulp. On the katello side, this is intended functionality. Moving to pulp to possibly optimize the task.
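To make the shape of these calls concrete, here is a minimal sketch of how one copy request per repo pair could be built, with every sibling pair passed as additional_repos. The function name and repo IDs are illustrative, not the actual Katello code; only the additional_repos key reflects what is described above.

```python
# Hypothetical sketch: one Pulp 2 copy request per (source, target) repo pair,
# carrying every *other* pair in the content view as additional_repos.
# build_copy_requests and the repo IDs are made up for illustration.

def build_copy_requests(cv_repos):
    requests = []
    for source, target in cv_repos:
        additional = {s: t for s, t in cv_repos if (s, t) != (source, target)}
        requests.append({
            "source_repo_id": source,
            "dest_repo_id": target,
            "override_config": {"additional_repos": additional},
        })
    return requests

pairs = [("rhel7-lib", "cv-rhel7"), ("rhscl-lib", "cv-rhscl"),
         ("tools-lib", "cv-tools")]
reqs = build_copy_requests(pairs)
# One copy task is spawned per repo pair; each task sees the other pairs
# only as additional_repos for dependency solving.
```

This illustrates why the number of spawned Pulp tasks grows with the number of repositories in the content view.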
(In reply to Samir Jha from comment #13)
> Based on my investigation, this delay is caused by a change where we now do
> dependency solving using sibling repos in the content view. Katello makes a
> call to pulp to copy contents from the source repo to the target repo and
> passes all sibling repos in the content view as additional_repos. As per the
> task exports, this call spawns a long-running task in pulp. On the katello
> side, this is intended functionality. Moving to pulp to possibly optimize
> the task.

I confirm this behaviour. I noticed it on a bigger CV with multiple repos, where an incremental update triggered many Actions::Pulp::Repository::CopyUnits tasks with additional_repos siblings, where:

- each task ran for 30 minutes (that sounds like a *lot* for simply copying units among pulp repos; underneath, mongo was not the bottleneck here)
- just a very few pulp workers were contributing to the concurrent pulp.server.managers.repo.unit_association.associate_from_repo tasks (i.e. just 3 out of 8 workers; the others were idle) - this sounds like an independent issue in resource_manager

If the pulp devels would find it useful, I can prepare an environment with a deterministic (repeatedly applicable) reproducer - just let me know.
I have two possible suggested actionables, one for Pulp and one for Katello. The Pulp one is likely easier and can be implemented quickly.

First, some background on what happens when you do a depsolving-enabled copy in Pulp 2, simplified slightly:

* First the "source" repository is searched for content units that match the content that is specified for the copy (in this case, an erratum).
* Then the contents of the source and destination repositories, and of all the repositories specified as "additional repos", are read from the database into the dependency solver. This is a very expensive and I/O-intensive operation. The depsolver needs to be loaded with the name, epoch-version-release, arch, and dependency information (requires, obsoletes, recommends, provides) as well as the filelists of each and every package in all of the involved repositories.
* Then we tell the depsolver which packages we want to copy and start depsolving. This is actually quite fast compared to the process of loading the depsolver.
* Then we get an answer back on which content units are needed to make the transaction successful.
* Then we copy those content units into the repositories they need to go to.

By a large margin the most expensive operation here is the database I/O necessary to load the dependency solver with all of the relationships. This is already about as fast as we can make it, at least in MongoDB. As an aside, my experiments have shown that this is faster in Pulp 3, using PostgreSQL, than it is in Pulp 2 using MongoDB.

One architectural limitation of the way that Pulp 2 does "copy" is that this feature was designed from the beginning to have one source repo and one destination repo, and the criteria for selecting content to be copied apply only to the source repository.
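The steps above can be sketched as a toy model. This is not Pulp's actual depsolver (which works on full NEVRA and filelist data); it assumes repos are simple dicts mapping package names to their requires lists, purely to show where the expensive "load everything" step sits relative to the cheap solving step.

```python
# Toy model of the depsolving-enabled copy flow described above.
# Repos are modeled as {package_name: [required_package_names]} dicts;
# all names and the simplified closure walk are illustrative only.

def depsolve_copy(source, dest, additional, wanted):
    # Step 1: search the source repo for the requested units.
    selected = {name for name in wanted if name in source}
    # Step 2: load every involved repo into the "solver".
    # In real Pulp 2 this is the expensive, I/O-heavy database read.
    universe = {}
    for repo in [source, dest, *additional]:
        universe.update(repo)
    # Steps 3-4: solve, i.e. walk the dependency closure of the selection.
    needed, stack = set(), list(selected)
    while stack:
        pkg = stack.pop()
        if pkg in needed:
            continue
        needed.add(pkg)
        stack.extend(dep for dep in universe.get(pkg, []) if dep in universe)
    # Step 5: copy the resolved units into the destination repo.
    for pkg in needed:
        dest[pkg] = universe[pkg]
    return needed

src = {"dmidecode": ["glibc"]}   # the package the erratum brings in
extra = [{"glibc": []}]          # a sibling repo holding its dependency
dst = {}
needed = depsolve_copy(src, dst, extra, ["dmidecode"])
```

Note that step 2 touches every package in every involved repo even though only a handful of units end up in `needed`, which is the cost the rest of this discussion is about.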
Now that there can be multiple "source repositories" for content to be copied into multiple "destination repositories", this is a problem for some use cases, as it prevents a user from asking Pulp to copy content from multiple source repositories at once (referring to the search criteria, not indirect dependencies of the initially selected content). In order to accomplish this, a user would need to make two separate copy requests.

My understanding (and I may be wrong; Justin or Partha should confirm this) is that when multi-repo copy was introduced, Katello made some changes to accommodate this, namely by doing exactly what I just described. I think that it round-robins pairs of repositories as the source and destination, providing the same set of copy search criteria, and providing all the other pairs of repos as additional repos. Looking at the Dynflow output, it looks like Katello is running copy operations in parallel for 12 different pairs of repositories, which is evidence that this is true.

This is extremely I/O-intensive and extremely inefficient. Basically, we are loading the exact same set of 12 repositories into the depsolver (loading all that information about every single package) 12 different times in parallel. And it scales with the number of repos in both directions: more repos involved means each task takes longer, and it also means more tasks.

Pulp 3 avoids this by having a different copy API, one that allows arbitrary copies between arbitrary sets of source and destination repos without designating a "special" source and destination repo. It can all be done in one task. However, for Pulp 2, I think there are still things we can do to mitigate this.

The Katello actionable would be to find a way to mitigate this: if you know beforehand which repository the erratum being copied is coming from, then submit only a single copy task to Pulp with the appropriate source and destination (and maybe additional repos) set.
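The Katello-side mitigation can be sketched as: look up which repo pair actually contains the erratum, and issue one targeted request instead of one per pair. The function, parameters, and `repo_contains` predicate below are hypothetical stand-ins, not real Katello code.

```python
# Hedged sketch of the proposed Katello-side mitigation: find the one
# (source, target) pair whose source repo holds the erratum and build a
# single copy request for it. All names here are illustrative.

def targeted_copy_request(cv_repos, erratum_id, repo_contains):
    """Return one copy request for the repo pair that contains the
    erratum, or None if no content-view repo contains it."""
    for source, target in cv_repos:
        if repo_contains(source, erratum_id):
            return {"source_repo_id": source, "dest_repo_id": target}
    return None

pairs = [("rhel7-lib", "cv-rhel7"), ("rhscl-lib", "cv-rhscl")]
# Pretend only rhscl-lib carries the erratum:
req = targeted_copy_request(pairs, "RHBA-2021:0854",
                            lambda repo, e: repo == "rhscl-lib")
```

With N repos in the content view this turns N parallel depsolver loads into one, at the cost of one containment lookup up front.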
We can also work around this from the Pulp side. The erratum is likely present in only one of those repositories, which means that when you tell Pulp to copy it, it will only be found in the source repo in one out of N tasks / repos. What Pulp should do, and isn't currently doing, is make sure we exit early if the set of content discovered in the "source" repo to be copied is empty. Right now, even if it's empty and there's nothing to copy, we load the solver anyway. Regardless of any optimizations found on the Katello side, we should probably do this anyway, as it will be simple and low-risk and have a real impact.
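A minimal sketch of that early exit, assuming a simplified model of the association call (the function and parameter names are illustrative, not Pulp's real internals):

```python
# Sketch of the early-exit mitigation: if the search of the source repo
# matches nothing, return before the expensive solver-loading step.
# associate_from_repo here is a toy stand-in, not the real Pulp 2 function.

def associate_from_repo(source_units, criteria, load_solver):
    matched = [u for u in source_units if criteria(u)]
    if not matched:
        # Nothing to copy from this source repo: skip loading every
        # involved repo into the depsolver entirely.
        return []
    solver = load_solver()
    return solver(matched)

# Track whether the expensive loader ever runs.
loader_calls = []
def fake_loader():
    loader_calls.append(1)
    return lambda units: units

empty = associate_from_repo(["pkg-a"], lambda u: False, fake_loader)
found = associate_from_repo(["pkg-a"], lambda u: u == "pkg-a", fake_loader)
```

In the N-task scenario above, N-1 of the tasks would hit the empty-match branch and finish almost immediately instead of each loading the full depsolver.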
Note that there are still limits to how fast we can make this given how much data needs to be transferred around. But reducing the # of tasks trying to do it at the same time will likely help a good deal. And the way this is done in Pulp 3 will definitely help further.
Comment #16 makes a lot of sense to me: all my observations (like the number of "duplicated" dynflow steps, or the very few pulp resource workers assigned to all the tasks) match that explanation. Thanks for the comment! Katello can definitely know which repo contains the erratum, so that should be an easy optimisation to request on the Katello side, I think.
If that information is very easy to know, with very little extra complication or work, then go for it. If you have to do extra searching and calculation / API calls to know it, it may be easier to stick to handling it in Pulp. We can discuss further next week.
My initial thought is that we should just hold off one more release (6.10) to get pulp3 support and all the speed-ups that come with it. But I think this is a small change that we could get in fairly easily. We can target it for 6.9.
Grant got good results testing the pulp patch; we can just stick with that if you'd like to avoid changes on the Katello side. It should be a significant improvement.
I've got no concerns over targeting just the Pulp change for 6.9 and then addressing any further performance improvements for 6.10. Sounds like a good plan.
The katello-side change is very small, so I'm okay including that in place of, or alongside, the pulp change.
Connecting redmine issue https://projects.theforeman.org/issues/31427 from this bug
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
Moving back to NEW for the katello change.
Oops, the katello change is now upstream; moving back to POST.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.
Steps to test:
- Deploy a Satellite 6.9, Snap 17.
- Sync the rhel-7-server-rpms, rhel-server-rhscl-7-rpms, and rhel-server-satellite-tools-6.8-rpms repositories.
- Create a content view containing the above 3 repositories.
- Create a Yum Content Filter on the content view with Content Type set to "Erratum - By Date and Type," Inclusion Type set to "Include," Date Type set to "Updated On," and End Date set to 1 October 2020.
- Publish the content view.
- Add a single erratum to the content view using hammer.
- Publish a new version of the content view.

In testing this, I found that running an incremental update for a large content view such as the one described above still takes longer than publishing a full version of the content view. Publishing a full version took about 14 minutes, while an incremental update took about 43 minutes:

~~~
# time hammer content-view version incremental-update --content-view-version-id 15 --errata-ids 9766
[...................................................................................................................................] [100%]
Content View: rhel7 version 1.1
Added Content:
  Errata:
    RHBA-2021:0854
  Packages:
    dmidecode-3.2-5.el7_9.1.x86_64

real    43m27.142s
user    0m11.305s
sys     0m1.675s
~~~

However, running the same operation on Satellite 6.7.5 took 18 minutes longer than on Satellite 6.9. The following output is from a test of this on a content view containing the same repos as above on Satellite 6.7.5:

~~~
# time hammer content-view version incremental-update --content-view-version-id 14 --errata-ids 11169
[...................................................................................................................................] [100%]
Content View: rhel7 version 1.1
Added Content:
  Errata:
    RHBA-2021:0854
  Packages:
    dmidecode-3.2-5.el7_9.1.x86_64

real    61m28.148s
user    0m9.508s
sys     0m1.270s
~~~

I spoke with Partha about this, and he explained that the amount of time taken for an incremental update is an overall limitation of how this functionality is implemented in Pulp 2. The improvement between Satellite 6.7 and 6.9 is the result of limiting the number of calls to the computationally expensive Actions::Pulp::Repository::CopyUnits subtask. On earlier Satellite versions, this subtask is called once for each repository in the content view, regardless of whether any errata being added by the incremental update are present in those repositories. On Satellite 6.9 and 6.8.4, this subtask is only called for repositories that contain errata being added by the incremental update. My 6.9 test Satellite shows only one CopyUnits subtask in the incremental update, while my 6.7 test Satellite shows three of these subtasks. Since this change results in an incremental update that takes ~30% less time on Satellite 6.9 compared to Satellite 6.7, I am marking this BZ as VERIFIED on Satellite 6.9, snap 17.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1313