Bug 1853076

Summary: large capsule syncs cause slow processing of dynflow tasks/steps
Product: Red Hat Satellite
Reporter: Waldirio M Pinheiro <wpinheir>
Component: Capsule - Content
Assignee: Justin Sherrill <jsherril>
Status: CLOSED ERRATA
QA Contact: Vladimír Sedmík <vsedmik>
Severity: high
Priority: high
Version: 6.7.0
CC: ahumbe, arahaman, avnkumar, dhjoshi, dsynk, ehelms, fperalta, hyu, iballou, ikaur, jhutar, joboyer, jsherril, ktordeur, rbertolj, saydas, smajumda, wclark
Target Milestone: 6.8.0
Keywords: PrioBumpGSS, Triaged
Target Release: Unused
Hardware: All
OS: All
Fixed In Version: rubygem-katello-3.16.0-0.16.rc4.1
Doc Type: If docs needed, set a value
Clones: 1857359 (view as bug list)
Last Closed: 2020-10-27 13:03:46 UTC
Type: Bug
Attachments: HOTFIX RPM for Satellite 6.7.1

Description Waldirio M Pinheiro 2020-07-01 21:35:46 UTC
Description of problem:
After upgrading to Satellite 6.7, we are seeing many issues where syncs take a long time to finish and Dynflow consumes a large amount of memory.

Version-Release number of selected component (if applicable):
6.7

How reproducible:
100%

Steps to Reproduce:
1. Synchronize a large number of repositories to a Capsule
2. Keep triggering further Capsule syncs (see the CLI example below)
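
For reference, a Capsule content sync can be triggered from the CLI with hammer, assuming hammer is configured (list the real capsule IDs first rather than using the placeholder below):

# hammer capsule list
# hammer capsule content synchronize --id <capsule-id>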

Actual results:
Dynflow consumes excessive resources and the Capsule Sync task takes a very long time to complete.

Expected results:
Syncs complete quickly and without failures.

Additional info:

Comment 1 Justin Sherrill 2020-07-02 01:20:27 UTC
Created redmine issue https://projects.theforeman.org/issues/30286 from this bug

Comment 2 Francisco Peralta 2020-07-06 07:52:26 UTC
Dear Team,
is there actually a workaround available for this issue?

My customer is also facing it and would like to know the ETA for a (hot)fix.

Thanks in advance,
Cisco.

Comment 7 Bryan Kearney 2020-07-06 20:02:55 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/30286 has been resolved.

Comment 9 wclark 2020-07-07 13:22:33 UTC
Created attachment 1700153 [details]
HOTFIX RPM for Satellite 6.7.1

Comment 10 wclark 2020-07-07 13:27:45 UTC
HOTFIX is attached. Please find installation instructions below:

1. Take a backup or snapshot of Satellite server

2. Download the Hotfix RPM and copy it to Satellite server

3. # yum install tfm-rubygem-katello-3.14.0.21-6.HOTFIXRHBZ1830403RHBZ1789911RHBZ1853076.el7sat.noarch.rpm --disableplugin=foreman-protector

4. # systemctl restart httpd dynflow
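
To confirm the hotfix package is in place after the restart (a quick sanity check, using the same service names as in step 4):

# rpm -q tfm-rubygem-katello
# systemctl status httpd dynflow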

By default, the Hotfix configures a batch size of 25 for Pulp sub-tasks during Capsule sync. This reduces the amount of polling between Dynflow and Pulp, lowering the load on both services, since neither needs to track or communicate with the other about thousands of individual sub-tasks at once.

The batch size is also configurable, so you may find a more optimal value for your deployment. To change it, navigate to Administer --> Settings --> Content and modify the parameter labeled "Batch size to sync repositories in."
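
The same setting can also be changed from the CLI, assuming hammer is configured (the value 50 below is only an example, not a recommendation):

# hammer settings set --name foreman_proxy_batch_size --value 50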

Comment 13 Jan Hutař 2020-09-08 06:03:58 UTC
I'm very sorry Vláďo, I was not able to work on this :-/

Comment 14 Vladimír Sedmík 2020-09-14 10:51:03 UTC
To verify this BZ I compared two setups:
1) Satellite + Capsule 6.7.0 snap 20
2) Satellite + Capsule 6.8.0 snap 14

In each setup, 6 repos (RHEL7Server, RHEL7Server-Optional, RHSCL for RHEL7, RHEL8-BaseOS, RHEL8-AppStream, test_simple_errata) were published into 40 content views each (240 content views in total) and synchronized (Complete Sync) from the Satellite to the Capsule. In case 2) I used the default batch size setting 'foreman_proxy_batch_size'=25. Four hosts were registered and unregistered through the Capsule.
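
For reference, one possible way to check the batch size and time a complete sync from the CLI (not necessarily the exact method used for the measurements below; the capsule ID is a placeholder):

# hammer settings list | grep foreman_proxy_batch_size
# time hammer capsule content synchronize --id <capsule-id>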

Results:
-------------------------------------------------------------------
Metric                                    6.7.0-20      6.8.0-14
-------------------------------------------------------------------
Overall sync time [hh:mm:ss]              28:44:45      26:54:33
Host registration time [s]                13            11
Host unregistration time [s]              2             2
Average errata enumeration time [s]       163           19.5
Average CPU load during sync              1.88          1.89
Median CPU load during sync               0.87          0.22
REX command run time (hostnamectl) [s]    27-44         4-8
-------------------------------------------------------------------

Conclusion: We can see a huge improvement in the errata enumeration time for newly registered hosts (the workaround from BZ#1771921 is needed), and REX times also improved significantly. Overall sync time improved slightly (by 1h50m) while the average CPU load remained almost the same. Median CPU load was lower on 6.8, as the sync showed higher peaks separated by longer idle valleys.

I haven't noticed any large or fast-growing log files during or after the sync. The size of /var/log/foreman was 3.5 MB and the whole /var/log directory occupied ~1 GB on both instances after the sync.

Comment 17 errata-xmlrpc 2020-10-27 13:03:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.8 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366