Bug 1867258

Summary: After upgrading to 6.7 and promoting content, Capsule sync is extremely slow
Product: Red Hat Satellite
Component: Capsule - Content
Version: 6.7.0
Hardware: Unspecified
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Danny Synk <dsynk>
Assignee: Justin Sherrill <jsherril>
QA Contact: Vladimír Sedmík <vsedmik>
CC: aeladawy, avnkumar, dgross, hyu, jjeffers, jsherril, ktordeur, myoder, pcreech, pmoravec, rbertolj, wclark, wpinheir
Keywords: Triaged
Target Milestone: 6.7.4
Target Release: Unused
Fixed In Version: tfm-rubygem-katello-3.14.0.30-1
Doc Type: If docs needed, set a value
Clone Of:
: 1874175 (view as bug list)
Type: Bug
Last Closed: 2020-09-30 13:12:11 UTC

Description Danny Synk 2020-08-07 21:47:15 UTC
Description of problem:

This is a variation on the recent issue with slow speeds during large Capsule syncs in Satellite/Capsule 6.7. In this instance, after upgrading from 6.6 to 6.7.2, a customer initiated an optimized Capsule sync from the Satellite webUI without changing the content, and the sync proceeded fairly quickly. However, after promoting new content view versions, a subsequent Capsule sync is proceeding extremely slowly; for example, one active sync task has been running for more than 20 hours, with Pulp workers still actively using CPU. The sync was initiated as an Optimized Sync from the Satellite webUI.
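
For reference, the same kind of sync can also be driven from the CLI; a minimal sketch using hammer, where the Capsule ID (2) is only a placeholder:

# trigger a content sync of the Capsule; --id 2 is a placeholder for the real Capsule ID
hammer capsule content synchronize --id 2

# check the Capsule's synchronization status afterwards
hammer capsule content synchronization-status --id 2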


Version-Release number of selected component (if applicable):
6.7.2


Additional info:

The download policy on this Capsule was changed from Immediate to On Demand approximately one month ago.

Comment 4 Justin Sherrill 2020-08-17 19:08:46 UTC
Note that we think this was caused by https://bugzilla.redhat.com/show_bug.cgi?id=1857359 going into 6.7.2. The tradeoff here is sync speed versus the load on dynflow/foreman-tasks on the Satellite. You can get back to pre-6.7.2 sync behavior by adjusting the setting 'foreman_proxy_content_batch_size' on the Content tab within Administer > Settings. Setting this value extremely high (e.g. 10,000) will make Capsule syncs behave much like they did before 6.7.2.

However, I would advise moving the default to 100 or 200 before taking it up to a very large value. With very large batches (or pre-6.7.2 behavior), the volume of work can overwhelm dynflow so that it cannot process tasks fast enough.
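
For example, the setting can also be changed from the CLI; a minimal sketch, where the value 200 is just a starting point per the advice above:

# raise the Capsule sync batch size; try 100-200 before going to very large values
hammer settings set --name foreman_proxy_content_batch_size --value 200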

To address the issue, we will do two things: change the default value to 100, and sort repositories by their RPM counts. This will cause all the small repos to be synced first, hopefully grouping the slower repos together.

Comment 7 Pavel Moravec 2020-08-18 20:42:35 UTC
I found something that already applies to three different customers, now linked via their cases. The cause appears to be somewhere in Pulp (or in the way Katello drives it).

I was checking the longest dynflow steps of a Capsule sync, via bash-fu:

f=6d2d814f-3e5d-4a29-9796-eca51fe97a70.html  # some HTML task export of one Capsule sync

( grep -e "Real time:" -e ^repo_pulp_id: -e added_count: -e removed_count: -e updated_count: $f \
    | sed "s/<p><b>Real//g" | cut -d: -f2 | sed 's/<\/b>/X/g' | sed 's/s<\/p>//g' \
    | tr '\n' ' ' | tr 'X' '\n'; echo ) | sort -n > ${f}.stats

The resulting ${f}.stats file then contains lines like:

5838.44  REPONAME  0  -157  4453 

where 5838.44 is the duration of the dynflow step in seconds, REPONAME is the Pulp repo ID being synced, 0 = added_count, -157 = removed_count, and 4453 = updated_count from the Pulp sync task details.
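
Since the file is sorted numerically by duration, the slowest steps end up at the bottom; for a quick look at the worst offenders:

# print the five slowest dynflow steps of this Capsule sync
tail -n 5 ${f}.stats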


My observations:
- I noticed this phenomenon both for Pulp repos of a CV/LE and for "base" / "root" repos (those in the Library LE, synced from the CDN or a 3rd party, when the Capsule is assigned to Library)

- sometimes, removed_count is negative
  - how can a _count_ be negative? because of how it is set at https://github.com/pulp/pulp/blob/2-master/server/pulp/server/controllers/repository.py#L1026
  - but that seems weird, as added_count is usually 0 here
  - so it means that after_count < initial_unit_count
  - how does this happen? does the repository shrink its own content??? I have no idea..

- the longest repo syncs often have the highest updated_count (maybe because the longest syncs are for the biggest repos, which have somehow updated all their units, so they all show up in updated_count?)

- sometimes, two consecutive Caps syncs - _without_ a change to the relevant Sat repo - have an _identical_ updated_count. As if the repo sync detected some change but wasn't able to sync it properly..?? Or maybe the first sync modified units in one way and the second one reversed it? So far, the 3rd sync of such a repo always reports updated_count: 0, which supports the "alter something back and forth" theory (see the comparison sketch after this list)

- in consecutive Caps syncs that each take ages, usually different repos take the longest time to sync. As if, by the 3rd sync, some "slow repo" is "healed", but another one (e.g. a newly promoted CV/LE) gets "broken"
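
To spot such repos across two consecutive Capsule syncs, a rough sketch that compares two of the *.stats files produced above (file names are placeholders) and prints repos whose updated_count is identical and non-zero in both:

# join on the repo name (field 2) and compare updated_count (last field from each file)
join -j 2 <(sort -k2,2 sync1.html.stats) <(sort -k2,2 sync2.html.stats) | awk '$5 == $9 && $5 != 0'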

In total, it seems like some change affects the Satellite's repos at a different pace / time for different repos, and that change is propagated *twice* to Capsules - and because of the different pace per repo, consecutive Caps syncs are *all* slow, since different repo syncs are slow in each such sync.


I am building a reproducer from the customer data in case 02690401 (mongo dumps from Sat + Caps and one published repo dir); hopefully I will have it ready tomorrow.

Comment 8 Bryan Kearney 2020-08-19 20:00:49 UTC
Upstream bug assigned to jsherril

Comment 11 Bryan Kearney 2020-08-24 20:00:47 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/30654 has been resolved.

Comment 23 errata-xmlrpc 2020-09-30 13:12:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.7.4 Async Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4127

Comment 24 Red Hat Bugzilla 2023-09-14 06:05:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days