1867258 – After upgrading to 6.7 and promoting content, Capsule sync is extremely slow

Red Hat Satellite engineering is moving the tracking of its product development work on Satellite to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "Satellite project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs will be migrated starting at the end of May. If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "Satellite project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/SAT-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1867258 - After upgrading to 6.7 and promoting content, Capsule sync is extremely slow

Summary: After upgrading to 6.7 and promoting content, Capsule sync is extremely slow

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Capsule - Content
Sub Component:
Version:	6.7.0
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	6.7.4
Assignee:	Justin Sherrill
QA Contact:	Vladimír Sedmík
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-08-07 21:47 UTC by Danny Synk
Modified:	2024-12-20 19:11 UTC (History)
CC List:	13 users (show)
Fixed In Version:	tfm-rubygem-katello-3.14.0.30-1
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1874175 (view as bug list)
Environment:
Last Closed:	2020-09-30 13:12:11 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Foreman Issue Tracker	30654	Normal	Closed	batch smart proxy content syncing causing slower syncs	2021-02-11 21:48:01 UTC
Red Hat Knowledge Base (Solution)	5332321	None	None	None	2020-08-20 01:01:33 UTC
Red Hat Product Errata	RHSA-2020:4127	None	None	None	2020-09-30 13:12:21 UTC

Description Danny Synk 2020-08-07 21:47:15 UTC

Description of problem:

This is a variation on the recent issue with slow speeds during large Capsule syncs in Satellite/Capsule 6.7. In this instance, after upgrading from 6.6 to 6.7.2, a customer initiated an optimized Capsule sync from the Satellite webUI without changing the content, and the sync proceeded fairly quickly. However, after promoting new content view versions, a subsequent Capsule sync is proceeding extremely slowly; for example, one active sync task has been running for more than 20 hours, with Pulp workers still actively using CPU. The sync was initiated as an Optimized Sync from the Satellite webUI.


Version-Release number of selected component (if applicable):
6.7.2


Additional info:

The download policy on this Capsule was changed from Immediate to On Demand approximately one month ago.

Comment 4 Justin Sherrill 2020-08-17 19:08:46 UTC

Note that we think that this was caused by https://bugzilla.redhat.com/show_bug.cgi?id=1857359  going into 6.7.2.  The tradeoff here is speed of syncs versus the load on dynflow/foreman-tasks on the satellite.  You can achieve pre-6.7.2 syncs by adjusting the setting 'foreman_proxy_content_batch_size'  on the content tab within Administer > Settings.  If you adjust this value extremely large (10,000), this will cause capsule syncs to behave more like pre-6.7.2.  

I would advise moving the default to 100 or 200 before taking it up to a very large value though.  At very large batches (or pre-6.7.2), the amount of work can cause dynflow to not be able to process tasks fast enough due to the volume of tasks it is processing.  

To address the issue, we will do two things, change the default value to 100, and sort repositories by their rpm counts.  This will cause all the small repos to be synced first, hopefully grouping together the slower repos to sync.

Comment 7 Pavel Moravec 2020-08-18 20:42:35 UTC

I found something that applies to three different customers already, those linked via cases now. The cause sounds somewhere in pulp (or the way how katello manipulates with it).

I was checking the longest dynflow steps of a Capsule sync, via bash-fu:

f=6d2d814f-3e5d-4a29-9796-eca51fe97a70.html  # some html task export of one Caps sync

( grep -e "Real time:" -e ^repo_pulp_id: -e added_count: -e removed_count: -e updated_count: $f | sed "s/<p><b>Real//g" | cut -d: -f2 | sed 's/<\/b>/X/g' | sed 's/s<\/p>//g' | tr '\n' ' ' | tr 'X' '\n'; echo ) | sort -n > ${f}.stats

Then the *stats file contains lines like:

5838.44  REPONAME  0  -157  4453 

where 5838.44 is duration of the dynflow step, REPONAME is the pulp repo ID of the sync, 0 = added_count, -157 = removed_count and 4453 = updated_count from the pulp sync task details.


My observations:
- I noticed this phenomenon both for pulp repos of a CV/LE, but also for "base" / "root" repos (those in Library LE, synced from CDN or 3rd party - when Caps is assigned to Library)

- sometimes, removed_count is negative
  - how a _count_ can be negative? due to its setting at https://github.com/pulp/pulp/blob/2-master/server/pulp/server/controllers/repository.py#L1026
  - but that sounds weird, as usually added_count=0 here
  - so it means that after_count < initial_unit_count
  - how this happens? repository shrinks on its content??? I have no idea..

- longest repo synces often have highest updated_count (maybe since longest sync is for biggest repos that have somehow updated all their units, and all they appear in updated_count?)

- sometimes, two consecutive Caps synces - _without_ a change of relevant Sat repo - have _identical_ updated_count. Like repo sync detected some change, but it wasnt able to sync properly..?? or maybe first sync modified units in one way and the second one in reverse manner? So far, 3rd sync of such a repo always reports updated_count: 0, which supports the "alter something back and forward" theory

- in consequtive Caps synces that each takes ages, usually different repos take the longest time to sync. As in 3rd sync, some "slow repo" is "healed", but another one (e.g. newly CV/LE promoted) gets "broken"

At total, it seems like some change affects Satellite's repos, at different pace / time for different repos, and the change is propagated *twice* to Capsules - and due to the different pace per different repos, consequtive Caps synces are *all* slow, as different repos synces are slow in each such sync


I am building a reproducer on customer data from 02690401 (mongo dumps from Sat+Caps and one published repo dir), hopefully tomorrow I will have it ready.

Comment 8 Bryan Kearney 2020-08-19 20:00:49 UTC

Upstream bug assigned to jsherril

Comment 9 Bryan Kearney 2020-08-19 20:00:51 UTC

Upstream bug assigned to jsherril

Comment 11 Bryan Kearney 2020-08-24 20:00:47 UTC

Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/30654 has been resolved.

Comment 23 errata-xmlrpc 2020-09-30 13:12:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.7.4 Async Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4127

Comment 24 Red Hat Bugzilla 2023-09-14 06:05:29 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

Note You need to log in before you can comment on or make changes to this bug.