Bug 1867258
Summary: After upgrading to 6.7 and promoting content, Capsule sync is extremely slow
Product: Red Hat Satellite
Component: Capsule - Content
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: 6.7.0
Target Milestone: 6.7.4
Target Release: Unused
Hardware: Unspecified
OS: Linux
Fixed In Version: tfm-rubygem-katello-3.14.0.30-1
Doc Type: If docs needed, set a value
Reporter: Danny Synk <dsynk>
Assignee: Justin Sherrill <jsherril>
QA Contact: Vladimír Sedmík <vsedmik>
CC: aeladawy, avnkumar, dgross, hyu, jjeffers, jsherril, ktordeur, myoder, pcreech, pmoravec, rbertolj, wclark, wpinheir
Keywords: Triaged
Cloned to: 1874175 (view as bug list)
Last Closed: 2020-09-30 13:12:11 UTC
Type: Bug
Description
Danny Synk
2020-08-07 21:47:15 UTC
Note that we think this was caused by https://bugzilla.redhat.com/show_bug.cgi?id=1857359 going into 6.7.2. The tradeoff here is sync speed versus the load on Dynflow/foreman-tasks on the Satellite. You can restore pre-6.7.2 sync behavior by adjusting the 'foreman_proxy_content_batch_size' setting on the Content tab under Administer > Settings. If you set this value extremely large (10,000), Capsule syncs will behave much as they did before 6.7.2. I would advise raising the default to 100 or 200 before taking it up to a very large value, though. With very large batches (or pre-6.7.2 behavior), the volume of tasks can be so high that Dynflow cannot process them fast enough.

To address the issue, we will do two things: change the default value to 100, and sort repositories by their rpm counts. This syncs all the small repos first, hopefully grouping the slower repos together.

I found something that applies to three different customers already, now linked via cases. The cause sounds like it is somewhere in Pulp (or in the way Katello drives it). I was checking the longest Dynflow steps of a Capsule sync via some bash-fu:

    f=6d2d814f-3e5d-4a29-9796-eca51fe97a70.html   # some HTML task export of one Caps sync
    ( grep -e "Real time:" -e ^repo_pulp_id: -e added_count: -e removed_count: -e updated_count: $f \
        | sed "s/<p><b>Real//g" | cut -d: -f2 | sed 's/<\/b>/X/g' | sed 's/s<\/p>//g' \
        | tr '\n' ' ' | tr 'X' '\n'; echo ) | sort -n > ${f}.stats

The *.stats file then contains lines like:

    5838.44 REPONAME 0 -157 4453

where 5838.44 is the duration of the Dynflow step, REPONAME is the Pulp repo ID of the sync, 0 = added_count, -157 = removed_count, and 4453 = updated_count from the Pulp sync task details.
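The same per-repo summary can be sketched in Python, working from the ``*.stats`` lines produced by the pipeline above (the ``SyncStep``/``parse_stats``/``slowest`` names are mine, not Katello's; the five-field line format is taken from the example line shown above):

```python
from typing import List, NamedTuple


class SyncStep(NamedTuple):
    duration: float   # Dynflow step duration in seconds
    repo_pulp_id: str
    added: int
    removed: int      # can be negative, as observed in this bug
    updated: int


def parse_stats(lines: List[str]) -> List[SyncStep]:
    """Parse '<duration> <repo_pulp_id> <added> <removed> <updated>' lines."""
    steps = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip malformed or empty lines
        dur, repo, added, removed, updated = parts
        steps.append(SyncStep(float(dur), repo, int(added), int(removed), int(updated)))
    return steps


def slowest(steps: List[SyncStep], n: int = 3) -> List[SyncStep]:
    """Return the n longest Dynflow steps, longest first."""
    return sorted(steps, key=lambda s: s.duration, reverse=True)[:n]


if __name__ == "__main__":
    sample = ["12.3 small-repo 5 0 0", "5838.44 REPONAME 0 -157 4453"]
    for s in slowest(parse_stats(sample)):
        print(f"{s.duration:>10.2f}s  {s.repo_pulp_id}  updated={s.updated}")
```

This makes it easy to correlate the slow steps with their added/removed/updated counts across several task exports.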
My observations:

- I noticed this phenomenon both for Pulp repos of a CV/LE and for "base"/"root" repos (those in the Library LE, synced from the CDN or a 3rd party, when the Capsule is assigned to Library).
- Sometimes removed_count is negative. How can a _count_ be negative? It comes from how it is set at https://github.com/pulp/pulp/blob/2-master/server/pulp/server/controllers/repository.py#L1026, but that sounds weird, as usually added_count=0 here, so it means that after_count < initial_unit_count. How does that happen? A repository shrinking its own content??? I have no idea.
- The longest repo syncs often have the highest updated_count (maybe because the longest syncs are for the biggest repos, which have somehow had all their units updated, so they all appear in updated_count?).
- Sometimes two consecutive Capsule syncs, _without_ a change to the relevant Satellite repo, have _identical_ updated_count. As if the repo sync detected some change but wasn't able to sync it properly..?? Or maybe the first sync modified units one way and the second reversed it? So far, the 3rd sync of such a repo always reports updated_count: 0, which supports the "alter something back and forth" theory.
- In consecutive Capsule syncs that each take ages, usually different repos take the longest time to sync. As in the 3rd sync, some "slow repo" is "healed", but another one (e.g. newly CV/LE promoted) gets "broken".

In total, it seems like some change affects Satellite's repos at a different pace/time for different repos, and the change is propagated *twice* to Capsules. Due to the different pace per repo, consecutive Capsule syncs are *all* slow, since different repo syncs are slow in each such sync.

I am building a reproducer on customer data from case 02690401 (mongo dumps from Sat+Caps and one published repo dir); hopefully tomorrow I will have it ready.
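The mitigation described in the first comment (default batch size of 100, repositories sorted by rpm count so the small repos sync first) can be sketched as follows. This is an illustration of the scheduling idea only, not Katello's actual code; the function and field names are hypothetical:

```python
from typing import Iterator, List, Tuple


def batch_repos(repos: List[Tuple[str, int]], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield repo IDs in batches, smallest repo (by rpm count) first.

    repos: list of (repo_id, rpm_count) tuples.
    Sorting by rpm count groups the slow, large repos together at the end,
    so the many small repos finish quickly in the early batches.
    """
    ordered = sorted(repos, key=lambda r: r[1])  # smallest repos first
    for i in range(0, len(ordered), batch_size):
        yield [repo_id for repo_id, _ in ordered[i:i + batch_size]]


if __name__ == "__main__":
    repos = [("rhel7-base", 30000), ("epel7", 13000), ("tools", 120), ("iso", 4)]
    for batch in batch_repos(repos, batch_size=2):
        print(batch)
```

On a live Satellite, the corresponding knob is the 'foreman_proxy_content_batch_size' setting on the Content tab under Administer > Settings, as noted above; the smaller batches keep the Dynflow task volume manageable at the cost of some sync throughput.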
Upstream bug assigned to jsherril

Moving this bug to POST for triage into Satellite, since the upstream issue https://projects.theforeman.org/issues/30654 has been resolved.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.7.4 Async Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4127

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.