Bug 1448777 - Content sync/promotion fails to all capsules if one capsule is down
Summary: Content sync/promotion fails to all capsules if one capsule is down
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Capsule - Content
Version: 6.2.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: 6.2
Assignee: Brad Buckingham
QA Contact: Peter Ondrejka
URL:
Whiteboard:
Keywords: PrioBumpField, PrioBumpGSS, Triaged
Duplicates: 1449862
Depends On:
Blocks:
 
Reported: 2017-05-08 04:38 UTC by cpatters
Modified: 2018-12-06 20:45 UTC
CC: 15 users

Clone Of:
Clones: 1480353
Last Closed: 2017-09-25 18:59:44 UTC


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2803 normal SHIPPED_LIVE Satellite 6.2.12 bug fix update 2017-10-12 19:22:49 UTC
Foreman Issue Tracker 19659 None None None 2017-05-24 20:38 UTC

Description cpatters 2017-05-08 04:38:00 UTC
Description of problem:
If a content sync is underway, and any one capsule is down or becomes unavailable during this stage, the synchronisation to ALL capsules fails.  

The problem seems to be that the sync/publish/promote actions on Satellite perform the Satellite component of the task first and then, in the last 5-10% of the job plan, start the capsule sync jobs. At this point, if one capsule does not respond, the entire task fails, so none of the capsules begin syncing.
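The all-or-nothing behaviour described above can be sketched as a toy model in plain Ruby. All names here (the `CapsuleUnreachable` error, the planning methods) are illustrative stand-ins, not Katello's actual API; the point is only that planning every capsule sync inside one parent task lets a single unreachable capsule abort the whole plan:

```ruby
# Hypothetical model of the failure mode: one shared plan for all capsules.
class CapsuleUnreachable < StandardError; end

# Pretend to plan a capsule sync; raises if the capsule is unreachable.
def plan_capsule_sync(name, down)
  raise CapsuleUnreachable, name if down.include?(name)
  "sync planned for #{name}"
end

def plan_all_or_nothing(capsules, down: [])
  # The first unreachable capsule fails the parent task, so the
  # healthy capsules never get a sync task at all.
  capsules.map { |name| plan_capsule_sync(name, down) }
rescue CapsuleUnreachable => e
  "task failed: #{e.message} unreachable; no capsule sync started"
end

puts plan_all_or_nothing(%w[capsule1 capsule2 capsule3], down: %w[capsule2])
```

With `capsule2` down, the call returns a failure for the whole task even though `capsule1` and `capsule3` are healthy, matching the behaviour reported here.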

At the time of the task failure, the Satellite components have completed, so from the Satellite view there is nothing to do. Package search shows new packages in the content view but they are not available on the capsules.

Version-Release number of selected component (if applicable):

Satellite 6.2.x with multiple capsules.


How reproducible:

Any time a capsule is unavailable and a sync/publish/promote is triggered. This has happened a few times as network connectivity to a remote capsule has been lost whilst a sync is in progress.

Also, if a capsule is down due to a known condition (e.g. a remote-site power outage), a sync to the remaining capsules cannot be started because of the one that is down. The only workaround is to remove the capsule from the Satellite configuration, sync the rest, add the 'failed' capsule back, and manually re-sync it once it has reconnected to Satellite. This is highly undesirable in a large enterprise environment.

Steps to Reproduce:
Environment: Satellite 6.2.x with at least 2 capsules, set up to receive the Library CV.
- Start a repo sync and, whilst the sync is in progress, disconnect one capsule from the network.
- Monitor the Satellite tasks - when the sync task reaches roughly 95%, new capsule sync tasks should be spawned in the planning state. At this point the main sync task will fail because it is unable to plan the capsule sync on the failed capsule.
- Note that the capsule sync to the GOOD capsule is also not performed.

Comment 2 Brad Buckingham 2017-05-24 20:37:12 UTC
*** Bug 1449862 has been marked as a duplicate of this bug. ***

Comment 3 Brad Buckingham 2017-05-24 20:38:35 UTC
Created redmine issue http://projects.theforeman.org/issues/19659 from this bug

Comment 4 Pavel Moravec 2017-06-19 07:38:43 UTC
Couldn't a relatively easy fix be:

in /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.0.124/app/lib/actions/katello/content_view/promote_to_environment.rb :

replace:

        def run
          environment = ::Katello::KTEnvironment.find(input[:environment_id])
          if ::Katello::CapsuleContent.sync_needed?(environment)
            ForemanTasks.async_task(ContentView::CapsuleGenerateAndSync,
                                    ::Katello::ContentView.find(input[:content_view_id]),
                                    environment)
          end
        rescue ::Katello::Errors::CapsuleCannotBeReached # skip any capsules that cannot be connected to
        end

by something like:


        def run
          environment = ::Katello::KTEnvironment.find(input[:environment_id])
          content_view = ::Katello::ContentView.find(input[:content_view_id])
          ::Katello::CapsuleContent.with_environment(environment).each do |capsule_content|
            ForemanTasks.async_task(Actions::Katello::CapsuleContent::Sync, capsule_content, :content_view => content_view, :environment => environment)
          end
        end

I.e., invoke the underlying task independently for each capsule?


But comparing the old behaviour with the new one, I am missing the dynflow steps

Actions::Pulp::Repository::UpdateImporter
Actions::Pulp::Repository::RefreshDistributor (twice)

that were present in the original Actions::Katello::ContentView::CapsuleGenerateAndSync but are missing in the new one (why?)
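The fault-isolation idea in this comment (invoke the sync task independently per capsule, so one unreachable capsule no longer aborts the rest) can be sketched in plain Ruby. Again, the error class and sync methods are hypothetical stand-ins for Katello's real classes; this only demonstrates the per-capsule rescue pattern:

```ruby
# Hypothetical model of the proposed fix: rescue per capsule, not per batch.
class CapsuleUnreachable < StandardError; end

# Pretend sync; raises for any capsule listed as down.
def sync_capsule(name, down:)
  raise CapsuleUnreachable, name if down.include?(name)
  "#{name}: synced"
end

def sync_all(capsules, down: [])
  capsules.map do |name|
    begin
      sync_capsule(name, down: down)
    rescue CapsuleUnreachable => e
      # Skip only the unreachable capsule; the others keep syncing.
      "#{e.message}: skipped (unreachable)"
    end
  end
end

puts sync_all(%w[capsule1 capsule2 capsule3], down: %w[capsule2])
```

Here `capsule1` and `capsule3` complete their syncs and only `capsule2` is skipped, which is the behaviour Comment 14 later verifies in 6.2.12.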

Comment 8 cpatters 2017-07-13 02:13:59 UTC
Is it possible to get an update on this bug, please? I just want to keep the customer informed on its progress.

Thanks

Ché

Comment 9 pm-sat@redhat.com 2017-07-18 04:00:51 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/19659 has been resolved.

Comment 10 Brad Buckingham 2017-07-31 16:44:08 UTC
Hello Ché,

I apologize, I did not intend to clear needinfo without responding. The fix for this one is currently targeted for Satellite 6.2.12; however, that is pending QE verification.

Comment 11 pm-sat@redhat.com 2017-08-03 22:00:51 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/19659 has been resolved.

Comment 14 Peter Ondrejka 2017-08-29 16:06:25 UTC
Verified in satellite-6.2.12-1.0.el7sat.noarch with three capsules. If one of the capsules becomes unavailable during content sync, the sync on the other two is not affected.

Comment 17 errata-xmlrpc 2017-09-25 18:59:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2803

