1198807 – Incremental update task never completes if one of the content hosts does not complete installation

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Content Management
Sub Component:
Version:	6.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	Unspecified
Assignee:	Justin Sherrill
QA Contact:	sthirugn@redhat.com
Docs Contact:
URL:	http://projects.theforeman.org/issues...
Whiteboard:
Depends On:	1223963
Blocks:	1130651

Reported:	2015-03-04 20:52 UTC by sthirugn@redhat.com
Modified:	2019-08-15 04:19 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-08-12 13:59:11 UTC
Target Upstream Version:
Embargoed:

Description sthirugn@redhat.com 2015-03-04 20:52:55 UTC

Description of problem:
Incremental update task never completes if one of the content hosts does not complete installation

Version-Release number of selected component (if applicable):
Satellite-6.1.0-RHEL-6-20150303.0

How reproducible:
Always

Steps to Reproduce:
1. Create multiple content views, publish, promote
2. Register/subscribe multiple content hosts to different content views
3. During the incremental update progress, make one content host non-responsive to Satellite by stopping the katello-agent or by some other means
4. Attempt to apply an applicable errata on different content hosts pertaining to different content views

Actual results:
The main 'Incremental Update' task waits forever because the child task corresponding to the non-responsive content host never finishes.

Expected results:
- Automatically time out content hosts that are non-responsive so that the main 'Incremental Update' task can complete
- The Output section of 'Incremental Update' task should show the errored content hosts so they can be manually fixed.
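
The expected results above can be sketched as follows. This is only an illustration under assumed data shapes: ChildResult, summarize, and the host names are hypothetical and are not taken from Katello or Dynflow.

```ruby
# Hypothetical data shape for one child task; not a Katello class.
ChildResult = Struct.new(:host, :status, :error)

# Summarize child results: the parent counts as complete once no child is
# still pending, and the summary lists each errored host with its message.
def summarize(children)
  pending = children.select { |c| c.status == :pending }
  errored = children.select { |c| c.status == :error }
  {
    complete: pending.empty?,
    errored_hosts: errored.map { |c| "#{c.host}: #{c.error}" }
  }
end

results = [
  ChildResult.new('host-a', :error, 'did not respond'),
  ChildResult.new('host-b', :success, nil)
]
summary = summarize(results)
puts summary[:errored_hosts]
```

The point of the sketch is that a timed-out child is reported as an error instead of staying pending, so the parent can finish and still name the hosts that need manual attention.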

Additional info:

Comment 1 sthirugn@redhat.com 2015-03-04 20:53:47 UTC

Severity = High since this is a very likely scenario for customers with hundreds of content hosts.

Comment 2 RHEL Program Management 2015-03-05 02:26:35 UTC

Since this issue was entered in Red Hat Bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 4 Mike McCune 2015-03-12 23:32:04 UTC

Let's try and include the list of hosts that failed so we can tell what we need to resolve to complete the task.

Task timeout is way beyond scope as that will have possible unintended consequences and we should continue with our strategy of keeping all tasks pending until user intervention.

IMO, this is not a blocker, removing flag.

Comment 5 Justin Sherrill 2015-05-12 21:37:33 UTC

Created redmine issue http://projects.theforeman.org/issues/10489 from this bug

Comment 6 Justin Sherrill 2015-05-14 19:59:39 UTC

Two PRs for this:

https://github.com/Dynflow/dynflow/pull/151
https://github.com/Katello/katello/pull/5225

Comment 7 Bryan Kearney 2015-05-19 15:58:20 UTC

commit 956ac564de8fb668da4a0431615ef337c4f43bb3
Author: Justin Sherrill <jsherril>
Date:   Fri Mar 13 14:21:56 2015 -0400

    fixes #10489 - adding two timeouts for content tasks
    
    The first will fail the task if the client does not pick up the task.
    This likely means that either the client is not running, goferd is not running,
    or goferd is having some sort of communication issue. (Default 20 seconds)
    
    The second will fail if the client has picked up the task but has not completed it
    after some time.  This could happen if the link is really really slow, or the client dies
    in the middle of a content action.  (Default to 60 minutes)
    
    (cherry picked from commit dc480f7b7d55506ab492b0776ed1f84d0a6e662e)
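
A minimal sketch of the two timeouts described in the commit message above. It is not the Katello implementation: the function name, the state symbols, and the choice to measure the second timeout from the moment the client accepts the task are all assumptions made for illustration. Only the two default values come from the commit message.

```ruby
# Hypothetical sketch; none of these names come from Katello or Dynflow.
ACCEPT_TIMEOUT = 20        # seconds; default quoted in the commit message
FINISH_TIMEOUT = 60 * 60   # seconds; "60 minutes" in the commit message

# Classify a content action. accepted_at is nil until the client picks the
# task up; finished_at is nil until the client reports completion.
# All times are plain seconds on the same clock.
# Assumption: the finish timeout is measured from accepted_at.
def content_action_state(now:, dispatched_at:, accepted_at: nil, finished_at: nil,
                         accept_timeout: ACCEPT_TIMEOUT, finish_timeout: FINISH_TIMEOUT)
  return :finished if finished_at
  if accepted_at.nil?
    return now - dispatched_at > accept_timeout ? :accept_timed_out : :waiting_for_accept
  end
  now - accepted_at > finish_timeout ? :finish_timed_out : :running
end

# A client that never picks up the task is failed after the first timeout.
puts content_action_state(now: 25, dispatched_at: 0)  # prints accept_timed_out
```

Keeping the two checks separate lets the error distinguish "the agent never answered" from "the agent started but did not finish".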

commit 9df19997f5c250b8388627aad6390430b364dd4d
Author: Ivan Nečas <inecas>
Date:   Thu May 14 16:57:50 2015 +0200

    Respect the time flow in managed clocks in tests
    
    Before this patch, the clock in tests was not respecting the
    time values, so an event scheduled for the far future could be triggered
    before a short-term event. Also, when two events were pending in the
    clock and another event occurred while the first was being processed,
    we were ignoring it. Now, the new event gets properly
    scheduled.
    
    (cherry picked from commit 5829fea479aa8f5d17b57f14c499640418a1dbc4)

commit faacbd4ae2f8def3632a79980f22835bfd90c6f8
Author: Justin Sherrill <jsherril>
Date:   Tue May 12 17:33:40 2015 -0400

    adding simple polling timeout mechanism
    
    This adds the ability for an action to call schedule_timeout(50), to
    schedule when the polling should stop and the task be marked as failed.
    In this simple implementation the above would cause the task to fail after 50
    seconds.  Actions could override process_timeout to do more advanced things
    
    (cherry picked from commit e26040b13646bf500b415dda26ccd7a31b06e2bf)
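
The polling timeout described in the last commit can be illustrated with a self-contained stand-in. This is not Dynflow code: the PollingAction class, the poll method, and the explicit now: argument are hypothetical, and only the names schedule_timeout and process_timeout are taken from the commit message.

```ruby
# Hypothetical stand-in for a polling action; this is not Dynflow code.
class PollingAction
  attr_reader :state

  def initialize
    @state = :polling
    @deadline = nil
  end

  # Record that polling should stop `seconds` after `now`.
  def schedule_timeout(seconds, now:)
    @deadline = now + seconds
  end

  # Called on every polling tick with the current time and whether the
  # external task has completed. Once the state leaves :polling it is final.
  def poll(now:, done:)
    return @state unless @state == :polling
    if done
      @state = :success
    elsif @deadline && now >= @deadline
      process_timeout
    end
    @state
  end

  # Default behaviour: mark the task as failed. Subclasses may override
  # this to do something more advanced.
  def process_timeout
    @state = :failed
  end
end

action = PollingAction.new
action.schedule_timeout(50, now: 0)
action.poll(now: 10, done: false)  # still :polling
action.poll(now: 50, done: false)  # deadline reached, so :failed
puts action.state
```

Passing the time in explicitly keeps the sketch deterministic, which is the same concern the managed-clock test fix above addresses.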

Comment 9 sthirugn@redhat.com 2015-05-21 19:43:28 UTC

Verification blocked by BZ 1223963

Comment 10 Mike McCune 2015-05-27 15:58:06 UTC

I'm 90% sure the fix for this is causing all package installs to fail with this error:

Exception:

NameError: undefined local variable or method `suspended_action' for #<Actions::Pulp::Consumer::ContentInstall:0x0000000e8ebcb0>

Backtrace:

/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.6/lib/dynflow/action/timeouts.rb:10:in `schedule_timeout'
/opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.43/app/lib/actions/pulp/consumer/content_install.rb:31:in `invoke_external_task'

Comment 11 Justin Sherrill 2015-05-27 17:26:04 UTC

It's failing because this commit is missing:

https://github.com/Dynflow/dynflow/commit/bfba7f30c91aa347628404596435863e4aa164b9

Comment 12 Justin Sherrill 2015-05-28 17:58:47 UTC

This was fixed in Snap 6.

Comment 13 sthirugn@redhat.com 2015-05-29 20:50:57 UTC

Verified as per the test steps below.

Version Tested: Satellite 6.1 GA Snap 6

Test Scenario 1: (what if goferd is not reachable in any of the content hosts?)
1. Create two content views
   - cv1 with entire rhel 7 contents
   - cv2 with rhel 7 contents (minus errata after Jan 1, 2015)
2. Publish/promote content views to respective environments
3. Register/Subscribe two content hosts chost1 and chost2 to cv1 and cv2 respectively
4. Stop goferd service in chost1
5. Now apply an errata which is installable in chost1 and applicable in chost2
6. Note the following:
- The 'Incremental Update' task is errored (this is not expected and will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1226410)
- 'Install Applicable Errata' for chost1 failed (as expected).
    - Go to the Task -> the 'Task' tab does not print Output, which will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1198340 (for 6.2).  Use the following workaround to see the error and the content host information.
    - Go to the Task -> Errors -> "RuntimeError: Host did not respond within 20 seconds. Is katello-agent installed and goferd running on the Host?"
    - Go to the Task -> Errors -> "consumer_uuid"=>"91786640-074e-4ff0-8b9b-5c2f682284df". This identifies the content host that errored, so an admin can check that host manually.
- 'Install Applicable Errata' for chost2 passed.
7. (Say the content host is back again) Start goferd on chost1 - Errata installation completed.  But the prior failed 'Install Applicable Errata' task will still show State=Stopped; Result: warning (this is expected)


Test Scenario 2: (what if goferd is active in the content host but not able to finish installation in expected time?)
1. Perform Steps 1 through 3 from Test Scenario 1 
2. Make chost1 delay the installation by introducing a sleep() in katelloplugin.py or by shutting down the content host vm after it received the content install trigger from satellite
3. Now apply an errata which is installable in chost1 and applicable in chost2
4. Note the following:
- The 'Incremental Update' task is errored (this is not expected and will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1226410)
- 'Install Applicable Errata' for chost1 failed. (note the error message is different)
    - 'Install Applicable Errata' task for chost1 will show State=Stopped; Result: warning (this is expected)
    - Go to the Task -> Errors -> "RuntimeError: Host did not finish content action in 300 seconds"
- 'Install Applicable Errata' for chost2 passed.

What's more:
You can configure the timeouts (in seconds) for both these processes:
Go to Administer -> Settings -> Katello tab:
content_action_accept_timeout - defaulted to 20 seconds
content_action_finish_timeout - defaulted to 3600 seconds

Comment 14 Bryan Kearney 2015-08-11 13:24:25 UTC

This bug is slated to be released with Satellite 6.1.

Comment 15 Bryan Kearney 2015-08-12 13:59:11 UTC

This bug was fixed in version 6.1.1 of Satellite which was released on 12 August, 2015.
