Description of problem:
Incremental update task never completes if one of the content hosts does not complete installation.

Version-Release number of selected component (if applicable):
Satellite-6.1.0-RHEL-6-20150303.0

How reproducible:
Always

Steps to Reproduce:
1. Create multiple content views, publish, promote
2. Register/subscribe multiple content hosts to different content views
3. While the incremental update is in progress, make one content host non-responsive to Satellite by stopping the katello-agent or by some other means
4. Attempt to apply an applicable errata on different content hosts pertaining to different content views

Actual results:
The main 'Incremental Update' task waits forever because the child task corresponding to the non-responsive content host never finishes.

Expected results:
- Automatically time out the content hosts if they are non-responsive, so that the main 'Incremental Update' task can complete
- The Output section of the 'Incremental Update' task should show the errored content hosts so they can be manually fixed.

Additional info:
Severity = High, since this is a very likely customer scenario when there are hundreds of content hosts.
Since this issue was entered in Red Hat Bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release.
Let's try to include the list of hosts that failed, so we can tell what needs to be resolved to complete the task. A task timeout is way beyond scope, as it may have unintended consequences, and we should continue with our strategy of keeping all tasks pending until user intervention. IMO, this is not a blocker; removing flag.
Created redmine issue http://projects.theforeman.org/issues/10489 from this bug
Two PRs for this: https://github.com/Dynflow/dynflow/pull/151 https://github.com/Katello/katello/pull/5225
commit 956ac564de8fb668da4a0431615ef337c4f43bb3
Author: Justin Sherrill <jsherril>
Date:   Fri Mar 13 14:21:56 2015 -0400

    fixes #10489 - adding two timeouts for content tasks

    The first will fail the task if the client does not pick up the task
    this likely means that either the client is not running, goferd is not
    running, or goferd is having some sort of communication issue.
    (Default 20 seconds)

    The second will fail if the client has picked up the task but has not
    completed it after some time. This could happen if the link is really
    really slow, or the client dies in the middle of a content action.
    (Default to 60 minutes)

    (cherry picked from commit dc480f7b7d55506ab492b0776ed1f84d0a6e662e)

commit 9df19997f5c250b8388627aad6390430b364dd4d
Author: Ivan Nečas <inecas>
Date:   Thu May 14 16:57:50 2015 +0200

    Respect the time flow in managed clocks in tests

    Before this patch, the clock in tests were not respecting the time
    values, so some event to happen in far future would be triggered before
    the shot-term event.

    Also, when two events were pending in clock, when while processing the
    first event, some other event occurred, we were ignoring it. Now, the
    new event gets properly scheduled.

    (cherry picked from commit 5829fea479aa8f5d17b57f14c499640418a1dbc4)

commit faacbd4ae2f8def3632a79980f22835bfd90c6f8
Author: Justin Sherrill <jsherril>
Date:   Tue May 12 17:33:40 2015 -0400

    adding simple polling timeout mechanism

    This adds the ability for an action to call schedule_timeout(50), to
    schedule when the polling should stop and the task be marked as failed.
    In this simple implementation the above would cause the task to fail
    after 50 seconds. Actions could override process_timeout to do more
    advanced things

    (cherry picked from commit e26040b13646bf500b415dda26ccd7a31b06e2bf)
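For readers unfamiliar with the polling timeout described in the last commit message, here is a minimal, self-contained sketch of the idea. It is not Dynflow's actual code: the `PollingTask` class, its `poll` method and the injected `clock` are hypothetical names made up for illustration. Only `schedule_timeout` and `process_timeout` are named in the commit message, and their real signatures and behaviour may differ.

```ruby
# Hypothetical stand-in for a polling action; NOT the real Dynflow classes.
class PollingTask
  attr_reader :state, :error

  # clock: any callable returning the current time in seconds
  # (injected so the example needs no real waiting).
  def initialize(clock:)
    @clock = clock
    @state = :running
    @error = nil
    @deadline = nil
  end

  # Record the point in time after which polling should give up.
  def schedule_timeout(seconds)
    @deadline = @clock.call + seconds
  end

  # Default reaction to a timeout: mark the task as failed.
  # A subclass could override this to do something more advanced.
  def process_timeout
    @state = :failed
    @error = "Timeout exceeded."
  end

  # One polling iteration; `done` tells whether the external task finished.
  def poll(done)
    return @state unless @state == :running

    if done
      @state = :success
    elsif @deadline && @clock.call >= @deadline
      process_timeout
    end
    @state
  end
end

# Usage: a fake clock lets us step through time by hand.
now = 0
task = PollingTask.new(clock: -> { now })
task.schedule_timeout(50)

now = 49
puts task.poll(false)  # still running, deadline not reached

now = 50
puts task.poll(false)  # deadline reached, task is marked as failed
```

Injecting the clock mirrors the concern of the second commit (tests must respect the order of scheduled times), but that parallel is only an illustration, not a description of how the Dynflow test clock is implemented.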
Verification blocked by BZ 1223963
I'm 90% sure the fix for this is causing all package installs to fail with this error:

Exception:
NameError: undefined local variable or method `suspended_action' for #<Actions::Pulp::Consumer::ContentInstall:0x0000000e8ebcb0>

Backtrace:
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.7.6/lib/dynflow/action/timeouts.rb:10:in `schedule_timeout'
/opt/rh/ruby193/root/usr/share/gems/gems/katello-2.2.0.43/app/lib/actions/pulp/consumer/content_install.rb:31:in `invoke_external_task'
It's failing because this commit is missing: https://github.com/Dynflow/dynflow/commit/bfba7f30c91aa347628404596435863e4aa164b9
This was fixed in Snap 6.
Verified as per the test steps below.

Version Tested: Satellite 6.1 GA Snap 6

Test Scenario 1: (what if goferd is not reachable in any of the content hosts?)

1. Create two content views
   - cv1 with entire rhel 7 contents
   - cv2 with rhel 7 contents (minus errata after Jan 1, 2015)
2. Publish/promote content views to respective environments
3. Register/Subscribe two content hosts chost1 and chost2 to cv1 and cv2 respectively
4. Stop goferd service in chost1
5. Now apply an errata which is installable in chost1 and applicable in chost2
6. Note the following:
   - The 'Incremental Update' task is errored (this is not expected and will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1226410)
   - 'Install Applicable Errata' for chost1 failed (as expected).
   - The Task -> 'Task' tab does not print Output; this will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1198340 (for 6.2). Use the following workaround to see the error and the content host information.
     - Go to the Task -> Errors -> "RuntimeError: Host did not respond within 20 seconds. Is katello-agent installed and goferd running on the Host?"
     - Go to the Task -> Errors -> "consumer_uuid"=>"91786640-074e-4ff0-8b9b-5c2f682284df". This shows the content host that errored, so the admin can check this content host manually.
   - 'Install Applicable Errata' for chost2 passed.
7. (Say the content host is back again) Start goferd in chost1
   - Errata installation completed. But the prior failed 'Install Applicable Errata' task will still show State=Stopped; Result: warning (this is expected)

Test Scenario 2: (what if goferd is active in the content host but not able to finish installation in expected time?)

1. Perform Steps 1 through 3 from Test Scenario 1
2. Make chost1 delay the installation by introducing a sleep() in katelloplugin.py or by shutting down the content host VM after it has received the content install trigger from Satellite
3. Now apply an errata which is installable in chost1 and applicable in chost2
4. Note the following:
   - The 'Incremental Update' task is errored (this is not expected and will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1226410)
   - 'Install Applicable Errata' for chost1 failed (note the error message is different).
   - 'Install Applicable Errata' task for chost1 will show State=Stopped; Result: warning (this is expected)
   - Go to the Task -> Errors -> "RuntimeError: Host did not finish content action in 300 seconds"
   - 'Install Applicable Errata' for chost2 passed.

What's more: you can configure the timeouts (in seconds) for both these processes. Go to Administer -> Settings -> Katello tab:
- content_action_accept_timeout - defaults to 20 seconds
- content_action_finish_timeout - defaults to 3600 seconds
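To make the interplay of the two settings easier to follow, here is a rough Ruby sketch of the decision they imply. This is not the Katello implementation. The `check_content_action` helper and its arguments are hypothetical; only the two setting names, their defaults and the two error messages come from the comment above. In particular, measuring `elapsed` from the moment the action was dispatched is an assumption.

```ruby
# Setting names and defaults as listed under Administer -> Settings -> Katello.
DEFAULT_TIMEOUTS = {
  content_action_accept_timeout: 20,    # seconds for the host to pick up the task
  content_action_finish_timeout: 3600   # seconds for the host to finish the task
}.freeze

# Hypothetical helper, for illustration only.
# elapsed:  seconds since the content action was dispatched (assumption)
# accepted: whether the host has picked up the task
# finished: whether the host has reported completion
def check_content_action(elapsed:, accepted:, finished:, settings: DEFAULT_TIMEOUTS)
  return :finished if finished

  accept_limit = settings[:content_action_accept_timeout]
  finish_limit = settings[:content_action_finish_timeout]

  if !accepted && elapsed >= accept_limit
    raise "Host did not respond within #{accept_limit} seconds. " \
          "Is katello-agent installed and goferd running on the Host?"
  elsif accepted && elapsed >= finish_limit
    raise "Host did not finish content action in #{finish_limit} seconds"
  end

  :pending
end

# Usage: goferd stopped on the host, so the task is never accepted (Scenario 1).
begin
  check_content_action(elapsed: 25, accepted: false, finished: false)
rescue RuntimeError => e
  puts e.message
end
```

Passing a custom `settings` hash, for example with `content_action_finish_timeout: 300`, changes the number that appears in the second message in this sketch.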
This bug is slated to be released with Satellite 6.1.
This bug was fixed in version 6.1.1 of Satellite which was released on 12 August, 2015.