Bug 1184630

Summary: Catch hung deployments with timeouts
Product: Red Hat OpenStack Reporter: Mike Burns <mburns>
Component: ruby193-rubygem-staypuftAssignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA QA Contact: Alexander Chuzhoy <sasha>
Severity: urgent Docs Contact:
Priority: urgent    
Version: unspecifiedCC: aberezin, ajeain, dcleal, dmacpher, jstransk, juwu, mburns, ohochman, rhos-maint, sasha, sclewis, sgordon, sputhenp, yeylon
Target Milestone: ga   
Target Release: Installer   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ruby193-rubygem-staypuft-0.5.15-1.el7ost Doc Type: Bug Fix
Doc Text:
Various bugs and external factors caused the deployment to hang (i.e. the deployment stopped but the UI showed the deployment was still in progress). This fix implements timeouts for parts of the deployment process where the execution waits for a condition to succeed. If a bug or an external factor causes the deployment to hang, the deployment now shows a failure status after the timeout expires.
Story Points: ---
Clone Of: 1182581 Environment:
Last Closed: 2015-02-09 15:20:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1173634, 1182581    
Bug Blocks: 1177026    

Description Mike Burns 2015-01-21 21:04:27 UTC
The goal here is to mitigate the impact of bug 1182581.  We need to have a timeout on the steps so that we can inform the user that the system may be hung.

+++ This bug was initially created as a clone of Bug #1182581 +++

This is the split of the second issue mentioned in but 1173634.  It has to do with a failure in postgres causing dynflow to hang forever.

Workaround is to manually restart foreman-tasks and resume the deployment in the UI.

+++ This bug was initially created as a clone of Bug #1173634 +++

rubygem-staypuft: Deployment runs without completing/failing for more than 17 hours.

Environment:
openstack-puppet-modules-2014.2.7-1.el7ost.noarch
rhel-osp-installer-0.5.2-2.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
rhel-osp-installer-client-0.5.2-2.el7ost.noarch
openstack-foreman-installer-3.0.5-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.4-1.el7ost.noarch


Steps to reproduce:
1. Install rhel-osp-installer
2. Create/start HAneutron deployment with 3 controllers and 1 compute host.


Result:

The deployment shows as running and doesn't complete nor does it fail.

Expected result:
The deployment should complete successfully (or fail due to some particular error).

--- Additional comment from Omri Hochman on 2015-01-14 14:54:48 EST ---

Failed-qa  :  We got this bug reproduced on two different environment (neutron-gre-ha) . 

In both cases we saw that: on 1 out of 3 controllers - the puppet agent wasn't triggered even once after the OS was provisioned --> that caused 'pcs' not to be installed on this controller - which reports the following error (reported in comment #21) : 

change from notrun to 0 failed: /usr/sbin/pcs cluster auth pcmk-maca25400702875 pcmk-maca25400702876 pcmk-maca25400702877 -u hacluster -p CHANGEME --force returned 1 instead of one of [0]

Despite this error, since puppet was not triggered once (or rerun 3 times) the deployment status in staypuft GUI remains 'Deploying..' for 17 hours.  



Environment:
-------------
rhel-osp-installer-0.5.5-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.11-1.el7ost.noarch
foreman-1.6.0.49-4.el7ost.noarch
rubygem-foreman_api-0.1.11-6.el7sat.noarch
openstack-foreman-installer-3.0.9-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
foreman-installer-1.6.0-0.2.RC1.el7ost.noarch
openstack-puppet-modules-2014.2.8-1.el7ost.noarch
puppet-3.6.2-2.el7.noarch
puppet-server-3.6.2-2.el7.noarch

--- Additional comment from Omri Hochman on 2015-01-14 15:01:39 EST ---

dynflow view :
----------------
Note:
- Task 54 is 'suspended'
- Task 62 is 'pending' 

49: Actions::Staypuft::Host::ReportWait (success) [ 4706.58s / 10.03s ]
52: Actions::Staypuft::Host::PuppetRun (success) [ 0.02s / 0.02s ]
54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ]
57: Actions::Staypuft::Host::PuppetRun (success) [ 0.32s / 0.32s ]
59: Actions::Staypuft::Host::ReportWait (success) [ 4761.79s / 9.13s ]
62: Actions::Staypuft::Host::PuppetRun (pending)
64: Actions::Staypuft::Host::ReportWait (pending)


Task 54 :
---------
54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ]
Started at: 2015-01-14 15:49:49 UTC
Ended at: 2015-01-14 19:23:59 UTC
Real time: 12850.11s
Execution time (excluding suspended state): 24.37s

Input:
---
host_id: 3
after: '2015-01-14T10:49:49-05:00'
current_user_id: 3

Output:
---
status: false
poll_attempts:
  total: 2563
  failed: 0

Task 62:
---------
Started at:
Ended at:
Real time: 0.00
Execution time (excluding suspended state): 0.00s

Input:
---
host_id: 4
name: maca25400702877.example.com
current_user_id: 3

Output:
---
 {}

--- Additional comment from Omri Hochman on 2015-01-14 15:13:10 EST ---

Adding production.log :
hoping it will help us understand why puppet wasn't triggered even once on that controller - after the provisioned was over.

BTW puppet is installed on that machine.
rpm -qa | grep puppet
puppet-3.6.2-2.el7.noarch

--- Additional comment from Jiri Stransky on 2015-01-15 05:28:11 EST ---

Investigated one of the environments. It seems the cause might be yet another race condition. We check if a host is ready after provisioning and reboot by looking if the ssh port is open, but that doesn't seem to be enough. Sometimes even if the ssh port is open, the ssh connection for puppetssh apparently won't get established.

Foreman Proxy only allows us to trigger puppetssh runs on the host and not run arbitrary commands and see their result, so we'll have to fix this by giving the hosts some time to finish booting up after we detect that ssh port is open.

Even in the case where the puppet run *does* work, the first puppetssh connection is established ("Started Session 1 of user root." in syslog) 30 seconds before systemd prints "Reached target Multi-User System.". So giving the hosts some additional time would be desirable it seems.


--- Additional comment from Jiri Stransky on 2015-01-15 05:37:29 EST ---

Attached some log files from the deployment in question. (The deployer triggered puppet run on the problematic host manually.)

--- Additional comment from RHEL Product and Program Management on 2015-01-15 09:12:19 EST ---

Since this issue was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.


--- Additional comment from Jiri Stransky on 2015-01-15 10:24:02 EST ---

Relevant output from /var/log/foreman/production.log:

Started GET "/reports?search=eventful+%3D+true" for 10.10.50.242 at 2015-01-14 15:52:52 -0500
Processing by ReportsController#index as HTML
  Parameters: {"search"=>"eventful = true"}
  Rendered reports/_list.html.erb (15.9ms)
  Rendered reports/index.html.erb within layouts/application (17.0ms)
  Rendered common/_searchbar.html.erb (2.8ms)
  Rendered home/_user_dropdown.html.erb (1.2ms)
Read fragment views/tabs_and_title_records-3 (0.1ms)
  Rendered home/_topbar.html.erb (1.8ms)
  Rendered layouts/base.html.erb (3.0ms)
Completed 200 OK in 32ms (Views: 24.3ms | ActiveRecord: 1.9ms)
stream closed (IOError)
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `async_exec'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `block in execute_query'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/logging.rb:33:in `log_yield'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `execute_query'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:115:in `check_disconnect_errors'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:418:in `_execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block (2 levels) in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:430:in `check_database_errors'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `block in synchronize'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/connection_pool/threaded.rb:104:in `hold'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `synchronize'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/query.rb:79:in `execute_dui'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:861:in `execute_dui'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:774:in `update'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:108:in `save'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:56:in `save_step'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence.rb:46:in `save_step'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:62:in `save'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:266:in `save_state'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:435:in `execute_run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:230:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:9:in `block (2 levels) in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `call'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `with_meta_calculation'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:8:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:22:in `open_action'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:7:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:20:in `block in on_message'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:859:in `block in assigns'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `tap'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `assigns'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:138:in `match_value'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:116:in `block in match'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `each'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `match'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:17:in `on_message'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:82:in `on_envelope'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:72:in `receive'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block (2 levels) in run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `loop'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block in run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `catch'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:13:in `block in initialize'


--- Additional comment from Alexander Chuzhoy on 2015-01-16 14:29:00 EST ---

Reproduced on BM environment:
The puppet didn't run after the OS was provisioned.
A host was "waituntilready" in dynamic flow.
restarting the foreman-tasks service on that system manually caused the puppet to run on the controllers.

--- Additional comment from Mike Burns on 2015-01-16 14:42:03 EST ---

The latest occurrence was doing a "WaitUntilReady" step.  This step polls the host for the ssh port to be open.  The host was completely up, we could ssh in, etc.  

Workaround used:  restart foreman-tasks service.  Resume the deployment (resume dynflow).  Host was immediately recognized as up and the deployment succeeded.



--- Additional comment from Jiri Stransky on 2015-01-19 12:34:46 EST ---

Reproduced again on a QE lab machine, this time a different action was running at the time of the exception - the PuppetRun action.

Comment 2 Jiri Stransky 2015-01-23 13:40:02 UTC
Timeouts pull request upstream:

https://github.com/theforeman/staypuft/pull/410

As Mike mentioned in the whiteboard, timing out cannot mitigate the IOError issue but can mitigate other issues.

Comment 3 Mike Burns 2015-01-23 16:13:28 UTC
Timeouts won't mitigate the issue in bug 1182581, but they're still generically useful catching similar issues, so lets include anyway.

Comment 5 Alexander Chuzhoy 2015-01-30 18:52:06 UTC
Verified:
Environment:
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
openstack-foreman-installer-3.0.13-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.15-1.el7ost.noarch
rhel-osp-installer-client-0.5.5-2.el7ost.noarch
openstack-puppet-modules-2014.2.8-2.el7ost.noarch
rhel-osp-installer-0.5.5-2.el7ost.noarch



The deployment will fail after ~2.5 hours with the following message in the dynflow console:
ERF42-7017 [Staypuft::Exception]: You've reached the timeout set for this action. If the action is still ongoing, you can click on the "Resume Deployment" button to continue.

Comment 7 errata-xmlrpc 2015-02-09 15:20:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html