The goal here is to mitigate the impact of bug 1182581. We need to have a timeout on the steps so that we can inform the user that the system may be hung. +++ This bug was initially created as a clone of Bug #1182581 +++ This is the split of the second issue mentioned in but 1173634. It has to do with a failure in postgres causing dynflow to hang forever. Workaround is to manually restart foreman-tasks and resume the deployment in the UI. +++ This bug was initially created as a clone of Bug #1173634 +++ rubygem-staypuft: Deployment runs without completing/failing for more than 17 hours. Environment: openstack-puppet-modules-2014.2.7-1.el7ost.noarch rhel-osp-installer-0.5.2-2.el7ost.noarch ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch rhel-osp-installer-client-0.5.2-2.el7ost.noarch openstack-foreman-installer-3.0.5-1.el7ost.noarch ruby193-rubygem-staypuft-0.5.4-1.el7ost.noarch Steps to reproduce: 1. Install rhel-osp-installer 2. Create/start HAneutron deployment with 3 controllers and 1 compute host. Result: The deployment shows as running and doesn't complete nor does it fail. Expected result: The deployment should complete successfully (or fail due to some particular error). --- Additional comment from Omri Hochman on 2015-01-14 14:54:48 EST --- Failed-qa : We got this bug reproduced on two different environment (neutron-gre-ha) . In both cases we saw that: on 1 out of 3 controllers - the puppet agent wasn't triggered even once after the OS was provisioned --> that caused 'pcs' not to be installed on this controller - which reports the following error (reported in comment #21) : change from notrun to 0 failed: /usr/sbin/pcs cluster auth pcmk-maca25400702875 pcmk-maca25400702876 pcmk-maca25400702877 -u hacluster -p CHANGEME --force returned 1 instead of one of [0] Despite this error, since puppet was not triggered once (or rerun 3 times) the deployment status in staypuft GUI remains 'Deploying..' for 17 hours. Environment: ------------- rhel-osp-installer-0.5.5-1.el7ost.noarch ruby193-rubygem-staypuft-0.5.11-1.el7ost.noarch foreman-1.6.0.49-4.el7ost.noarch rubygem-foreman_api-0.1.11-6.el7sat.noarch openstack-foreman-installer-3.0.9-1.el7ost.noarch ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch foreman-installer-1.6.0-0.2.RC1.el7ost.noarch openstack-puppet-modules-2014.2.8-1.el7ost.noarch puppet-3.6.2-2.el7.noarch puppet-server-3.6.2-2.el7.noarch --- Additional comment from Omri Hochman on 2015-01-14 15:01:39 EST --- dynflow view : ---------------- Note: - Task 54 is 'suspended' - Task 62 is 'pending' 49: Actions::Staypuft::Host::ReportWait (success) [ 4706.58s / 10.03s ] 52: Actions::Staypuft::Host::PuppetRun (success) [ 0.02s / 0.02s ] 54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ] 57: Actions::Staypuft::Host::PuppetRun (success) [ 0.32s / 0.32s ] 59: Actions::Staypuft::Host::ReportWait (success) [ 4761.79s / 9.13s ] 62: Actions::Staypuft::Host::PuppetRun (pending) 64: Actions::Staypuft::Host::ReportWait (pending) Task 54 : --------- 54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ] Started at: 2015-01-14 15:49:49 UTC Ended at: 2015-01-14 19:23:59 UTC Real time: 12850.11s Execution time (excluding suspended state): 24.37s Input: --- host_id: 3 after: '2015-01-14T10:49:49-05:00' current_user_id: 3 Output: --- status: false poll_attempts: total: 2563 failed: 0 Task 62: --------- Started at: Ended at: Real time: 0.00 Execution time (excluding suspended state): 0.00s Input: --- host_id: 4 name: maca25400702877.example.com current_user_id: 3 Output: --- {} --- Additional comment from Omri Hochman on 2015-01-14 15:13:10 EST --- Adding production.log : hoping it will help us understand why puppet wasn't triggered even once on that controller - after the provisioned was over. BTW puppet is installed on that machine. rpm -qa | grep puppet puppet-3.6.2-2.el7.noarch --- Additional comment from Jiri Stransky on 2015-01-15 05:28:11 EST --- Investigated one of the environments. It seems the cause might be yet another race condition. We check if a host is ready after provisioning and reboot by looking if the ssh port is open, but that doesn't seem to be enough. Sometimes even if the ssh port is open, the ssh connection for puppetssh apparently won't get established. Foreman Proxy only allows us to trigger puppetssh runs on the host and not run arbitrary commands and see their result, so we'll have to fix this by giving the hosts some time to finish booting up after we detect that ssh port is open. Even in the case where the puppet run *does* work, the first puppetssh connection is established ("Started Session 1 of user root." in syslog) 30 seconds before systemd prints "Reached target Multi-User System.". So giving the hosts some additional time would be desirable it seems. --- Additional comment from Jiri Stransky on 2015-01-15 05:37:29 EST --- Attached some log files from the deployment in question. (The deployer triggered puppet run on the problematic host manually.) --- Additional comment from RHEL Product and Program Management on 2015-01-15 09:12:19 EST --- Since this issue was entered in bugzilla, the release flag has been set to ? to ensure that it is properly evaluated for this release. --- Additional comment from Jiri Stransky on 2015-01-15 10:24:02 EST --- Relevant output from /var/log/foreman/production.log: Started GET "/reports?search=eventful+%3D+true" for 10.10.50.242 at 2015-01-14 15:52:52 -0500 Processing by ReportsController#index as HTML Parameters: {"search"=>"eventful = true"} Rendered reports/_list.html.erb (15.9ms) Rendered reports/index.html.erb within layouts/application (17.0ms) Rendered common/_searchbar.html.erb (2.8ms) Rendered home/_user_dropdown.html.erb (1.2ms) Read fragment views/tabs_and_title_records-3 (0.1ms) Rendered home/_topbar.html.erb (1.8ms) Rendered layouts/base.html.erb (3.0ms) Completed 200 OK in 32ms (Views: 24.3ms | ActiveRecord: 1.9ms) stream closed (IOError) /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `async_exec' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `block in execute_query' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/logging.rb:33:in `log_yield' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `execute_query' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `block in execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:115:in `check_disconnect_errors' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:418:in `_execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block (2 levels) in execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:430:in `check_database_errors' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block in execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `block in synchronize' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/connection_pool/threaded.rb:104:in `hold' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `synchronize' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/query.rb:79:in `execute_dui' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:861:in `execute_dui' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:774:in `update' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:108:in `save' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:56:in `save_step' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence.rb:46:in `save_step' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:62:in `save' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:266:in `save_state' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:435:in `execute_run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:230:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:9:in `block (2 levels) in execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `call' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `with_meta_calculation' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:8:in `block in execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:22:in `open_action' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:7:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:20:in `block in on_message' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:859:in `block in assigns' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `tap' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `assigns' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:138:in `match_value' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:116:in `block in match' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `each' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `match' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:17:in `on_message' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:82:in `on_envelope' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:72:in `receive' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block (2 levels) in run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `loop' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block in run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `catch' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:13:in `block in initialize' --- Additional comment from Alexander Chuzhoy on 2015-01-16 14:29:00 EST --- Reproduced on BM environment: The puppet didn't run after the OS was provisioned. A host was "waituntilready" in dynamic flow. restarting the foreman-tasks service on that system manually caused the puppet to run on the controllers. --- Additional comment from Mike Burns on 2015-01-16 14:42:03 EST --- The latest occurrence was doing a "WaitUntilReady" step. This step polls the host for the ssh port to be open. The host was completely up, we could ssh in, etc. Workaround used: restart foreman-tasks service. Resume the deployment (resume dynflow). Host was immediately recognized as up and the deployment succeeded. --- Additional comment from Jiri Stransky on 2015-01-19 12:34:46 EST --- Reproduced again on a QE lab machine, this time a different action was running at the time of the exception - the PuppetRun action.
Timeouts pull request upstream: https://github.com/theforeman/staypuft/pull/410 As Mike mentioned in the whiteboard, timing out cannot mitigate the IOError issue but can mitigate other issues.
Timeouts won't mitigate the issue in bug 1182581, but they're still generically useful catching similar issues, so lets include anyway.
Verified: Environment: ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch openstack-foreman-installer-3.0.13-1.el7ost.noarch ruby193-rubygem-staypuft-0.5.15-1.el7ost.noarch rhel-osp-installer-client-0.5.5-2.el7ost.noarch openstack-puppet-modules-2014.2.8-2.el7ost.noarch rhel-osp-installer-0.5.5-2.el7ost.noarch The deployment will fail after ~2.5 hours with the following message in the dynflow console: ERF42-7017 [Staypuft::Exception]: You've reached the timeout set for this action. If the action is still ongoing, you can click on the "Resume Deployment" button to continue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0156.html