This is the split of the second issue mentioned in but 1173634. It has to do with a failure in postgres causing dynflow to hang forever. Workaround is to manually restart foreman-tasks and resume the deployment in the UI. +++ This bug was initially created as a clone of Bug #1173634 +++ rubygem-staypuft: Deployment runs without completing/failing for more than 17 hours. Environment: openstack-puppet-modules-2014.2.7-1.el7ost.noarch rhel-osp-installer-0.5.2-2.el7ost.noarch ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch rhel-osp-installer-client-0.5.2-2.el7ost.noarch openstack-foreman-installer-3.0.5-1.el7ost.noarch ruby193-rubygem-staypuft-0.5.4-1.el7ost.noarch Steps to reproduce: 1. Install rhel-osp-installer 2. Create/start HAneutron deployment with 3 controllers and 1 compute host. Result: The deployment shows as running and doesn't complete nor does it fail. Expected result: The deployment should complete successfully (or fail due to some particular error). --- Additional comment from Omri Hochman on 2015-01-14 14:54:48 EST --- Failed-qa : We got this bug reproduced on two different environment (neutron-gre-ha) . In both cases we saw that: on 1 out of 3 controllers - the puppet agent wasn't triggered even once after the OS was provisioned --> that caused 'pcs' not to be installed on this controller - which reports the following error (reported in comment #21) : change from notrun to 0 failed: /usr/sbin/pcs cluster auth pcmk-maca25400702875 pcmk-maca25400702876 pcmk-maca25400702877 -u hacluster -p CHANGEME --force returned 1 instead of one of [0] Despite this error, since puppet was not triggered once (or rerun 3 times) the deployment status in staypuft GUI remains 'Deploying..' for 17 hours. Environment: ------------- rhel-osp-installer-0.5.5-1.el7ost.noarch ruby193-rubygem-staypuft-0.5.11-1.el7ost.noarch foreman-1.6.0.49-4.el7ost.noarch rubygem-foreman_api-0.1.11-6.el7sat.noarch openstack-foreman-installer-3.0.9-1.el7ost.noarch ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch foreman-installer-1.6.0-0.2.RC1.el7ost.noarch openstack-puppet-modules-2014.2.8-1.el7ost.noarch puppet-3.6.2-2.el7.noarch puppet-server-3.6.2-2.el7.noarch --- Additional comment from Omri Hochman on 2015-01-14 15:01:39 EST --- dynflow view : ---------------- Note: - Task 54 is 'suspended' - Task 62 is 'pending' 49: Actions::Staypuft::Host::ReportWait (success) [ 4706.58s / 10.03s ] 52: Actions::Staypuft::Host::PuppetRun (success) [ 0.02s / 0.02s ] 54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ] 57: Actions::Staypuft::Host::PuppetRun (success) [ 0.32s / 0.32s ] 59: Actions::Staypuft::Host::ReportWait (success) [ 4761.79s / 9.13s ] 62: Actions::Staypuft::Host::PuppetRun (pending) 64: Actions::Staypuft::Host::ReportWait (pending) Task 54 : --------- 54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ] Started at: 2015-01-14 15:49:49 UTC Ended at: 2015-01-14 19:23:59 UTC Real time: 12850.11s Execution time (excluding suspended state): 24.37s Input: --- host_id: 3 after: '2015-01-14T10:49:49-05:00' current_user_id: 3 Output: --- status: false poll_attempts: total: 2563 failed: 0 Task 62: --------- Started at: Ended at: Real time: 0.00 Execution time (excluding suspended state): 0.00s Input: --- host_id: 4 name: maca25400702877.example.com current_user_id: 3 Output: --- {} --- Additional comment from Omri Hochman on 2015-01-14 15:13:10 EST --- Adding production.log : hoping it will help us understand why puppet wasn't triggered even once on that controller - after the provisioned was over. BTW puppet is installed on that machine. rpm -qa | grep puppet puppet-3.6.2-2.el7.noarch --- Additional comment from Jiri Stransky on 2015-01-15 05:28:11 EST --- Investigated one of the environments. It seems the cause might be yet another race condition. We check if a host is ready after provisioning and reboot by looking if the ssh port is open, but that doesn't seem to be enough. Sometimes even if the ssh port is open, the ssh connection for puppetssh apparently won't get established. Foreman Proxy only allows us to trigger puppetssh runs on the host and not run arbitrary commands and see their result, so we'll have to fix this by giving the hosts some time to finish booting up after we detect that ssh port is open. Even in the case where the puppet run *does* work, the first puppetssh connection is established ("Started Session 1 of user root." in syslog) 30 seconds before systemd prints "Reached target Multi-User System.". So giving the hosts some additional time would be desirable it seems. --- Additional comment from Jiri Stransky on 2015-01-15 05:37:29 EST --- Attached some log files from the deployment in question. (The deployer triggered puppet run on the problematic host manually.)
*** Bug 1182576 has been marked as a duplicate of this bug. ***
Relevant output from /var/log/foreman/production.log: Started GET "/reports?search=eventful+%3D+true" for 10.10.50.242 at 2015-01-14 15:52:52 -0500 Processing by ReportsController#index as HTML Parameters: {"search"=>"eventful = true"} Rendered reports/_list.html.erb (15.9ms) Rendered reports/index.html.erb within layouts/application (17.0ms) Rendered common/_searchbar.html.erb (2.8ms) Rendered home/_user_dropdown.html.erb (1.2ms) Read fragment views/tabs_and_title_records-3 (0.1ms) Rendered home/_topbar.html.erb (1.8ms) Rendered layouts/base.html.erb (3.0ms) Completed 200 OK in 32ms (Views: 24.3ms | ActiveRecord: 1.9ms) stream closed (IOError) /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `async_exec' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `block in execute_query' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/logging.rb:33:in `log_yield' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `execute_query' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `block in execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:115:in `check_disconnect_errors' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:418:in `_execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block (2 levels) in execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:430:in `check_database_errors' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block in execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `block in synchronize' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/connection_pool/threaded.rb:104:in `hold' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `synchronize' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/query.rb:79:in `execute_dui' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:861:in `execute_dui' /opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:774:in `update' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:108:in `save' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:56:in `save_step' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence.rb:46:in `save_step' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:62:in `save' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:266:in `save_state' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:435:in `execute_run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:230:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:9:in `block (2 levels) in execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `call' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `with_meta_calculation' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:8:in `block in execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:22:in `open_action' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:7:in `execute' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:20:in `block in on_message' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:859:in `block in assigns' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `tap' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `assigns' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:138:in `match_value' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:116:in `block in match' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `each' /opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `match' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:17:in `on_message' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:82:in `on_envelope' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:72:in `receive' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block (2 levels) in run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `loop' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block in run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `catch' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `run' /opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:13:in `block in initialize'
@dcleal -- can you get this the attention needed on the satellite/foreman side?
Reproduced on BM environment: The puppet didn't run after the OS was provisioned. A host was "waituntilready" in dynamic flow. restarting the foreman-tasks service on that system manually caused the puppet to run on the controllers.
The latest occurrence was doing a "WaitUntilReady" step. This step polls the host for the ssh port to be open. The host was completely up, we could ssh in, etc. Workaround used: restart foreman-tasks service. Resume the deployment (resume dynflow). Host was immediately recognized as up and the deployment succeeded.
Reproduced again on a QE lab machine, this time a different action was running at the time of the exception - the PuppetRun action.
Created attachment 981563 [details] foreman production.log with IOError
Created attachment 981564 [details] sosreport
A set of patches to help with the issue: https://github.com/Dynflow/dynflow/pull/135 With this changes, it should; 1. Retry several times when db error occurs 2. Terminate when the retries don't help: the foreman-tasks service will start the executor again automatically, switching the status of hang tasks to paused 3. Let the executor to finish just after the workers are done: don't wait for the whole task to finish.
Unable to reproduce using: ruby193-rubygem-dynflow-0.7.3-3.el7ost.noarch rhel-osp-installer-0.5.5-2.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0156.html