Bug 1182581 - deployments run forever because foreman-tasks stops polling ( "IOError" in foreman's production.log )
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ruby193-rubygem-dynflow
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ga
Sub Component: Installer
Assignee: Ivan Necas
QA Contact: Omri Hochman
URL:
Whiteboard:
Duplicates: 1182576
Depends On: 1173634
Blocks: 1177026 1184630
 
Reported: 2015-01-15 13:57 UTC by Mike Burns
Modified: 2023-02-22 23:02 UTC
CC: 15 users

Fixed In Version: ruby193-rubygem-dynflow-0.7.3-3.el7ost
Doc Type: Bug Fix
Doc Text:
When the foreman-tasks process disconnected from the database, polling would stop and deployments would run forever. This fix corrects the code to reconnect after a disconnect; if the reconnect fails, the process restarts. Deployments now continue despite momentary issues in database connectivity.
Clone Of: 1173634
Clones: 1184630
Environment:
Last Closed: 2015-02-09 15:19:40 UTC
Target Upstream Version:
Embargoed:


Attachments
foreman production.log with IOError (3.93 MB, text/plain)
2015-01-19 17:35 UTC, Jiri Stransky
sosreport (5.63 MB, application/x-xz)
2015-01-19 17:36 UTC, Jiri Stransky


Links
Red Hat Product Errata RHBA-2015:0156 (SHIPPED_LIVE): Red Hat Enterprise Linux OpenStack Platform Installer Bug Fix Advisory - Last Updated 2015-02-09 20:13:39 UTC

Description Mike Burns 2015-01-15 13:57:51 UTC
This is a split of the second issue mentioned in bug 1173634. It concerns a PostgreSQL failure that causes dynflow to hang forever.

The workaround is to manually restart foreman-tasks and resume the deployment in the UI.

+++ This bug was initially created as a clone of Bug #1173634 +++

rubygem-staypuft: Deployment runs without completing/failing for more than 17 hours.

Environment:
openstack-puppet-modules-2014.2.7-1.el7ost.noarch
rhel-osp-installer-0.5.2-2.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
rhel-osp-installer-client-0.5.2-2.el7ost.noarch
openstack-foreman-installer-3.0.5-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.4-1.el7ost.noarch


Steps to reproduce:
1. Install rhel-osp-installer
2. Create/start HAneutron deployment with 3 controllers and 1 compute host.


Result:

The deployment shows as running and neither completes nor fails.

Expected result:
The deployment should complete successfully (or fail due to some particular error).

--- Additional comment from Omri Hochman on 2015-01-14 14:54:48 EST ---

Failed QA: we reproduced this bug on two different environments (neutron-gre-ha).

In both cases we saw that on 1 out of 3 controllers the puppet agent wasn't triggered even once after the OS was provisioned. As a result, 'pcs' was not installed on that controller, which reports the following error (reported in comment #21):

change from notrun to 0 failed: /usr/sbin/pcs cluster auth pcmk-maca25400702875 pcmk-maca25400702876 pcmk-maca25400702877 -u hacluster -p CHANGEME --force returned 1 instead of one of [0]

Despite this error, since puppet was not triggered even once (or rerun 3 times), the deployment status in the Staypuft GUI remained 'Deploying..' for 17 hours.



Environment:
-------------
rhel-osp-installer-0.5.5-1.el7ost.noarch
ruby193-rubygem-staypuft-0.5.11-1.el7ost.noarch
foreman-1.6.0.49-4.el7ost.noarch
rubygem-foreman_api-0.1.11-6.el7sat.noarch
openstack-foreman-installer-3.0.9-1.el7ost.noarch
ruby193-rubygem-foreman_openstack_simplify-0.0.6-8.el7ost.noarch
foreman-installer-1.6.0-0.2.RC1.el7ost.noarch
openstack-puppet-modules-2014.2.8-1.el7ost.noarch
puppet-3.6.2-2.el7.noarch
puppet-server-3.6.2-2.el7.noarch

--- Additional comment from Omri Hochman on 2015-01-14 15:01:39 EST ---

dynflow view :
----------------
Note:
- Task 54 is 'suspended'
- Task 62 is 'pending' 

49: Actions::Staypuft::Host::ReportWait (success) [ 4706.58s / 10.03s ]
52: Actions::Staypuft::Host::PuppetRun (success) [ 0.02s / 0.02s ]
54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ]
57: Actions::Staypuft::Host::PuppetRun (success) [ 0.32s / 0.32s ]
59: Actions::Staypuft::Host::ReportWait (success) [ 4761.79s / 9.13s ]
62: Actions::Staypuft::Host::PuppetRun (pending)
64: Actions::Staypuft::Host::ReportWait (pending)


Task 54 :
---------
54: Actions::Staypuft::Host::ReportWait (suspended) [ 12850.11s / 24.37s ]
Started at: 2015-01-14 15:49:49 UTC
Ended at: 2015-01-14 19:23:59 UTC
Real time: 12850.11s
Execution time (excluding suspended state): 24.37s

Input:
---
host_id: 3
after: '2015-01-14T10:49:49-05:00'
current_user_id: 3

Output:
---
status: false
poll_attempts:
  total: 2563
  failed: 0

Task 62:
---------
Started at:
Ended at:
Real time: 0.00
Execution time (excluding suspended state): 0.00s

Input:
---
host_id: 4
name: maca25400702877.example.com
current_user_id: 3

Output:
---
 {}

--- Additional comment from Omri Hochman on 2015-01-14 15:13:10 EST ---

Adding production.log:
hoping it will help us understand why puppet wasn't triggered even once on that controller after provisioning was over.

BTW, puppet is installed on that machine:
rpm -qa | grep puppet
puppet-3.6.2-2.el7.noarch

--- Additional comment from Jiri Stransky on 2015-01-15 05:28:11 EST ---

Investigated one of the environments. It seems the cause might be yet another race condition. We check if a host is ready after provisioning and reboot by checking whether the ssh port is open, but that doesn't seem to be enough. Sometimes even if the ssh port is open, the ssh connection for puppetssh apparently won't get established.

Foreman Proxy only allows us to trigger puppetssh runs on the host, not to run arbitrary commands and see their result, so we'll have to fix this by giving the hosts some time to finish booting up after we detect that the ssh port is open.

Even in the case where the puppet run *does* work, the first puppetssh connection is established ("Started Session 1 of user root." in syslog) 30 seconds before systemd prints "Reached target Multi-User System.". So giving the hosts some additional time seems desirable.
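
For illustration only, a minimal Ruby sketch of the "port open plus grace period" readiness check described above (the method names, port constant, and grace value are hypothetical assumptions, not the actual Staypuft/foreman-tasks code):

    require 'socket'
    require 'timeout'

    SSH_PORT   = 22
    BOOT_GRACE = 60  # seconds to let the host finish booting after sshd starts answering

    def ssh_port_open?(host, timeout = 5)
      Timeout.timeout(timeout) do
        TCPSocket.new(host, SSH_PORT).close
        true
      end
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Errno::ETIMEDOUT, Timeout::Error
      false
    end

    def wait_until_ready(host)
      sleep 5 until ssh_port_open?(host)
      # The port being open is not enough; give the host time to finish booting
      # before the first puppetssh run is triggered.
      sleep BOOT_GRACE
    end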


--- Additional comment from Jiri Stransky on 2015-01-15 05:37:29 EST ---

Attached some log files from the deployment in question. (The deployer triggered a puppet run on the problematic host manually.)

Comment 2 Mike Burns 2015-01-15 14:23:01 UTC
*** Bug 1182576 has been marked as a duplicate of this bug. ***

Comment 4 Jiri Stransky 2015-01-15 15:24:02 UTC
Relevant output from /var/log/foreman/production.log:

Started GET "/reports?search=eventful+%3D+true" for 10.10.50.242 at 2015-01-14 15:52:52 -0500
Processing by ReportsController#index as HTML
  Parameters: {"search"=>"eventful = true"}
  Rendered reports/_list.html.erb (15.9ms)
  Rendered reports/index.html.erb within layouts/application (17.0ms)
  Rendered common/_searchbar.html.erb (2.8ms)
  Rendered home/_user_dropdown.html.erb (1.2ms)
Read fragment views/tabs_and_title_records-3 (0.1ms)
  Rendered home/_topbar.html.erb (1.8ms)
  Rendered layouts/base.html.erb (3.0ms)
Completed 200 OK in 32ms (Views: 24.3ms | ActiveRecord: 1.9ms)
stream closed (IOError)
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `async_exec'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `block in execute_query'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/logging.rb:33:in `log_yield'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:149:in `execute_query'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:115:in `check_disconnect_errors'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:136:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:418:in `_execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block (2 levels) in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:430:in `check_database_errors'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `block in synchronize'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/connection_pool/threaded.rb:104:in `hold'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/connecting.rb:236:in `synchronize'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/adapters/postgres.rb:246:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/database/query.rb:79:in `execute_dui'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:861:in `execute_dui'
/opt/rh/ruby193/root/usr/share/gems/gems/sequel-3.45.0/lib/sequel/dataset/actions.rb:774:in `update'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:108:in `save'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence_adapters/sequel.rb:56:in `save_step'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/persistence.rb:46:in `save_step'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:62:in `save'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:266:in `save_state'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:435:in `execute_run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/action.rb:230:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:9:in `block (2 levels) in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `call'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract.rb:152:in `with_meta_calculation'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:8:in `block in execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:22:in `open_action'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/execution_plan/steps/abstract_flow_step.rb:7:in `execute'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:20:in `block in on_message'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:859:in `block in assigns'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `tap'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:858:in `assigns'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:138:in `match_value'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:116:in `block in match'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `each'
/opt/rh/ruby193/root/usr/share/gems/gems/algebrick-0.4.0/lib/algebrick.rb:115:in `match'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/executors/parallel/worker.rb:17:in `on_message'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:82:in `on_envelope'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:72:in `receive'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block (2 levels) in run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `loop'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `block in run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `catch'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:99:in `run'
/opt/rh/ruby193/root/usr/share/gems/gems/dynflow-0.7.3/lib/dynflow/micro_actor.rb:13:in `block in initialize'

Comment 5 Mike Burns 2015-01-16 19:25:56 UTC
@dcleal -- can you get this the attention needed on the satellite/foreman side?

Comment 6 Alexander Chuzhoy 2015-01-16 19:29:00 UTC
Reproduced on a bare metal environment:
Puppet didn't run after the OS was provisioned.
A host was stuck in "WaitUntilReady" in dynflow.
Manually restarting the foreman-tasks service on that system caused puppet to run on the controllers.

Comment 7 Mike Burns 2015-01-16 19:42:03 UTC
The latest occurrence happened during a "WaitUntilReady" step. This step polls the host until the ssh port is open. The host was completely up, we could ssh in, etc.

Workaround used: restart the foreman-tasks service and resume the deployment (resume dynflow). The host was immediately recognized as up and the deployment succeeded.

Comment 10 Jiri Stransky 2015-01-19 17:34:46 UTC
Reproduced again on a QE lab machine; this time a different action was running at the time of the exception: the PuppetRun action.

Comment 11 Jiri Stransky 2015-01-19 17:35:36 UTC
Created attachment 981563 [details]
foreman production.log with IOError

Comment 12 Jiri Stransky 2015-01-19 17:36:47 UTC
Created attachment 981564 [details]
sosreport

Comment 14 Ivan Necas 2015-01-23 11:53:08 UTC
A set of patches to help with the issue: https://github.com/Dynflow/dynflow/pull/135

With these changes, it should (see the rough sketch after this list):

    1. Retry several times when a DB error occurs.
    2. Terminate when the retries don't help: the foreman-tasks service will start the executor again automatically, switching the status of hung tasks to paused.
    3. Let the executor finish as soon as the workers are done: don't wait for the whole task to finish.
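
For illustration only, a minimal Ruby/Sequel sketch of such retry-then-terminate handling (the retry count, back-off, helper name, and table/column names are assumptions, not the actual patch from the pull request above):

    require 'sequel'

    RETRIES = 3

    def with_db_retries(db)
      attempts = 0
      begin
        yield
      rescue Sequel::DatabaseDisconnectError, Sequel::DatabaseConnectionError
        attempts += 1
        raise if attempts > RETRIES   # give up; the exception terminates the executor
        db.disconnect                 # drop dead connections so the pool reconnects
        sleep attempts                # simple back-off before retrying
        retry
      end
    end

    # Example usage wrapping a persistence update; if the retries are exhausted
    # the error propagates, the executor terminates, and the foreman-tasks
    # service restarts it, leaving hung tasks paused.
    # with_db_retries(DB) { DB[:dynflow_steps].where(id: step_id).update(data: payload) }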

Comment 17 Omri Hochman 2015-01-29 16:15:20 UTC
Unable to reproduce using: 
ruby193-rubygem-dynflow-0.7.3-3.el7ost.noarch
rhel-osp-installer-0.5.5-2.el7ost.noarch

Comment 19 errata-xmlrpc 2015-02-09 15:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0156.html

