Bug 1469307

Summary: Ansible workers not starting
Product: Red Hat CloudForms Management Engine Reporter: Ryan Spagnola <rspagnol>
Component: Appliance Assignee: Joe Rafaniello <jrafanie>
Status: CLOSED CURRENTRELEASE QA Contact: luke couzens <lcouzens>
Severity: urgent Docs Contact:
Priority: high    
Version: 5.8.0 CC: abellott, cpelland, gtanzill, hkataria, jhardy, jrafanie, mpovolny, myoder, obarenbo, rspagnol, simaishi
Target Milestone: GA Keywords: TestOnly, ZStream
Target Release: 5.9.0   
Hardware: All   
OS: All   
Whiteboard: ansible_embed:black:migration
Fixed In Version: 5.9.0.1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1473787 (view as bug list) Environment:
Last Closed: 2018-03-06 15:56:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: Bug
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1473787    

Comment 8 Joe Rafaniello 2017-07-21 15:56:03 UTC
*** Bug 1468898 has been marked as a duplicate of this bug. ***

Comment 9 Joe Rafaniello 2017-07-21 16:07:58 UTC
QE: For testing purposes, you can recreate this issue by lowering the starting_timeout of the embedded_ansible_worker (in advanced settings) from 20.minutes to 2.minutes, or to any value shorter than the normal installation time.

    :embedded_ansible_worker:
      :starting_timeout: 2.minutes
      :poll: 10.seconds
      :memory_threshold: 0.megabytes

After the starting_timeout value is reached, the worker will be marked as not responding and a new one will be started.  After this happens 3-4 times, you should get the "undefined method" errors.  With this fix in place, the worker should still be marked as not responding, and we should attempt to kill the monitor thread if it didn't exit on its own, so you should be able to run many workers in a row, all failing to install within the starting_timeout, with no other errors reported.


Note: to fix the original problem, which is that the 20-minute starting timeout is not long enough, the customer/user should increase this timeout to 30 or 40 minutes, depending on how long it takes to run the setup.  We should probably get logs after this fix is applied so we can figure out why it is taking so long to run the setup script.

Comment 10 CFME Bot 2017-07-21 16:08:15 UTC
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/59aab09a52b63113698ee042c91719e0518901a3

commit 59aab09a52b63113698ee042c91719e0518901a3
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Tue Jul 18 17:29:33 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Wed Jul 19 13:46:48 2017 -0400

    Only return the cached value if it exists.
    
    If the cache was cleared and an exception was raised when trying to set
    the new `cache[:value]`, we'd end up setting the `cache[:timeout]` but
    not the new value.  Any subsequent call would return a nil from
    `cache[:value]` because `cache[:timeout]` exists, making `cache` non-empty.
    
    The new code only returns from the cache if the :value key exists in the
    `cache`.
    
    Additionally, for consistency, we only set the `cache[:timeout]`
    and `cache[:value]` after we've calculated these values, therefore the
    `cache` doesn't get partially set.  This is not required but makes me
    feel better.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1469307
    https://bugzilla.redhat.com/show_bug.cgi?id=1468898
    
    This replaces https://github.com/ManageIQ/manageiq-gems-pending/pull/244

 lib/gems/pending/util/extensions/miq-module.rb |  5 +++--
 spec/util/extensions/miq-module_spec.rb        | 27 ++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 2 deletions(-)
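
For readers following along, the failure mode described in this commit message can be sketched roughly as below. This is a minimal illustration only, not the actual miq-module.rb code: the TimedCache class and its fetch_buggy / fetch_fixed methods are hypothetical names invented for this example. It assumes the cache is a plain hash holding :timeout and :value keys, as the message describes.

```ruby
# Hypothetical stand-in for a timed cache; NOT the real miq-module.rb code.
class TimedCache
  def initialize(ttl)
    @ttl   = ttl   # seconds a cached value stays valid
    @cache = {}
  end

  # Buggy variant: :timeout is stored before the block runs, so an exception
  # in the block leaves the cache non-empty but without a :value key.
  def fetch_buggy
    @cache.clear if @cache[:timeout] && Time.now >= @cache[:timeout]
    return @cache[:value] unless @cache.empty?

    @cache[:timeout] = Time.now + @ttl
    @cache[:value]   = yield
  end

  # Fixed variant: only return from the cache if :value exists, and only
  # populate the cache after the new value has been calculated.
  def fetch_fixed
    @cache.clear if @cache[:timeout] && Time.now >= @cache[:timeout]
    return @cache[:value] if @cache.key?(:value)

    value            = yield
    @cache[:timeout] = Time.now + @ttl
    @cache[:value]   = value
  end
end
```

With the buggy variant, one call whose block raises is enough to make every later call return nil until the timeout expires, which would be consistent with "undefined method" errors on nil further up the stack. With the fixed variant, the next call simply runs the block again.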

Comment 11 CFME Bot 2017-07-21 16:11:37 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/63816f42aa17f7a6ce4d0ed324448da913f78831

commit 63816f42aa17f7a6ce4d0ed324448da913f78831
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:36:56 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 11:00:44 2017 -0400

    Delegate the pid knowledge to the worker row.
    
    Each worker can implement their own way to invoke processes so let them
    choose and persist their pid value, don't assume we can use Process.pid.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1469307
    https://bugzilla.redhat.com/show_bug.cgi?id=1468898

 app/models/miq_worker/runner.rb | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
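
As a rough illustration of the idea in this commit message (not the actual runner.rb change), the difference between assuming Process.pid and asking the worker row might look like the sketch below. WorkerRow and SketchRunner are hypothetical names used only for this example, and the row is modeled as a simple Struct rather than a database record.

```ruby
# Hypothetical stand-in for a worker database row; only the pid matters here.
WorkerRow = Struct.new(:pid)

# Hypothetical runner; NOT the real MiqWorker::Runner.
class SketchRunner
  def initialize(worker_row)
    @worker = worker_row
  end

  # Before: assumes the runner always lives in its own dedicated process.
  def pid_from_process
    Process.pid
  end

  # After: trusts whatever pid the worker chose to persist on its row.
  def pid_from_row
    @worker.pid
  end
end
```

The two values can differ whenever a worker is not started as its own process, for example when it runs as a thread inside the server, so reading the pid from the row avoids attributing the wrong process to the worker.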

Comment 12 CFME Bot 2017-07-21 16:11:43 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/d49338133fe451db9d46df0252ed5c4f3b9825b8

commit d49338133fe451db9d46df0252ed5c4f3b9825b8
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:38:48 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 11:00:44 2017 -0400

    Tag the monitor thread with the worker info so we can kill it later
    
    When the server starts the monitor thread, store the worker class and id
    in the thread object so the server can then kill that thread if required
    later.
    
    Implement stop/kill/terminate in the same way:  look for the thread
    containing the worker's class and id in Thread.list, exit it, and
    destroy the worker row.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1469307
    https://bugzilla.redhat.com/show_bug.cgi?id=1468898

 app/models/embedded_ansible_worker.rb       | 33 +++++++++++++++++--
 spec/models/embedded_ansible_worker_spec.rb | 50 +++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 3 deletions(-)
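
The tagging approach described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in embedded_ansible_worker.rb: the method names and the :worker_class / :worker_id keys are hypothetical, and destroying the worker row is omitted because it needs a database.

```ruby
# Hypothetical helpers; names and keys are illustrative, NOT ManageIQ's API.

# Start a monitor thread and tag it with the worker's class and id.
def start_monitor_thread(worker_class, worker_id)
  thread = Thread.new { yield }
  # Thread#[]= stores a value on the thread object itself.
  thread[:worker_class] = worker_class
  thread[:worker_id]    = worker_id
  thread
end

# Look through all live threads for the one tagged with this worker.
def find_monitor_thread(worker_class, worker_id)
  Thread.list.detect do |t|
    t[:worker_class] == worker_class && t[:worker_id] == worker_id
  end
end

# Exit the tagged thread if it is still around; returns true if one was killed.
def kill_monitor_thread(worker_class, worker_id)
  thread = find_monitor_thread(worker_class, worker_id)
  return false unless thread

  thread.exit
  thread.join # wait until the thread has actually terminated
  true
end
```

In this sketch stop, kill, and terminate would all route through kill_monitor_thread, matching the "implement stop/kill/terminate in the same way" note above. Calling it for a worker whose thread already exited on its own is harmless and just returns false.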

Comment 14 CFME Bot 2017-07-21 17:51:16 UTC
New commit detected on ManageIQ/manageiq/fine:
https://github.com/ManageIQ/manageiq/commit/48faba0f59be5dcbb35becb102670d31d6ad00f2

commit 48faba0f59be5dcbb35becb102670d31d6ad00f2
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:36:56 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 12:30:17 2017 -0400

    Delegate the pid knowledge to the worker row.
    
    Each worker can implement their own way to invoke processes so let them
    choose and persist their pid value, don't assume we can use Process.pid.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1469307
    https://bugzilla.redhat.com/show_bug.cgi?id=1468898

 app/models/miq_worker/runner.rb | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comment 15 CFME Bot 2017-07-21 17:51:28 UTC
New commit detected on ManageIQ/manageiq/fine:
https://github.com/ManageIQ/manageiq/commit/86ec71fa64022c8aa19046b4fa27542cfc601817

commit 86ec71fa64022c8aa19046b4fa27542cfc601817
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:38:48 2017 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 12:31:20 2017 -0400

    Tag the monitor thread with the worker info so we can kill it later
    
    When the server starts the monitor thread, store the worker class and id
    in the thread object so the server can then kill that thread if required
    later.
    
    Implement stop/kill/terminate in the same way:  look for the thread
    containing the worker's class and id in Thread.list, exit it, and
    destroy the worker row.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1469307
    https://bugzilla.redhat.com/show_bug.cgi?id=1468898

 app/models/embedded_ansible_worker.rb       | 32 ++++++++++++++++--
 spec/models/embedded_ansible_worker_spec.rb | 50 +++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 2 deletions(-)

Comment 16 luke couzens 2017-10-12 18:37:48 UTC
Verified in 5.9.0.2