Bug 1469307
Summary: | Ansible workers not starting | | |
---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | Ryan Spagnola <rspagnol> |
Component: | Appliance | Assignee: | Joe Rafaniello <jrafanie> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | luke couzens <lcouzens> |
Severity: | urgent | Docs Contact: | |
Priority: | high | | |
Version: | 5.8.0 | CC: | abellott, cpelland, gtanzill, hkataria, jhardy, jrafanie, mpovolny, myoder, obarenbo, rspagnol, simaishi |
Target Milestone: | GA | Keywords: | TestOnly, ZStream |
Target Release: | 5.9.0 | | |
Hardware: | All | | |
OS: | All | | |
Whiteboard: | ansible_embed:black:migration | | |
Fixed In Version: | 5.9.0.1 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 1473787 | Environment: | |
Last Closed: | 2018-03-06 15:56:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | Bug |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | CFME Core | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1473787 | | |
Comment 5
CFME Bot
2017-07-18 21:43:22 UTC
*** Bug 1468898 has been marked as a duplicate of this bug. ***

QE: For testing purposes, you can recreate this issue by lowering the starting_timeout of the embedded_ansible_worker (in advanced settings) from 20.minutes to 2.minutes, or to any value less than the normal installation time:

:embedded_ansible_worker:
  :starting_timeout: 2.minutes
  :poll: 10.seconds
  :memory_threshold: 0.megabytes

After the starting_timeout value is reached, the worker will be marked as not responding and a new one will be started. After this happens 3-4 times, you should get the "undefined method" errors.

With this fix in place, the worker should still be marked as not responding, and we should attempt to kill the monitor thread if it didn't exit on its own. You should therefore be able to run many workers in a row, all failing to install within the starting_timeout, with no other errors reported.

Note: the original problem is that the 20 minute starting timeout is not enough. To fix it, the customer/user should increase this timeout to 30 or 40 minutes, depending on how long it takes to run the setup. We should probably get logs after this fix is applied so we can figure out why it is taking so long to run the setup script.

New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/59aab09a52b63113698ee042c91719e0518901a3

commit 59aab09a52b63113698ee042c91719e0518901a3
Author: Joe Rafaniello <jrafanie>
AuthorDate: Tue Jul 18 17:29:33 2017 -0400
Commit: Joe Rafaniello <jrafanie>
CommitDate: Wed Jul 19 13:46:48 2017 -0400

Only return the cached value if it exists.

If the cache was cleared and an exception was raised when trying to set the new `cache[:value]`, we'd end up setting the `cache[:timeout]` but not the new value. Any subsequent call would return a nil from `cache[:value]` because `cache[:timeout]` exists, making `cache` non-empty. The new code only returns from the cache if the :value key exists in the `cache`.
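The failure mode described in this commit message can be reproduced in isolation. The following is a minimal sketch, not the actual miq-module.rb code: `TimedCache`, `fetch_buggy`, `fetch_fixed`, and `flaky_block` are hypothetical names made up for illustration. It assumes only what the message states: a cache hash with `:timeout` and `:value` keys, and a block that may raise while computing the value.

```ruby
# Hypothetical stand-in for a cache with a timeout (illustration only).
class TimedCache
  def initialize(ttl_seconds, &compute)
    @ttl     = ttl_seconds
    @compute = compute
    @cache   = {}
  end

  # Old behaviour: :timeout is stored before the value is computed, so an
  # exception in the block leaves a non-empty cache with no :value key.
  def fetch_buggy
    expire_if_stale
    return @cache[:value] unless @cache.empty?

    @cache[:timeout] = Time.now + @ttl
    @cache[:value]   = @compute.call
  end

  # New behaviour: only return from the cache if :value exists, and only
  # write both keys once the value has been computed.
  def fetch_fixed
    expire_if_stale
    return @cache[:value] if @cache.key?(:value)

    value   = @compute.call
    timeout = Time.now + @ttl
    @cache[:timeout] = timeout
    @cache[:value]   = value
  end

  private

  def expire_if_stale
    timeout = @cache[:timeout]
    @cache.clear if timeout && Time.now > timeout
  end
end

# A block that raises on its first call and succeeds afterwards.
def flaky_block
  calls = 0
  lambda do
    calls += 1
    raise "setup failed" if calls == 1
    42
  end
end

buggy = TimedCache.new(60, &flaky_block)
begin
  buggy.fetch_buggy
rescue RuntimeError
  # first call fails, but :timeout has already been stored
end
buggy_second = buggy.fetch_buggy # nil: served from the half-filled cache

fixed = TimedCache.new(60, &flaky_block)
begin
  fixed.fetch_fixed
rescue RuntimeError
  # first call fails and nothing has been stored
end
fixed_second = fixed.fetch_fixed # 42: the block is simply run again
```

A nil coming back from a cache like this is one plausible source of the "undefined method" errors mentioned above, since callers would then invoke methods on nil.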
Additionally, for consistency, we only set the `cache[:timeout]` and `cache[:value]` after we've calculated these values, therefore the `cache` doesn't get partially set. This is not required but makes me feel better.

https://bugzilla.redhat.com/show_bug.cgi?id=1469307
https://bugzilla.redhat.com/show_bug.cgi?id=1468898

This replaces https://github.com/ManageIQ/manageiq-gems-pending/pull/244

lib/gems/pending/util/extensions/miq-module.rb | 5 +++--
spec/util/extensions/miq-module_spec.rb | 27 ++++++++++++++++++++++++++
2 files changed, 30 insertions(+), 2 deletions(-)

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/63816f42aa17f7a6ce4d0ed324448da913f78831

commit 63816f42aa17f7a6ce4d0ed324448da913f78831
Author: Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:36:56 2017 -0400
Commit: Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 11:00:44 2017 -0400

Delegate the pid knowledge to the worker row.

Each worker can implement their own way to invoke processes so let them choose and persist their pid value, don't assume we can use Process.pid.

https://bugzilla.redhat.com/show_bug.cgi?id=1469307
https://bugzilla.redhat.com/show_bug.cgi?id=1468898

app/models/miq_worker/runner.rb | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/d49338133fe451db9d46df0252ed5c4f3b9825b8

commit d49338133fe451db9d46df0252ed5c4f3b9825b8
Author: Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:38:48 2017 -0400
Commit: Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 11:00:44 2017 -0400

Tag the monitor thread with the worker info so we can kill it later

When the server starts the monitor thread, store the worker class and id in the thread object so the server can then kill that thread if required later.
Implement stop/kill/terminate in the same way: look for the thread containing the worker's class and id in Thread.list, exit it, and destroy the worker row.

https://bugzilla.redhat.com/show_bug.cgi?id=1469307
https://bugzilla.redhat.com/show_bug.cgi?id=1468898

app/models/embedded_ansible_worker.rb | 33 +++++++++++++++++--
spec/models/embedded_ansible_worker_spec.rb | 50 +++++++++++++++++++++++++++++
2 files changed, 80 insertions(+), 3 deletions(-)

New commit detected on ManageIQ/manageiq/fine:
https://github.com/ManageIQ/manageiq/commit/48faba0f59be5dcbb35becb102670d31d6ad00f2

commit 48faba0f59be5dcbb35becb102670d31d6ad00f2
Author: Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:36:56 2017 -0400
Commit: Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 12:30:17 2017 -0400

Delegate the pid knowledge to the worker row.

Each worker can implement their own way to invoke processes so let them choose and persist their pid value, don't assume we can use Process.pid.

https://bugzilla.redhat.com/show_bug.cgi?id=1469307
https://bugzilla.redhat.com/show_bug.cgi?id=1468898

app/models/miq_worker/runner.rb | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

New commit detected on ManageIQ/manageiq/fine:
https://github.com/ManageIQ/manageiq/commit/86ec71fa64022c8aa19046b4fa27542cfc601817

commit 86ec71fa64022c8aa19046b4fa27542cfc601817
Author: Joe Rafaniello <jrafanie>
AuthorDate: Wed Jul 19 16:38:48 2017 -0400
Commit: Joe Rafaniello <jrafanie>
CommitDate: Fri Jul 21 12:31:20 2017 -0400

Tag the monitor thread with the worker info so we can kill it later

When the server starts the monitor thread, store the worker class and id in the thread object so the server can then kill that thread if required later.

Implement stop/kill/terminate in the same way: look for the thread containing the worker's class and id in Thread.list, exit it, and destroy the worker row.
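The tag-and-find approach in this commit message can be sketched in plain Ruby as follows. This is an illustration under assumptions, not the code in embedded_ansible_worker.rb: `FakeWorker` and its method names are hypothetical, the `sleep` stands in for the long-running setup, and destroying the worker row is left out because there is no database here. Only `Thread#[]=`, `Thread.list`, `Thread#exit`, and `Thread#join` are real Ruby APIs.

```ruby
# Hypothetical worker with only an id (illustration only).
FakeWorker = Struct.new(:id) do
  # Start the monitor thread and tag it with the worker's class and id.
  def start_monitor_thread
    thread = Thread.new { sleep } # stands in for a setup that never finishes
    thread[:worker_class] = self.class.name
    thread[:worker_id]    = id
    thread
  end

  # Look for the thread carrying this worker's class and id in Thread.list.
  def find_monitor_thread
    Thread.list.detect do |t|
      t[:worker_class] == self.class.name && t[:worker_id] == id
    end
  end

  # stop/kill/terminate can all go through this: exit the tagged thread.
  def kill_monitor_thread
    thread = find_monitor_thread
    return false unless thread

    thread.exit
    thread.join # wait until the thread has actually finished
    true
  end
end

worker = FakeWorker.new(7)
other  = FakeWorker.new(8)

worker_thread = worker.start_monitor_thread
other_thread  = other.start_monitor_thread

found  = worker.find_monitor_thread
killed = worker.kill_monitor_thread

worker_alive = worker_thread.alive? # false: the tagged thread was exited
other_alive  = other_thread.alive?  # true: a different worker's thread is untouched

other.kill_monitor_thread # clean up the remaining thread
```

Tagging by both class and id means a thread belonging to another worker is never picked up by mistake, which matters when several timed-out workers are started one after another.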
https://bugzilla.redhat.com/show_bug.cgi?id=1469307
https://bugzilla.redhat.com/show_bug.cgi?id=1468898

app/models/embedded_ansible_worker.rb | 32 ++++++++++++++++--
spec/models/embedded_ansible_worker_spec.rb | 50 +++++++++++++++++++++++++++++
2 files changed, 80 insertions(+), 2 deletions(-)

Verified in 5.9.0.2