Description of problem: Model method ends and does not start next method. Instead CloudForms will wait until the timeout and error. [----] I, [2015-08-31T10:51:06.578562 #25294:6466b34] INFO -- : Q-task_id([automation_task_11000000003179]) <AEMethod get_disk_data> get_disk_info - CloudForms Automate Method Ended [----] E, [2015-08-31T11:00:57.983707 #25294:531010] ERROR -- : Q-task_id([automation_task_11000000003179]) State=<GetDiskInfoWindows> running raised exception: <execution expired> [----] W, [2015-08-31T11:00:57.984068 #25294:531010] WARN -- : Q-task_id([automation_task_11000000003179]) Error in State=[GetDiskInfoWindows] Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
The log snippets don't seem to match, from the error message in the log snippet raised exception: <execution expired> Indicates that the Automate method had errors when it was talking to external servers using the rest-client or the winrm. Suggested that they rescue Timeout::Error and either retry or fail appropriately based on the customer needs.
Josh, Can we please get the log with the backtrace included. Thanks
On further analysis we found that every automate request is constrained by the queue message timeout. The automate request has to end within the msg_timeout for the queue entry, which currently defaults to 600 seconds (10 minutes). If the request doesn't end in 10 minutes, we get a timeout error, which because of a bug in our code doesn't get propagated properly and the cleanup doesn't happen. If we have an automate method that gets dispatched as part of the request the method becomes a detached process (zombies) which will not get cleaned and will accumulate in the system till the next restart of the OS. We have to fix the propagation of the timeout error from from the request and delete any automate process that we might have launched. In the customers scenario the automate method that uses winrm is taking upwards of 10 minutes to execute when the queue terminates the automate request. Even after we fix the propogation of the error code we still would be hitting the timeout. It is recommended that the customer write the automate methods as separate steps that they can use from inside Automate state machine which has a built in retry logic.
Analysis so far. The crux of the issue "seems" to be a bug in the gssapi gem releases the same pointer twice and gets a segfault. This causes reams of data being written to the STDERR. This is a known bug listed here https://github.com/zenchild/gssapi/issues/12 The customer is using CFME 5.3 which uses an older version of gssapi (1.1.2) which does have the bug fixed. This bug manifests itself in CFME in the following ways (1) Since we were not reading from STDERR, the reams of data that got generated during the segfault gets deadlocked waiting for someone to empty the STDERR channel. The automate method waits. (2) After 10 minutes the Queue timeout kicks in and terminates the Parent thread that had started the Automate Request, because of bugs in the way that timeout was being handled we didn't cleanup the Automate method and left it hanging as a detached process. We are trying to see if we can apply the patch to the gssapi library to confirm that this is the source of the problem. The Automate method that they customer uses the winrm and gossip gems.
In the above comment I incorrectly state that it is fixed in gssapi 1.1.2. It has not been fixed in 1.1.2. It has only been fixed in 1.2
Created attachment 1071477 [details] Proposed Automate GSSAPI SEGFAULT Fault This code can be inserted on top of the Automate methods that use the GSSAPI version 1.1.2 gem. It monkey patches the GSSAPI context methods so that we check the context before creating new ones as explained in the Pull Request. This should only be used if the GSSAPI gem cannot be updated on all servers. Using the Automate method patch saves the trouble of updating all servers to have the latest gssapi. Version 1.2 of gssapi has this issue fixed.
https://github.com/ManageIQ/manageiq/pull/4302
https://github.com/ManageIQ/manageiq/pull/4170
https://github.com/ManageIQ/manageiq/pull/3823
https://github.com/ManageIQ/manageiq/pull/4437
https://github.com/ManageIQ/manageiq/pull/4436
In order to test this issue in 5.4 you can have a long running automate method that does 2 things (1) Writes lots of information to stderr (2) Sleeps over 10 minutes, which is the default queue timeout. At the end of 10 minutes the Queue worker will terminate the automate thread leaving behind orphaned methods. The fixes in here will make sure that stderr gets drained properly and long running methods don't leave orphaned processes around. The cleanup after 10 minutes should terminate the methods and there should be appropriate log messages.
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/a9470c323e3330bab9d519ff3e51d0e24b96e5fa commit a9470c323e3330bab9d519ff3e51d0e24b96e5fa Author: Madhu Kanoor <mkanoor> AuthorDate: Tue Sep 15 16:59:39 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Mon Sep 21 18:16:44 2015 -0400 Implemented PR feedback https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Use chomp instead of strip Removed the extra require lib/miq_automation_engine/engine/miq_ae_method.rb | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/0fa5243e5bfacbbc57382b8f2f2bb191dd090ac4 commit 0fa5243e5bfacbbc57382b8f2f2bb191dd090ac4 Author: Madhu Kanoor <mkanoor> AuthorDate: Fri Sep 18 16:08:28 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Mon Sep 21 18:16:44 2015 -0400 PR Review changes https://bugzilla.redhat.com/show_bug.cgi?id=1258648 lib/miq_automation_engine/engine/miq_ae_method.rb | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/47266059a5ba66385916a53ba3d3ef6ae50a196d commit 47266059a5ba66385916a53ba3d3ef6ae50a196d Author: Madhu Kanoor <mkanoor> AuthorDate: Thu Sep 10 11:23:48 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Mon Sep 21 18:16:44 2015 -0400 Fixed Rubocop warnings https://bugzilla.redhat.com/show_bug.cgi?id=1258648 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 10 +++++----- spec/support/automation_spec_helper.rb | 8 ++++---- 2 files changed, 9 insertions(+), 9 deletions(-)
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/def6df099dc070d16d276171221f65948fcd9b61 commit def6df099dc070d16d276171221f65948fcd9b61 Author: Madhu Kanoor <mkanoor> AuthorDate: Thu Sep 10 10:07:12 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Mon Sep 21 18:16:44 2015 -0400 Fixed Automate Method handling https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Addressed the following issues (1) Flush stderr and stdout in separate threads, prior to this it would cause deadlock if the stderr was not drained. (2) Ensure block makes sure that the Automate method is terminated if it doesn't respond by the :msg_timeout specified in the queue. This used to leave orphaned processes. (3) Terminate the stdout and stderr reading threads in the ensure block. Added a spec with automate method that writes to stdout/stderr and sleeps waiting to be terminated. lib/miq_automation_engine/engine/miq_ae_method.rb | 62 +++++++++--- .../miq_ae_method_dispatch_spec.rb | 105 +++++++++++++++++++++ spec/support/automation_spec_helper.rb | 8 +- 3 files changed, 160 insertions(+), 15 deletions(-) create mode 100644 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/88232c7214a6070e88856a3c64b4c61ed3e67574 commit 88232c7214a6070e88856a3c64b4c61ed3e67574 Author: Madhu Kanoor <mkanoor> AuthorDate: Mon Sep 14 14:33:32 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Mon Sep 21 18:16:44 2015 -0400 Fixed based on PR comments https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Combined the shared code in invoke_external and run_ruby_method into a new function called run_method. Added a new function that does the cleanup (terminate the process and exit the threads) lib/miq_automation_engine/engine/miq_ae_method.rb | 151 ++++++++++------------ 1 file changed, 65 insertions(+), 86 deletions(-)
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/580cd638b0be9ee430209ebe6b93adc25c81adf1 commit 580cd638b0be9ee430209ebe6b93adc25c81adf1 Author: Madhu Kanoor <mkanoor> AuthorDate: Mon Sep 21 10:11:29 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Mon Sep 21 18:16:44 2015 -0400 Added comments around long running method https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Reduce the sleep time for the unresponsive method spec to 1 minute. Added comments about the sleep in the Automate method spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
https://github.com/ManageIQ/manageiq/pull/2428
Detected commit referencing this ticket while ticket status is POST.
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=5159a2c3487601d8337219135972a9d4b9c83654 commit 5159a2c3487601d8337219135972a9d4b9c83654 Author: Madhu Kanoor <mkanoor> AuthorDate: Thu Sep 10 10:07:12 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:44:05 2015 -0400 Fixed Automate Method handling https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Addressed the following issues (1) Flush stderr and stdout in separate threads, prior to this it would cause deadlock if the stderr was not drained. (2) Ensure block makes sure that the Automate method is terminated if it doesn't respond by the :msg_timeout specified in the queue. This used to leave orphaned processes. (3) Terminate the stdout and stderr reading threads in the ensure block. Added a spec with automate method that writes to stdout/stderr and sleeps waiting to be terminated. .../miq_ae_method_dispatch_spec.rb | 105 +++++++++++++++++++++ .../miq_automation_engine/engine/miq_ae_method.rb | 71 ++++++++++---- vmdb/spec/support/automation_spec_helper.rb | 46 +++++++++ 3 files changed, 202 insertions(+), 20 deletions(-) create mode 100644 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=4a50370eadcda11e59141028655111752a3b30c4 commit 4a50370eadcda11e59141028655111752a3b30c4 Author: Madhu Kanoor <mkanoor> AuthorDate: Thu Sep 10 11:23:48 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:44:05 2015 -0400 Fixed Rubocop warnings https://bugzilla.redhat.com/show_bug.cgi?id=1258648 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 10 +++++----- vmdb/spec/support/automation_spec_helper.rb | 8 ++++---- 2 files changed, 9 insertions(+), 9 deletions(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=b173e756c2a73251e40a01de16a1b97d2fcd8d91 commit b173e756c2a73251e40a01de16a1b97d2fcd8d91 Author: Madhu Kanoor <mkanoor> AuthorDate: Mon Sep 14 14:33:32 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:46:09 2015 -0400 Fixed based on PR comments https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Combined the shared code in invoke_external and run_ruby_method into a new function called run_method. Added a new function that does the cleanup (terminate the process and exit the threads) .../miq_automation_engine/engine/miq_ae_method.rb | 151 +++++++++------------ 1 file changed, 65 insertions(+), 86 deletions(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=110cabaa0be9a79706c4f2988673a61f7c148512 commit 110cabaa0be9a79706c4f2988673a61f7c148512 Author: Madhu Kanoor <mkanoor> AuthorDate: Tue Sep 15 16:59:39 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:46:09 2015 -0400 Implemented PR feedback https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Use chomp instead of strip Removed the extra require vmdb/lib/miq_automation_engine/engine/miq_ae_method.rb | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=e5cd029f919350e1ae294857611e66fd56cb39ff commit e5cd029f919350e1ae294857611e66fd56cb39ff Author: Madhu Kanoor <mkanoor> AuthorDate: Fri Sep 18 16:08:28 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:46:09 2015 -0400 PR Review changes https://bugzilla.redhat.com/show_bug.cgi?id=1258648 vmdb/lib/miq_automation_engine/engine/miq_ae_method.rb | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=c869617a827f827af086fe85e59473482257e1ad commit c869617a827f827af086fe85e59473482257e1ad Author: Madhu Kanoor <mkanoor> AuthorDate: Mon Sep 21 10:11:29 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:46:09 2015 -0400 Added comments around long running method https://bugzilla.redhat.com/show_bug.cgi?id=1258648 Reduce the sleep time for the unresponsive method spec to 1 minute. Added comments about the sleep in the Automate method spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
New commit detected on cfme/5.3.z: https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=2c830c6f52f08d0d791577db257dcb05ad15fc0c commit 2c830c6f52f08d0d791577db257dcb05ad15fc0c Author: Madhu Kanoor <mkanoor> AuthorDate: Tue Sep 22 14:28:17 2015 -0400 Commit: Madhu Kanoor <mkanoor> CommitDate: Wed Sep 23 14:46:10 2015 -0400 Moved new spec file to correct directory https://bugzilla.redhat.com/show_bug.cgi?id=1258648 .../miq_ae_method_dispatch_spec.rb | 110 --------------------- .../miq_ae_method_dispatch_spec.rb | 110 +++++++++++++++++++++ 2 files changed, 110 insertions(+), 110 deletions(-) delete mode 100644 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb create mode 100644 vmdb/spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
I'm not going to be able to say definitely that this issue is resolved as I've been seen or been able to reproduce it. If it's still in there, it's hiding. I can say that none of the above errors appear on any of my appliances and I run and maintain a large collection of windows VMs. Additionally, I look at the code changes and it seems that if we are timing out stale issues after 60 seconds, we're probably not going to run into the 600 second timeout. So, I'm moving this to Verified. Please let me know if you don't like my proposal and we can come up with another plan.
The way to test this is documented in https://bugzilla.redhat.com/show_bug.cgi?id=1265786
Awesome, let me work with Pete.
Well, that was easy enough to check. Thanks for the additional instructions. Moving to Verified using 5.5.0.11 running on CFME Hyper-V https://10.16.6.164/miq_ae_customization/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2551