Bug 1258648 - State=<GetDiskInfoWindows> running raised exception: <execution expired>
State=<GetDiskInfoWindows> running raised exception: <execution expired>
Status: CLOSED ERRATA
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate (Show other bugs)
5.3.0
All All
unspecified Severity urgent
: GA
: 5.5.0
Assigned To: mkanoor
Jeff Teehan
: ZStream
Depends On:
Blocks: 1265786 1265787
  Show dependency treegraph
 
Reported: 2015-08-31 18:25 EDT by Josh Carter
Modified: 2017-09-20 15:22 EDT (History)
10 users (show)

See Also:
Fixed In Version: 5.5.0.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1265786 (view as bug list)
Environment:
Last Closed: 2015-12-08 08:29:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Proposed Automate GSSAPI SEGFAULT Fault (4.63 KB, text/plain)
2015-09-08 16:01 EDT, mkanoor
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:2551 normal SHIPPED_LIVE Moderate: CFME 5.5.0 bug fixes and enhancement update 2015-12-08 12:58:09 EST

  None (edit)
Description Josh Carter 2015-08-31 18:25:16 EDT
Description of problem:

Model method ends and does not start next method. Instead CloudForms will wait until the timeout and error.

[----] I, [2015-08-31T10:51:06.578562 #25294:6466b34]  INFO -- : Q-task_id([automation_task_11000000003179]) <AEMethod get_disk_data> get_disk_info - CloudForms Automate Method Ended
[----] E, [2015-08-31T11:00:57.983707 #25294:531010] ERROR -- : Q-task_id([automation_task_11000000003179]) State=<GetDiskInfoWindows> running  raised exception: <execution expired>
[----] W, [2015-08-31T11:00:57.984068 #25294:531010]  WARN -- : Q-task_id([automation_task_11000000003179]) Error in State=[GetDiskInfoWindows]


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 4 mkanoor 2015-09-01 12:13:32 EDT
The log snippets don't seem to match, from the error message in the log snippet
raised exception: <execution expired>
Indicates that the Automate method had errors when it was talking to external servers using the rest-client or the winrm.

Suggested that they rescue Timeout::Error and either retry or fail appropriately based on the customer needs.
Comment 9 mkanoor 2015-09-01 18:31:53 EDT
Josh,
Can we please get the log with the backtrace included.
Thanks
Comment 12 mkanoor 2015-09-02 16:07:49 EDT
On further analysis we found that every automate request is constrained by the queue message timeout. The automate request has to end within the msg_timeout for the queue entry, which currently defaults to 600 seconds (10 minutes). If the request doesn't end in 10 minutes, we get a timeout error, which because of a bug in our code doesn't get propagated properly and the cleanup doesn't happen. If we have an automate method that gets dispatched as part of the request the method becomes a detached process (zombies) which will not get cleaned and will accumulate in the system till the next restart of the OS. We have to fix the propagation of the timeout error from from the request and delete any automate process that we might have launched.

In the customers scenario the automate method that uses winrm is taking upwards of 10 minutes to execute when the queue terminates the automate request. Even after we fix the propogation of the error code we still would be hitting the timeout. It is recommended that the customer write the automate methods as separate steps that they can use from inside Automate state machine which has a built in retry logic.
Comment 15 mkanoor 2015-09-04 16:25:07 EDT
Analysis so far.

The crux of the issue "seems" to be a bug in the gssapi gem releases the same pointer twice and gets a segfault. This causes reams of data being written to the STDERR.

This is a known bug listed here
https://github.com/zenchild/gssapi/issues/12

The customer is using CFME 5.3 which uses an older version of gssapi (1.1.2) which does have the bug fixed.

This bug manifests itself in CFME in the following ways

(1) Since we were not reading from STDERR, the reams of data that got generated during the segfault gets deadlocked waiting for someone to empty the STDERR channel. The automate method waits.

(2) After 10 minutes the Queue timeout kicks in and terminates the Parent thread that had started the Automate Request, because of bugs in the way that timeout was being handled we didn't cleanup the Automate method and left it hanging as a detached process.

We are trying to see if we can apply the patch to the gssapi library to confirm that this is the source of the problem. The Automate method that they customer uses the winrm and gossip gems.
Comment 16 mkanoor 2015-09-08 09:15:39 EDT
In the above comment I incorrectly state that it is fixed in gssapi 1.1.2. It has not been fixed in 1.1.2. It has only been fixed in 1.2
Comment 18 mkanoor 2015-09-08 16:01:30 EDT
Created attachment 1071477 [details]
Proposed Automate GSSAPI SEGFAULT Fault

This code can be inserted on top of the Automate methods that use the GSSAPI version 1.1.2 gem. It monkey patches the GSSAPI context methods so that we check the context before creating new ones as explained in the Pull Request.
This should only be used if the GSSAPI gem cannot be updated on all servers.
Using the Automate method patch saves the trouble of updating all servers to have the latest gssapi.
Version 1.2 of gssapi has this issue fixed.
Comment 24 mkanoor 2015-09-23 17:50:47 EDT
In order to test this issue in 5.4 you can have a long running automate method that does 2 things

(1) Writes lots of information to stderr
(2) Sleeps over 10 minutes, which is the default queue timeout. At the end of 10 minutes the Queue worker will terminate the automate thread leaving behind orphaned methods.

The fixes in here will make sure that stderr gets drained properly and long running methods don't leave orphaned processes around. The cleanup after 10 minutes should terminate the methods and there should be appropriate log messages.
Comment 25 CFME Bot 2015-09-24 00:22:07 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/a9470c323e3330bab9d519ff3e51d0e24b96e5fa

commit a9470c323e3330bab9d519ff3e51d0e24b96e5fa
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Tue Sep 15 16:59:39 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Mon Sep 21 18:16:44 2015 -0400

    Implemented PR feedback
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Use chomp instead of strip
    Removed the extra require

 lib/miq_automation_engine/engine/miq_ae_method.rb | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
Comment 26 CFME Bot 2015-09-24 00:27:22 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/0fa5243e5bfacbbc57382b8f2f2bb191dd090ac4

commit 0fa5243e5bfacbbc57382b8f2f2bb191dd090ac4
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Fri Sep 18 16:08:28 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Mon Sep 21 18:16:44 2015 -0400

    PR Review changes
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648

 lib/miq_automation_engine/engine/miq_ae_method.rb | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
Comment 27 CFME Bot 2015-09-24 00:28:07 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/47266059a5ba66385916a53ba3d3ef6ae50a196d

commit 47266059a5ba66385916a53ba3d3ef6ae50a196d
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Thu Sep 10 11:23:48 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Mon Sep 21 18:16:44 2015 -0400

    Fixed Rubocop warnings
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648

 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 10 +++++-----
 spec/support/automation_spec_helper.rb                        |  8 ++++----
 2 files changed, 9 insertions(+), 9 deletions(-)
Comment 28 CFME Bot 2015-09-24 00:28:37 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/def6df099dc070d16d276171221f65948fcd9b61

commit def6df099dc070d16d276171221f65948fcd9b61
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Thu Sep 10 10:07:12 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Mon Sep 21 18:16:44 2015 -0400

    Fixed Automate Method handling
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Addressed the following issues
    (1) Flush stderr and stdout in separate threads, prior to this it would
        cause deadlock if the stderr was not drained.
    (2) Ensure block makes sure that the Automate method is terminated
        if it doesn't respond by the :msg_timeout specified in the queue.
        This used to leave orphaned processes.
    (3) Terminate the stdout and stderr reading threads in the ensure block.
    
    Added a spec with automate method that writes to stdout/stderr and sleeps
    waiting to be terminated.

 lib/miq_automation_engine/engine/miq_ae_method.rb  |  62 +++++++++---
 .../miq_ae_method_dispatch_spec.rb                 | 105 +++++++++++++++++++++
 spec/support/automation_spec_helper.rb             |   8 +-
 3 files changed, 160 insertions(+), 15 deletions(-)
 create mode 100644 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
Comment 29 CFME Bot 2015-09-24 00:29:52 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/88232c7214a6070e88856a3c64b4c61ed3e67574

commit 88232c7214a6070e88856a3c64b4c61ed3e67574
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Mon Sep 14 14:33:32 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Mon Sep 21 18:16:44 2015 -0400

    Fixed based on PR comments
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Combined the shared code in invoke_external and run_ruby_method into
    a new function called run_method. Added a new function that does the
    cleanup (terminate the process and exit the threads)

 lib/miq_automation_engine/engine/miq_ae_method.rb | 151 ++++++++++------------
 1 file changed, 65 insertions(+), 86 deletions(-)
Comment 30 CFME Bot 2015-09-24 00:40:23 EDT
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/580cd638b0be9ee430209ebe6b93adc25c81adf1

commit 580cd638b0be9ee430209ebe6b93adc25c81adf1
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Mon Sep 21 10:11:29 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Mon Sep 21 18:16:44 2015 -0400

    Added comments around long running method
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Reduce the sleep time for the unresponsive method spec to 1 minute.
    Added comments about the sleep in the Automate method

 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
Comment 32 CFME Bot 2015-09-24 07:19:40 EDT
Detected commit referencing this ticket while ticket status is POST.
Comment 33 CFME Bot 2015-09-24 10:07:56 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=5159a2c3487601d8337219135972a9d4b9c83654

commit 5159a2c3487601d8337219135972a9d4b9c83654
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Thu Sep 10 10:07:12 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:44:05 2015 -0400

    Fixed Automate Method handling
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Addressed the following issues
    (1) Flush stderr and stdout in separate threads, prior to this it would
        cause deadlock if the stderr was not drained.
    (2) Ensure block makes sure that the Automate method is terminated
        if it doesn't respond by the :msg_timeout specified in the queue.
        This used to leave orphaned processes.
    (3) Terminate the stdout and stderr reading threads in the ensure block.
    
    Added a spec with automate method that writes to stdout/stderr and sleeps
    waiting to be terminated.

 .../miq_ae_method_dispatch_spec.rb                 | 105 +++++++++++++++++++++
 .../miq_automation_engine/engine/miq_ae_method.rb  |  71 ++++++++++----
 vmdb/spec/support/automation_spec_helper.rb        |  46 +++++++++
 3 files changed, 202 insertions(+), 20 deletions(-)
 create mode 100644 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
Comment 34 CFME Bot 2015-09-24 10:08:00 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=4a50370eadcda11e59141028655111752a3b30c4

commit 4a50370eadcda11e59141028655111752a3b30c4
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Thu Sep 10 11:23:48 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:44:05 2015 -0400

    Fixed Rubocop warnings
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648

 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 10 +++++-----
 vmdb/spec/support/automation_spec_helper.rb                   |  8 ++++----
 2 files changed, 9 insertions(+), 9 deletions(-)
Comment 35 CFME Bot 2015-09-24 10:08:06 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=b173e756c2a73251e40a01de16a1b97d2fcd8d91

commit b173e756c2a73251e40a01de16a1b97d2fcd8d91
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Mon Sep 14 14:33:32 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:46:09 2015 -0400

    Fixed based on PR comments
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Combined the shared code in invoke_external and run_ruby_method into
    a new function called run_method. Added a new function that does the
    cleanup (terminate the process and exit the threads)

 .../miq_automation_engine/engine/miq_ae_method.rb  | 151 +++++++++------------
 1 file changed, 65 insertions(+), 86 deletions(-)
Comment 36 CFME Bot 2015-09-24 10:08:11 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=110cabaa0be9a79706c4f2988673a61f7c148512

commit 110cabaa0be9a79706c4f2988673a61f7c148512
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Tue Sep 15 16:59:39 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:46:09 2015 -0400

    Implemented PR feedback
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Use chomp instead of strip
    Removed the extra require

 vmdb/lib/miq_automation_engine/engine/miq_ae_method.rb | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)
Comment 37 CFME Bot 2015-09-24 10:08:16 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=e5cd029f919350e1ae294857611e66fd56cb39ff

commit e5cd029f919350e1ae294857611e66fd56cb39ff
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Fri Sep 18 16:08:28 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:46:09 2015 -0400

    PR Review changes
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648

 vmdb/lib/miq_automation_engine/engine/miq_ae_method.rb | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
Comment 38 CFME Bot 2015-09-24 10:08:21 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=c869617a827f827af086fe85e59473482257e1ad

commit c869617a827f827af086fe85e59473482257e1ad
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Mon Sep 21 10:11:29 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:46:09 2015 -0400

    Added comments around long running method
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648
    
    Reduce the sleep time for the unresponsive method spec to 1 minute.
    Added comments about the sleep in the Automate method

 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
Comment 39 CFME Bot 2015-09-24 10:08:26 EDT
New commit detected on cfme/5.3.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=2c830c6f52f08d0d791577db257dcb05ad15fc0c

commit 2c830c6f52f08d0d791577db257dcb05ad15fc0c
Author:     Madhu Kanoor <mkanoor@redhat.com>
AuthorDate: Tue Sep 22 14:28:17 2015 -0400
Commit:     Madhu Kanoor <mkanoor@redhat.com>
CommitDate: Wed Sep 23 14:46:10 2015 -0400

    Moved new spec file to correct directory
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1258648

 .../miq_ae_method_dispatch_spec.rb                 | 110 ---------------------
 .../miq_ae_method_dispatch_spec.rb                 | 110 +++++++++++++++++++++
 2 files changed, 110 insertions(+), 110 deletions(-)
 delete mode 100644 spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
 create mode 100644 vmdb/spec/lib/miq_automation_engine/miq_ae_method_dispatch_spec.rb
Comment 40 Jeff Teehan 2015-11-17 13:21:27 EST
I'm not going to be able to say definitely that this issue is resolved as I've been seen or been able to reproduce it.  If it's still in there, it's hiding.

I can say that none of the above errors appear on any of my appliances and I run and maintain a large collection of windows VMs.

Additionally, I look at the code changes and it seems that if we are timing out stale issues after 60 seconds, we're probably not going to run into the 600 second timeout.

So, I'm moving this to Verified.  Please let me know if you don't like my proposal and we can come up with another plan.
Comment 41 mkanoor 2015-11-17 16:47:51 EST
The way to test this is documented in 
https://bugzilla.redhat.com/show_bug.cgi?id=1265786
Comment 42 Jeff Teehan 2015-11-17 16:55:18 EST
Awesome, let me work with Pete.
Comment 43 Jeff Teehan 2015-11-20 15:56:13 EST
Well, that was easy enough to check.  Thanks for the additional instructions.

Moving to Verified using 5.5.0.11 running on CFME Hyper-V https://10.16.6.164/miq_ae_customization/
Comment 45 errata-xmlrpc 2015-12-08 08:29:00 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551

Note You need to log in before you can comment on or make changes to this bug.