Bug 1265786

Summary: State=<GetDiskInfoWindows> running raised exception: <execution expired>
Product: Red Hat CloudForms Management Engine
Reporter: mkanoor
Component: Automate
Assignee: mkanoor
Status: CLOSED ERRATA
QA Contact: Pete Savage <psavage>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 5.4.0
CC: cpelland, dajohnso, jdeubel, jhardy, jocarter, mfeifer, mkanoor, obarenbo, snansi, tfitzger
Target Milestone: GA
Keywords: ZStream
Target Release: 5.4.3
Hardware: All
OS: All
Whiteboard:
Fixed In Version: 5.4.3.0
Doc Type: Bug Fix
Doc Text:
In the previous version of CloudForms Management Engine, a timeout and an execution expired exception could be raised if an automate model method ended due to a segmentation fault. The segmentation fault was caused by a bug in the GSSAPI gem that caused one pointer to be released twice. Multiple aspects of the code were fixed to handle the segmentation fault: the stderr and stdout streams are now flushed in separate threads, the automate method is terminated correctly if it times out, and the stdout and stderr threads are terminated in the ensure block of the automate method. The segmentation fault itself was fixed by upgrading to a newer version of the GSSAPI library. Automate methods using GSSAPI calls work as expected in the new version of CloudForms Management Engine.
Story Points: ---
Clone Of: 1258648
: 1265787 (view as bug list)
Environment:
Last Closed: 2015-10-22 14:34:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1258648
Bug Blocks: 1265787

Comment 3 Pete Savage 2015-10-06 11:21:24 UTC
Do we have a simple reproducer for this?

Comment 4 mkanoor 2015-10-06 15:03:22 UTC
There are two things to check here.
Can the Automate Engine:
(1) Properly report the STDERR messages
(2) Handle cleanup/termination of long-running processes

The failure occurred because we were not reading from STDERR.
If you have the old 5.3 version of CFME, you can create a simple Automate method that just writes to STDERR.
This would cause a hang, because we were waiting on STDOUT before we started reading from STDERR. Some of this was fixed in 5.4, where we read from STDERR first before we read from STDOUT.
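
As a rough illustration of the fix described in the Doc Text (reading both streams in separate threads), here is a minimal sketch. It is not the actual Automate Engine code: the helper name run_and_capture is hypothetical, and it assumes the child is launched with Ruby's standard Open3.popen3.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper (not from CFME): run a command and drain both
# output pipes concurrently.
def run_and_capture(*cmd)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # One reader thread per stream, so the parent never waits on one pipe
    # while the child is blocked writing to the other.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }
    [out_reader.value, err_reader.value, wait_thr.value]
  end
end

# Child process that only writes to STDERR, like the reproducer in this comment.
out, err, status = run_and_capture(RbConfig.ruby, "-e", '1000.times { STDERR.puts "Hello" }')
puts "stdout bytes: #{out.bytesize}, stderr lines: #{err.lines.size}, success: #{status.success?}"
```

Neither stream is read "first" here, so the order in which the child writes does not matter.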

The Automate method could have just one line:

1000.times { STDERR.puts "Hello" }

In 5.2/5.3 this method will cause a hang, and after about 10 minutes, which is the Queue timeout, we would see a stack trace. The Automate method process would be left hanging in the system; if you run this multiple times, these processes would accumulate.


puts "Sleeping for 700 seconds"
sleep(700)
1000.times { STDERR.puts "Hello" }

In earlier versions of 5.4 we had some logic to empty out STDERR before we empty out STDOUT.

The other test should cover what happens to long-running automate methods that just sleep and don't respond. In the old code we would leave these processes in the system. With the new changes we terminate the long-running process, log a message, and stop processing the rest of the automate request.

The Queue timeout is 10 minutes; after 10 minutes we start cleaning up the automate methods.
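
The cleanup behaviour can be sketched roughly as follows. This is only an illustration under assumptions, not the CFME implementation: run_with_timeout is a hypothetical helper, the 1-second limit stands in for the real 10-minute Queue timeout, and the child is killed with SIGKILL for simplicity.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper (not from CFME): run a command, terminate it if it
# exceeds `limit` seconds, and always clean up the reader threads.
def run_with_timeout(cmd, limit)
  stdin, stdout, stderr, wait_thr = Open3.popen3(*cmd)
  stdin.close
  readers = [Thread.new { stdout.read }, Thread.new { stderr.read }]

  if wait_thr.join(limit)
    { status: :finished, stdout: readers[0].value, stderr: readers[1].value }
  else
    begin
      Process.kill("KILL", wait_thr.pid) # terminate the long-running child
    rescue Errno::ESRCH
      # The child exited on its own just after the deadline.
    end
    wait_thr.join # reap the child so it does not linger in the system
    { status: :timed_out }
  end
ensure
  # Stop the stdout/stderr reader threads and close the pipes in every case.
  readers&.each(&:kill)
  [stdout, stderr].compact.each { |io| io.close unless io.closed? }
end

# A child that sleeps far longer than the limit, like the 700-second method.
result = run_with_timeout([RbConfig.ruby, "-e", "sleep 30"], 1)
puts "long-running child: #{result[:status]}"
```

Because the reader threads are killed in the ensure block, they are cleaned up whether the method finishes, times out, or raises.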

Comment 5 Pete Savage 2015-10-09 09:04:44 UTC
I took a stab at recreating this.

I created the one that sleeps for 700 seconds. On an older appliance, 5.4.1, the code ran for the full time and was not cleaned up when run via Simulate.

On the newer appliance the same thing happened. Both methods ended with MIQ_OK and displayed their error log lines.

Is Simulate able to test for the Queue timeout? Or do I need to invoke it differently?

Comment 6 mkanoor 2015-10-09 13:33:09 UTC
Simulate doesn't go through the queue, and hence the queue timeout won't help. You would have to do a provision request and insert this method as one of the state methods that get executed.

Comment 7 Pete Savage 2015-10-09 13:38:40 UTC
Thanks mkanoor! I'll give this a go.

Comment 8 Pete Savage 2015-10-09 20:00:05 UTC
mkanoor, I did as you suggested, adding my method into the Acquiring IP Address, and I see that the requests error out after around 10 minutes. I'm guessing this is the correct behaviour? This was the same in 5.4.3 and 5.4.1.

Comment 9 mkanoor 2015-10-12 18:44:33 UTC
Yes, that's correct.

Comment 10 Pete Savage 2015-10-12 18:45:13 UTC
Verified 5.4.3.0

Comment 12 errata-xmlrpc 2015-10-22 14:34:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1916.html