Bug 1265786
Summary: | State=<GetDiskInfoWindows> running raised exception: <execution expired> | |||
---|---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | mkanoor | |
Component: | Automate | Assignee: | mkanoor | |
Status: | CLOSED ERRATA | QA Contact: | Pete Savage <psavage> | |
Severity: | urgent | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 5.4.0 | CC: | cpelland, dajohnso, jdeubel, jhardy, jocarter, mfeifer, mkanoor, obarenbo, snansi, tfitzger | |
Target Milestone: | GA | Keywords: | ZStream | |
Target Release: | 5.4.3 | |||
Hardware: | All | |||
OS: | All | |||
Whiteboard: | ||||
Fixed In Version: | 5.4.3.0 | Doc Type: | Bug Fix | |
Doc Text: |
In the previous version of CloudForms Management Engine, a timeout and an execution expired exception could be raised if an automate model method ended due to a segmentation fault. The segmentation fault was caused by a bug in the GSSAPI gem that caused one pointer to be released twice. Multiple aspects of the code were fixed to handle the segmentation fault: the stderr and stdout streams were flushed in separate threads, the automate method was terminated correctly if it timed out, and the stdout and stderr threads were terminated in the ensure block of the automate method. The segmentation fault itself was fixed by upgrading to a newer version of the GSSAPI library. Automate methods using GSSAPI calls work as expected in the new version of CloudForms Management Engine.
|
Story Points: | --- | |
Clone Of: | 1258648 | |||
: | 1265787 (view as bug list) | Environment: | ||
Last Closed: | 2015-10-22 14:34:06 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1258648 | |||
Bug Blocks: | 1265787 |
Comment 3
Pete Savage
2015-10-06 11:21:24 UTC
There are two things to check here. Can the Automate Engine:
(1) Properly report the STDERR messages
(2) Handle cleanup/termination of long-running processes

The failure occurred because we were not reading from STDERR. If you have the old 5.3 version of CFME, you can create a simple Automate method that just writes to STDERR. This causes a hang, because we wait on STDOUT before we start reading from STDERR. Some of this was fixed in 5.4, where we read from STDERR before we read from STDOUT.

The Automate method could have just one line:

    1000.times { STDERR.puts "Hello" }

In 5.2/5.3 this method causes a hang, and after about 10 minutes, which is the Queue timeout, we would see a stack trace. The Automate method process would be left hanging in the system; if you run this multiple times, these processes accumulate.

    puts "Sleeping for 700 seconds"
    sleep(700)
    1000.times { STDERR.puts "Hello" }

In earlier versions of 5.4 we had some logic to empty out STDERR before we empty out STDOUT.

The other test should check what happens to long-running Automate methods that just sleep and don't respond. In the old code we would leave these processes in the system. With the new changes we terminate the long-running process, log a message, and stop processing the rest of the automate request. The Queue timeout is 10 minutes; after 10 minutes we start cleaning up the automate methods.

I took a stab at recreating this. I created the method that sleeps for 700 seconds. On an older appliance, 5.4.1, the code ran for the full time and was not cleaned up when run via Simulate. On the newer appliance the same thing happened. Both methods ended with MIQ_OK and displayed their error log lines. Is Simulate able to test for the Queue timeout, or do I need to invoke it differently?

Simulate doesn't go through the queue, and hence the queue timeout won't help. You would have to do a provision request and insert this method as one of the state methods that get executed.

Thanks mkanoor! I'll give this a go.

mkanoor, I did as you suggested, adding my method into the Acquiring IP Address. I see that the requests error out after around 10 minutes. I'm guessing this is the correct behaviour? This was the same in 5.4.3 and 5.4.1.

Yes, that's correct.

Verified 5.4.3.0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1916.html
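
For readers who want to see the shape of the fix summarized in the Doc Text (reading stdout and stderr in separate threads, terminating the method on timeout, and stopping the reader threads in an ensure block), here is a minimal Ruby sketch. It is not the CFME code: `run_with_timeout`, its parameters, the 5-second reader grace period, and the returned hash are all hypothetical names and choices made for illustration, and it assumes a POSIX system where `Process.kill('KILL', ...)` is available.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper for illustration only (not the CFME implementation).
# Runs cmd (an array of program and arguments), drains stdout and stderr
# concurrently, and kills the child if it outlives timeout_seconds.
def run_with_timeout(cmd, timeout_seconds)
  stdin, stdout, stderr, wait_thr = Open3.popen3(*cmd)
  stdin.close
  out = String.new
  err = String.new

  # One reader thread per stream, so a full stderr pipe cannot block the
  # child while the parent is waiting on stdout (and vice versa).
  readers = [
    Thread.new { stdout.each_line { |line| out << line } },
    Thread.new { stderr.each_line { |line| err << line } }
  ]

  timed_out = false
  # Thread#join returns nil when the limit passes before the child exits.
  if wait_thr.join(timeout_seconds).nil?
    timed_out = true
    begin
      Process.kill('KILL', wait_thr.pid)
    rescue Errno::ESRCH
      # The child exited between the timeout check and the kill.
    end
    wait_thr.join
  end

  # Give the readers a moment to reach end-of-file on the closed pipes.
  readers.each { |t| t.join(5) }
  { stdout: out, stderr: err, timed_out: timed_out, status: wait_thr.value }
ensure
  # Mirror the "ensure block" part of the fix: never leave reader threads
  # or pipe handles behind, even if something above raised.
  readers.each(&:kill) if readers
  [stdout, stderr].each { |io| io.close if io && !io.closed? }
end
```

Calling `run_with_timeout([RbConfig.ruby, '-e', '1000.times { STDERR.puts "Hello" }'], 30)` exercises the stderr case from the comments above, and passing a script that only sleeps with a short timeout exercises the cleanup path. Killing with `KILL` is the bluntest option; a gentler variant could send `TERM` first and escalate only if the child does not exit.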