Do we have a simple reproducer for this?
There are 2 things to check here. Can the Automate Engine:
(1) properly report the STDERR messages, and
(2) handle cleanup/termination of long-running processes?

The failure was because we were not reading from STDERR. If you have the old 5.3 version of CFME, you can create a simple Automate method that just writes to STDERR. This causes a hang, because we wait on STDOUT before we start reading from STDERR. Some of this was fixed in 5.4, where we read from STDERR before we read from STDOUT. The Automate method could have just one line:

    1000.times { STDERR.puts "Hello" }

In 5.2/5.3 this method will cause a hang, and after about 10 minutes (the Queue timeout) we would see a stack trace. The Automate method process would be left hanging in the system; if you run this multiple times, these processes accumulate.

A variant that sleeps longer than the 10-minute Queue timeout before writing to STDERR:

    puts "Sleeping for 700 seconds"
    sleep(700)
    1000.times { STDERR.puts "Hello" }

In earlier versions of 5.4 we had some logic to empty out STDERR before emptying STDOUT.

The other test should cover what happens to long-running Automate methods that just sleep and don't respond. In the old code we would leave these processes in the system. With the new changes we terminate the long-running process, log a message, and stop processing the rest of the Automate request. The Queue timeout is 10 minutes; after 10 minutes we start cleaning up the Automate methods.
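The two checks described above (not hanging on STDERR output, and terminating a method that outlives its timeout) can be illustrated with a small standalone Ruby sketch. This is not the Automate engine's actual code: `run_method` is a hypothetical helper, it assumes the method runs as a child `ruby -e` process, and the timeouts are shortened for illustration. It uses only standard-library calls (`Open3.popen3`, `Thread#join` with a limit, `Process.kill`). One common cause of this kind of hang is a child blocking on a full pipe that the parent is not reading, so the sketch drains STDOUT and STDERR on separate threads.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper (not the CFME Automate engine's code): run a Ruby
# snippet in a child process, drain STDOUT and STDERR concurrently, and
# kill the child if it runs longer than timeout_secs.
def run_method(script, timeout_secs)
  Open3.popen3(RbConfig.ruby, "-e", script) do |stdin, stdout, stderr, wait_thr|
    stdin.close

    # One reader thread per pipe, so a child that floods STDERR cannot
    # block on a full pipe while the parent is waiting on STDOUT.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join(limit) returns nil if the child is still running at the deadline.
    timed_out = wait_thr.join(timeout_secs).nil?
    if timed_out
      begin
        Process.kill("KILL", wait_thr.pid) # terminate instead of leaving it behind
      rescue Errno::ESRCH
        # the child exited between the join and the kill; nothing to terminate
      end
      wait_thr.join # wait for the waiter thread to reap the killed child
    end

    { timed_out: timed_out,
      exitstatus: wait_thr.value.exitstatus, # nil when the child was killed
      stdout: out_reader.value,
      stderr: err_reader.value }
  end
end
```

For example, `run_method('100_000.times { STDERR.puts "Hello" }', 60)` should finish with exit status 0 and all 100,000 lines captured (a much larger volume than the 1000-line reproducer, chosen so the output exceeds a typical pipe buffer), while `run_method("sleep(700)", 2)` should return `timed_out: true` after about two seconds instead of leaving the process behind. A fuller version would probably send TERM before escalating to KILL and log the termination, as the new behaviour described above does.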
I took a stab at recreating this. I created the method that sleeps for 700 seconds. On an older appliance (5.4.1), the code ran for the full time and was not cleaned up when run via Simulate. On the newer appliance the same thing happened. Both methods ended with MIQ_OK and displayed their error log lines. Is Simulate able to test for the Queue timeout, or do I need to invoke it differently?
Simulate doesn't go through the queue, so the queue timeout won't apply. You would have to submit a provision request and insert this method as one of the state methods that get executed.
Thanks mkanoor! I'll give this a go
mkanoor, I did as you suggested and added my method to the Acquiring IP Address state. The requests error out after around 10 minutes. I'm guessing this is the correct behaviour? It was the same in 5.4.3 and 5.4.1.
Yes, that's correct.
Verified 5.4.3.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-1916.html