Bug 829455

Summary:	allow script plugin to execute an operation but not wait for it to finish
Product:	[Other] RHQ Project	Reporter:	John Mazzitelli <mazz>
Component:	Plugins	Assignee:	John Mazzitelli <mazz>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Mike Foley <mfoley>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	hrupp
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-09-01 19:18:44 UTC	Type:	Bug

Description John Mazzitelli 2012-06-06 19:36:28 UTC

A customer use-case is as follows:

1) Create a shell script that uses the RHQ CLI/remote API to call into the RHQ server to trigger operations on resources.
2) Import that shell script as a "Script Server" resource itself - using the script plugin
3) Use the "execute" operation on that Script Server resource to run it

This means that the RHQ Agent will be executing one operation (the Script Server operation) while quite possibly being asked to invoke another operation (as a result of what the script asks the RHQ Server to do). This has been seen to cause a deadlock: the second operation is not started until the first operation finishes, but the first cannot finish until all its work is done, which includes finishing the second operation.

I'm not sure why this deadlock occurs, because the agent should be able to invoke operations in parallel when the operations are on different resources. I'd have to try to replicate this to get a definitive answer as to why it's happening. But, suffice it to say, whatever the customer's script is doing, this has been seen.

The workaround is for the script to launch its work in a background process, thus exiting the script quickly (leaving the background process to do the real work).
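A minimal sketch of that workaround, assuming a POSIX shell. The names `do_real_work`, `launch_in_background` and `LOG_FILE` are hypothetical, and `do_real_work` is only a placeholder for whatever RHQ CLI calls the customer's script makes. The point it illustrates is that the background job's streams are redirected, so nothing keeps the launching script (or whoever is reading its output) waiting.

```shell
#!/bin/sh
# Hypothetical wrapper script: start the real work in the background and exit quickly.

# Placeholder for the real work (for example, RHQ CLI calls). Not an RHQ command.
do_real_work() {
    sleep 1
    echo "work finished"
}

# Assumed log location; the background job writes here instead of to the caller.
LOG_FILE="${LOG_FILE:-${TMPDIR:-/tmp}/script-server-background.log}"

launch_in_background() {
    # Redirect stdin, stdout and stderr so the background job does not hold
    # the caller's pipes open, then report the job's PID.
    do_real_work >"$LOG_FILE" 2>&1 </dev/null &
    echo "$!"
}

pid=$(launch_in_background)
echo "started background work as PID $pid"
```

The script returns as soon as the PID is printed. The exit code and output of the real work are only available from the log file, not from the operation result.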

We could add an enhancement to the script plugin: either enhance the current script plugin's "execute" operation OR add a new operation. This operation could be configurable (via operation parameters) so that it returns to the server and ends without waiting for the script to finish.

Several downsides to this:

a) you will not be able to know the script's true exit code because it won't wait for the real work to complete
b) you will not be able to know the script's output of that real work
c) if that background process hangs, we won't know, and it won't die because we will have no way to kill it (today, if the hardcoded 1 hour timeout is exceeded, we do attempt to kill the process in order to clean up the process table).

Comment 1 John Mazzitelli 2012-06-07 18:56:53 UTC

git commit to master: 0c0ff7a

This changes the script plugin's descriptor and component class. It adds three new optional parameters to the execute operation:

  <c:simple-property name="waitTime" required="false" type="long" description="The number of seconds to wait for the process to end. If 0 or less, the operation will return immediately without waiting for the script to complete. Default is one hour." />
  <c:simple-property name="captureOutput" required="false" type="boolean" description="If true, the script's output will be captured and returned. Default is true." />
  <c:simple-property name="killOnTimeout" required="false" type="boolean" description="If true, and if waitTime parameter is greater than 0, the script will be forcibly terminated if the wait time expires before the script finishes. In other words, the script process will be killed if it times out. Default is true." />

For the use-case that triggered this BZ, you can set waitTime to 0 - this invokes the script in a fire-and-forget manner. It launches the script, but returns immediately and doesn't wait for it to finish.
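The following is an illustrative sketch only, in POSIX shell, of the behaviour the parameter descriptions above spell out for waitTime and killOnTimeout. It is not the plugin's code (the plugin is a Java component), and the function name `run_with_wait` and its argument order are assumptions made for this example. captureOutput is not modelled: the child's output is simply discarded.

```shell
#!/bin/sh
# Illustrative only: mimics the described waitTime / killOnTimeout semantics.
# Usage: run_with_wait WAIT_SECONDS KILL_ON_TIMEOUT COMMAND [ARGS...]
run_with_wait() {
    wait_time=$1
    kill_on_timeout=$2
    shift 2

    # Detach the child's streams so a fire-and-forget launch cannot block the caller.
    "$@" >/dev/null 2>&1 </dev/null &
    child=$!

    # waitTime of 0 or less: return immediately; exit code and output stay unknown.
    if [ "$wait_time" -le 0 ]; then
        echo "launched"
        return 0
    fi

    elapsed=0
    while kill -0 "$child" 2>/dev/null; do
        if [ "$elapsed" -ge "$wait_time" ]; then
            # Wait time expired: terminate the child only if killOnTimeout is true.
            if [ "$kill_on_timeout" = "true" ]; then
                kill "$child" 2>/dev/null
            fi
            echo "timed out"
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done

    # The child ended within the wait time, so its real exit code is available.
    wait "$child"
    echo "exit code: $?"
}
```

For example, `run_with_wait 0 true sleep 60` returns at once and leaves `sleep` running, while `run_with_wait 1 true sleep 60` gives up after about a second and terminates it.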

Comment 2 John Mazzitelli 2012-06-11 21:39:40 UTC

See bug 830996, which addresses a fix for the use case described in this issue's description.

Comment 3 Heiko W. Rupp 2013-09-01 19:18:44 UTC

Bulk closing of BZs that have no target version set, but which have been ON_QA for more than a year and thus have been in production for a long time.