Description of problem: The condor RA produces metadata that specifies a timeout of only 5 seconds for actions such as start, stop, and restart. For start this isn't a big deal, since starting the daemon is essentially fire-and-forget, but the RA's stop action waits up to 600 seconds to allow for a slow shutdown. During a long shutdown, rgmanager therefore assumes failure after 5 seconds instead of letting the RA complete the shutdown process.
Corrected the timeout values in the metadata. Fixed upstream on: V7_6-branch
Where can I find Condor RA metadata? How can I test that timeouts have changed?
The RA works like an initscript, so run it with the "meta-data" argument ("condor meta-data") to get the metadata. The script is:
/usr/share/cluster/condor.sh
# /usr/share/cluster/condor.sh meta-data
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1-modified.dtd">
<resource-agent version="rgmanager 2.0" name="condor">
    <version>1.0</version>
    <longdesc lang="en">
        condor resource agent
    </longdesc>
    <shortdesc lang="en">
        condor resource agent
    </shortdesc>
    <parameters>
        <parameter name="name" unique="1" primary="1">
            <longdesc lang="en">
                The name passed to the condor subsystem type
            </longdesc>
            <shortdesc lang="en">
                condor subsystem name
            </shortdesc>
            <content type="string"/>
        </parameter>
        <parameter name="type" required="1">
            <longdesc lang="en">
                The type of condor subsystem
            </longdesc>
            <shortdesc lang="en">
                condor subsystem type
            </shortdesc>
            <content type="string"/>
        </parameter>
    </parameters>
    <actions>
        <action name="start" timeout="15"/>
        <action name="stop" timeout="605"/>
        <action name="recover" timeout="630"/>
        <action name="monitor" interval="30" timeout="5"/>
        <action name="status" interval="30" timeout="5"/>
        <action name="meta-data" timeout="5"/>
        <action name="validate-all" timeout="5"/>
    </actions>
</resource-agent>

Metadata updated.
condor-cluster-resource-agent-7.8.8-0.1.el6 >>> VERIFIED
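To check the timeouts from a script rather than by eye, here is a minimal sketch. It assumes the meta-data output uses double-quoted attributes as shown above and that GNU grep (for -o) is available. The function names action_timeout and stop_timeout_ok are hypothetical helpers, not part of the RA. The 600-second threshold is an illustrative choice taken from the RA's stop wait described in this report. This is a grep/sed heuristic, not an XML parser.

```shell
# action_timeout ACTION
#   Hypothetical helper: reads resource-agent meta-data XML on stdin and
#   prints the timeout attribute of the first <action> element whose name
#   matches ACTION. Assumes double-quoted attributes; not a real XML parser.
action_timeout() {
    grep -o "<action name=\"$1\"[^>]*>" |
        sed -n 's/.*timeout="\([0-9]*\)".*/\1/p' |
        head -n 1
}

# stop_timeout_ok MIN
#   Hypothetical helper: succeeds if the stop timeout advertised in the
#   meta-data on stdin is at least MIN seconds.
stop_timeout_ok() {
    _t=$(action_timeout stop)
    [ -n "$_t" ] && [ "$_t" -ge "$1" ]
}

# Example usage on a cluster node (not executed here):
#   /usr/share/cluster/condor.sh meta-data | action_timeout stop
#   /usr/share/cluster/condor.sh meta-data | stop_timeout_ok 600 && echo "stop timeout OK"
```

Before the fix, the stop check fails because the advertised timeout is 5. With the updated metadata above, it passes because the timeout is 605.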