Bug 828983

Summary:	condor resource agent start operation should have verification of startup
Product:	Red Hat Enterprise MRG	Reporter:	Robert Rati <rrati>
Component:	condor-cluster-resource-agent	Assignee:	Robert Rati <rrati>
Status:	CLOSED ERRATA	QA Contact:	Tomas Rusnak <trusnak>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	Development	CC:	matt, mkudlej, trusnak, tstclair
Target Milestone:	2.3
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	condor-7.8.6-0.1	Doc Type:	Bug Fix
Doc Text:	Cause: The condor resource agent used with RHHA did not verify that a daemon had started during a start operation. Consequence: The start operation could report success when the daemon had not actually started. Fix: The resource agent now waits 10 seconds for the process to appear. Result: When a start operation reports success, the daemon will always have started.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-03-06 18:44:15 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Robert Rati 2012-06-05 17:23:50 UTC

Description of problem:
Currently, the condor RA uses the return value of daemon to determine whether the process started.  That return value is not a reliable indicator of process state.  Instead, the start operation should check for the condor_schedd process and return success or failure based on whether the process exists.

Comment 1 Robert Rati 2012-06-20 17:32:09 UTC

The RA will now wait 10 seconds for the condor process to appear before returning success.

Fixed upstream on:
V7_6-branch
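
For illustration only, the wait described above could be sketched roughly as follows. This is not the code from the actual resource agent: the function name wait_for_start, the one-second polling interval, and the use of pgrep are all assumptions made for the example. The real RA may sleep once or check the process in some other way.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the condor resource agent.
# Re-runs the given check command once per second until it succeeds
# or the timeout (in seconds) has elapsed.
# Returns 0 if the check succeeded in time, non-zero otherwise.
wait_for_start() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Final check once the timeout has been reached.
    "$@" >/dev/null 2>&1
}

# Assumed usage inside a start operation:
#   wait_for_start 10 pgrep -x condor_schedd || exit 1
```

Passing the check as a command keeps the polling loop separate from how the process is detected, so the same loop works with a pid-file test or any other check.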

Comment 5 Tomas Rusnak 2013-02-20 15:10:52 UTC

Tested on:
RH-7.8.8-0.4.1

# ls -la /usr/sbin/condor_schedd
-rwxr-xr-x. 1 root root 12 Feb 20 10:00 /usr/sbin/condor_schedd

# clusvcadm -R "HA Schedd HASchedd1"
Local machine trying to restart service:HA Schedd HASchedd1...Success

# tail -f /var/log/cluster/rgmanager.log
Feb 20 10:07:45 rgmanager [condor] Stopping condor_schedd HASchedd1
Feb 20 10:07:45 rgmanager [condor] Starting condor_schedd HASchedd1
Feb 20 10:07:56 rgmanager [condor] Failed to start condor_schedd HASchedd1
Feb 20 10:08:05 rgmanager [netfs] Checking fs "Job Queue for HASchedd1", Level 0
Feb 20 10:08:15 rgmanager status on condor "HASchedd1" returned 7 (unspecified)
Feb 20 10:08:15 rgmanager [condor] Stopping condor_schedd HASchedd1
Feb 20 10:08:15 rgmanager [condor] Starting condor_schedd HASchedd1
Feb 20 10:08:25 rgmanager [netfs] Checking fs "Job Queue for HASchedd2", Level 0
Feb 20 10:08:25 rgmanager [netfs] Checking fs "Job Queue for HASchedd3", Level 0
Feb 20 10:08:26 rgmanager [condor] Failed to start condor_schedd HASchedd1

>>> VERIFIED

Comment 7 errata-xmlrpc 2013-03-06 18:44:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html