Bug 828983

Summary: condor resource agent start operation should have verification of startup
Product: Red Hat Enterprise MRG Reporter: Robert Rati <rrati>
Component: condor-cluster-resource-agent Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: Tomas Rusnak <trusnak>
Severity: medium Docs Contact:
Priority: medium    
Version: Development CC: matt, mkudlej, trusnak, tstclair
Target Milestone: 2.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: condor-7.8.6-0.1 Doc Type: Bug Fix
Doc Text:
Cause: The condor resource agent used with RHHA did not verify that a daemon had started during a start operation.
Consequence: The start operation could report success even though the daemon had not started.
Fix: The resource agent now waits 10 seconds for the process to appear.
Result: When a start operation reports success, the daemon has started.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-03-06 18:44:15 UTC Type: Bug

Description Robert Rati 2012-06-05 17:23:50 UTC
Description of problem:
Currently, the condor RA uses the return value of daemon to determine whether the process started. That return value is not a reliable indicator of process state. Instead, the start operation should check for the condor_schedd process and return success or failure based on whether the process exists.
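
To illustrate why a launcher's return value can mislead, here is a minimal sketch. It assumes the launcher returns as soon as the process has been forked. launch_in_background is a hypothetical stand-in for such a launcher, not the code in the actual RA, and the failing command stands in for a daemon that dies right after launch.

```shell
#!/bin/sh
# Hypothetical stand-in for a launcher that backgrounds a command and
# returns as soon as the fork succeeds (an assumption for illustration).
launch_in_background() {
    "$@" >/dev/null 2>&1 &
    LAUNCHED_PID=$!
    return 0
}

# A command that exits immediately, standing in for a daemon that fails to start.
launch_in_background sh -c 'exit 1'
launcher_status=$?

# Give the child a moment to exit, then check whether the process still exists.
sleep 1
if kill -0 "$LAUNCHED_PID" 2>/dev/null; then
    process_state=running
else
    process_state=gone
fi

echo "launcher exit status: $launcher_status, process: $process_state"
```

The launcher's status says nothing about whether the process is still alive a moment later, which is the gap the proposed process check is meant to close.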

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Robert Rati 2012-06-20 17:32:09 UTC
The RA will now wait 10 seconds for the condor process to appear before returning success.

Fixed upstream on:
V7_6-branch
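
The wait described in this comment could be sketched roughly as follows. This is not the code from the upstream fix: wait_until is a hypothetical helper that retries an arbitrary check command once per second until it succeeds or the timeout expires.

```shell
#!/bin/sh
# Hypothetical helper: run a check command once per second until it
# succeeds (return 0) or TIMEOUT seconds have elapsed (return 1).
# Usage: wait_until TIMEOUT COMMAND [ARGS...]
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

In a start operation, a call such as "wait_until 10 pgrep -x condor_schedd || exit 1" would report failure if no condor_schedd process shows up within 10 seconds. The use of pgrep here is an assumption; the actual RA may detect the process differently.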

Comment 5 Tomas Rusnak 2013-02-20 15:10:52 UTC
Tested on:
RH-7.8.8-0.4.1

# ls -la /usr/sbin/condor_schedd
-rwxr-xr-x. 1 root root 12 Feb 20 10:00 /usr/sbin/condor_schedd

# clusvcadm -R "HA Schedd HASchedd1"
Local machine trying to restart service:HA Schedd HASchedd1...Success

# tail -f /var/log/cluster/rgmanager.log
Feb 20 10:07:45 rgmanager [condor] Stopping condor_schedd HASchedd1
Feb 20 10:07:45 rgmanager [condor] Starting condor_schedd HASchedd1
Feb 20 10:07:56 rgmanager [condor] Failed to start condor_schedd HASchedd1
Feb 20 10:08:05 rgmanager [netfs] Checking fs "Job Queue for HASchedd1", Level 0
Feb 20 10:08:15 rgmanager status on condor "HASchedd1" returned 7 (unspecified)
Feb 20 10:08:15 rgmanager [condor] Stopping condor_schedd HASchedd1
Feb 20 10:08:15 rgmanager [condor] Starting condor_schedd HASchedd1
Feb 20 10:08:25 rgmanager [netfs] Checking fs "Job Queue for HASchedd2", Level 0
Feb 20 10:08:25 rgmanager [netfs] Checking fs "Job Queue for HASchedd3", Level 0
Feb 20 10:08:26 rgmanager [condor] Failed to start condor_schedd HASchedd1

>>> VERIFIED

Comment 7 errata-xmlrpc 2013-03-06 18:44:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html