Bug 495504

Summary: condor_master loses track of high availability condor_schedd and hangs
Product: Red Hat Enterprise MRG Reporter: Matthew Farrellee <matt>
Component: condor Assignee: Matthew Farrellee <matt>
Status: CLOSED INSUFFICIENT_DATA QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 1.1.1   
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-01-15 02:45:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2009-04-13 14:54:18 UTC
Description of problem:

The condor_schedd hung and then exited. The condor_master tried to kill the condor_schedd but apparently could not. As a result, the master lost track of the schedd and never released the high availability schedd lock. The condor_master also never exited, presumably because it believed the schedd was still running.

Was the schedd's reaper somehow never triggered in the master?
The master also does not react to errors indicating that a pid it is tracking no longer exists.
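
One way a supervising process could guard against this is sketched below in Python. It is not taken from Condor's sources. It assumes a POSIX system and treats an ESRCH ("No such process") result from a signal attempt as evidence that the child is gone, dropping it from the tracking table so that anything held on its behalf (such as an HA lock) can be released. The names `tracked`, `signal_child`, and `on_child_gone` are hypothetical.

```python
import os
import subprocess
import sys


def signal_child(tracked, pid, sig, on_child_gone):
    """Send sig to a tracked pid; if the pid no longer exists, stop tracking it.

    Returns True if the signal call succeeded, False if the child was gone.
    """
    try:
        os.kill(pid, sig)
        return True
    except ProcessLookupError:  # errno ESRCH: no such process
        name = tracked.pop(pid, None)
        if name is not None:
            # Hypothetical hook: release a lock, schedule a restart, etc.
            on_child_gone(name, pid)
        return False


if __name__ == "__main__":
    # Start a short-lived child and reap it, leaving a stale pid in the table.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    tracked = {child.pid: "SCHEDD"}
    gone = []
    # Signal 0 performs only the existence check and delivers nothing.
    delivered = signal_child(tracked, child.pid, 0,
                             lambda name, pid: gone.append((name, pid)))
    print(delivered, tracked, gone)
```

The point of the sketch is that a failed signal is handled as a state change rather than only logged, which is the behavior the log below suggests was missing.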


Version-Release number of selected component (if applicable):

# rpm -q qpidc qmf condor-qmf-plugins redhat-release
qpidc-0.5.752581-3.el5
qmf-0.5.752581-3.el5
condor-qmf-plugins-7.2.2-0.9.el5
redhat-release-5Server-5.3.0.3


How reproducible:

Unknown


Steps to Reproduce:
1. Run HA Schedd
2. Submit 200K jobs
3. Remove 200K jobs
*. condor_reconfig
*. pkill -TERM condor_master


Additional info:

The MasterLog shows the attempt to kill the schedd, followed by errors when the schedd cannot be found:

4/12 11:07:44 ERROR: Child pid 5482 appears hung! Killing it hard.
4/12 12:33:32 Reconfiguring all running daemons.
4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process
4/12 12:33:32 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:33:32 Send_Signal: ERROR sending signal 1 (UPDATE_SCHEDD_AD) to pid 5482 (no longer exists)
4/12 12:33:32 ERROR: failed to send SIGHUP to pid 5482
4/12 12:33:32 Sent SIGHUP to STARTD (pid 3849)
4/12 12:36:08 Got SIGQUIT.  Performing fast shutdown.
4/12 12:36:08 Send_Signal error: kill(5482,3) failed: errno=3 No such process
4/12 12:36:08 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:36:08 Send_Signal: ERROR sending signal 3 (SIGQUIT) to pid 5482 (no longer exists)
4/12 12:36:08 ERROR: failed to send SIGQUIT to pid 5482
4/12 12:36:08 Sent SIGQUIT to STARTD (pid 3849)
4/12 12:36:08 The STARTD (pid 3849) exited with status 0
4/12 12:41:08 Timeout for fast shutdown has expired for SCHEDD.
4/12 12:41:08 Sent SIGKILL to SCHEDD (pid 5482) and all its children.
4/13 09:10:35 Got SIGTERM. Performing graceful shutdown.
4/13 09:10:35 Send_Signal error: kill(5482,15) failed: errno=3 No such process
4/13 09:10:35 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/13 09:10:35 Send_Signal: ERROR sending signal 15 (SIGTERM) to pid 5482 (no longer exists)
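
The pattern in this log can be spotted mechanically. The following Python sketch scans MasterLog-style lines for `Send_Signal error: kill(pid,sig) failed: errno=3` entries and reports each pid that was signalled after it had already disappeared. The regular expression is inferred only from the lines quoted above, and the function name `stale_pids` is hypothetical; other Condor versions may format these messages differently.

```python
import re
from collections import defaultdict

# Pattern inferred from the MasterLog excerpt; an assumption, not a documented format.
KILL_FAILED = re.compile(
    r"Send_Signal error: kill\((\d+),(\d+)\) failed: errno=(\d+)"
)
ESRCH = 3  # "No such process" on Linux


def stale_pids(lines):
    """Map each pid to the list of signals sent after the process had vanished."""
    stale = defaultdict(list)
    for line in lines:
        match = KILL_FAILED.search(line)
        if match and int(match.group(3)) == ESRCH:
            stale[int(match.group(1))].append(int(match.group(2)))
    return dict(stale)


sample = [
    "4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process",
    "4/12 12:33:32 Sent SIGHUP to STARTD (pid 3849)",
    "4/12 12:36:08 Send_Signal error: kill(5482,3) failed: errno=3 No such process",
    "4/13 09:10:35 Send_Signal error: kill(5482,15) failed: errno=3 No such process",
]
print(stale_pids(sample))  # {5482: [1, 3, 15]}
```

Here pid 5482 receives SIGHUP, SIGQUIT, and SIGTERM attempts over roughly a day, all failing with errno 3, which matches the report that the master kept tracking a schedd that no longer existed.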

Comment 1 Matthew Farrellee 2010-01-15 02:45:42 UTC
Could not reproduce. Attempts were made, including while tracing the schedd, and everything worked as expected.

7.4.2-0.3