Bug 495504 - condor_master loses track of high availability condor_schedd and hangs
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1.1
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: MRG Quality Engineering
Reported: 2009-04-13 14:54 UTC by Matthew Farrellee
Modified: 2010-01-15 02:45 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-01-15 02:45:42 UTC

Description Matthew Farrellee 2009-04-13 14:54:18 UTC
Description of problem:

The condor_schedd hung and then exited. The condor_master tried to kill the condor_schedd but apparently could not. As a result, the master lost track of the schedd and never released the high availability schedd lock. The condor_master also never exited, presumably because it believed the schedd was still running.

Somehow the schedd's reaper was not triggered in the master?
The master logs "No such process" errors for pids it is tracking but does not act on them.
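
For illustration only, a minimal Python sketch of the behavior one might expect here: when kill() fails with ESRCH (errno 3) for a tracked pid, the entry is dropped instead of being kept forever. This is not condor_master code; the function name signal_tracked and the dict of daemon names to pids are hypothetical.

```python
import errno
import os


def signal_tracked(tracked, sig):
    """Send sig to every tracked pid.

    tracked is a hypothetical dict mapping a daemon name to its pid.
    Entries whose pid no longer exists (kill() fails with ESRCH) are
    removed from tracked, and their names are returned in a list.
    """
    lost = []
    for name, pid in list(tracked.items()):
        try:
            os.kill(pid, sig)
        except OSError as exc:
            if exc.errno != errno.ESRCH:
                raise  # e.g. EPERM: the process exists, so keep tracking it
            del tracked[name]
            lost.append(name)
    return lost
```

With something like this, the failed SIGHUP at 12:33:32 would already have told the master that pid 5482 was gone, well before the shutdown timeout.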


Version-Release number of selected component (if applicable):

# rpm -q qpidc qmf condor-qmf-plugins redhat-release
qpidc-0.5.752581-3.el5
qmf-0.5.752581-3.el5
condor-qmf-plugins-7.2.2-0.9.el5
redhat-release-5Server-5.3.0.3


How reproducible:

Unknown


Steps to Reproduce:
1. Run HA Schedd
2. Submit 200K jobs
3. Remove 200K jobs
*. condor_reconfig
*. pkill -TERM condor_master


Additional info:

The MasterLog shows the attempt to kill the schedd, followed by errors when the schedd can no longer be found:

4/12 11:07:44 ERROR: Child pid 5482 appears hung! Killing it hard.
4/12 12:33:32 Reconfiguring all running daemons.
4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process
4/12 12:33:32 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:33:32 Send_Signal: ERROR sending signal 1 (UPDATE_SCHEDD_AD) to pid 5482 (no longer exists)
4/12 12:33:32 ERROR: failed to send SIGHUP to pid 5482
4/12 12:33:32 Sent SIGHUP to STARTD (pid 3849)
4/12 12:36:08 Got SIGQUIT.  Performing fast shutdown.
4/12 12:36:08 Send_Signal error: kill(5482,3) failed: errno=3 No such process
4/12 12:36:08 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:36:08 Send_Signal: ERROR sending signal 3 (SIGQUIT) to pid 5482 (no longer exists)
4/12 12:36:08 ERROR: failed to send SIGQUIT to pid 5482
4/12 12:36:08 Sent SIGQUIT to STARTD (pid 3849)
4/12 12:36:08 The STARTD (pid 3849) exited with status 0
4/12 12:41:08 Timeout for fast shutdown has expired for SCHEDD.
4/12 12:41:08 Sent SIGKILL to SCHEDD (pid 5482) and all its children.
4/13 09:10:35 Got SIGTERM. Performing graceful shutdown.
4/13 09:10:35 Send_Signal error: kill(5482,15) failed: errno=3 No such process
4/13 09:10:35 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/13 09:10:35 Send_Signal: ERROR sending signal 15 (SIGTERM) to pid 5482 (no longer exists)
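
To check other MasterLogs for the same symptom, a small Python sketch along these lines may help. It only assumes the "Send_Signal error: kill(PID,SIG) failed: errno=3" line format shown in the excerpt above; the function name missing_pids is made up for this example.

```python
import re

# Matches MasterLog lines such as:
#   4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process
ESRCH_LINE = re.compile(r"Send_Signal error: kill\((\d+),(\d+)\) failed: errno=3\b")


def missing_pids(lines):
    """Return {pid: [signal numbers]} for every kill() that failed with errno 3."""
    result = {}
    for line in lines:
        match = ESRCH_LINE.search(line)
        if match:
            pid, sig = int(match.group(1)), int(match.group(2))
            result.setdefault(pid, []).append(sig)
    return result
```

A pid that shows up with several signals over a long time span, as 5482 does here, is one the master kept tracking after the process had gone away.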

Comment 1 Matthew Farrellee 2010-01-15 02:45:42 UTC
No reproduction possible. Attempts made even when tracing the schedd. All works as expected.

7.4.2-0.3

