Bug 495504 - condor_master loses track of high availability condor_schedd and hangs
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1.1
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: MRG Quality Engineering
Depends On:
Blocks:
Reported: 2009-04-13 10:54 EDT by Matthew Farrellee
Modified: 2010-01-14 21:45 EST
CC List: 0 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-01-14 21:45:42 EST

Attachments: None

Description Matthew Farrellee 2009-04-13 10:54:18 EDT
Description of problem:

The condor_schedd hung and then exited. The condor_master tried to kill the condor_schedd but apparently could not. As a result, the master lost track of the schedd and never gave up the high availability schedd lock. The condor_master also never exited itself, presumably because it thought the schedd was still running.

Somehow the schedd's reaper was not triggered in the master?
The master does not notice errors about non-existent pids it is tracking.
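
For illustration only, here is a minimal Python sketch of how a supervising process could notice that a pid it is tracking no longer exists. This is not the condor_master code; the names pid_exists and prune_missing and the tracking table are hypothetical. It relies on the POSIX behavior that sending signal 0 performs the existence check of kill(2) without delivering a signal, and that a missing process is reported as errno 3 (ESRCH, "No such process").

```python
import os


def pid_exists(pid):
    """Return True if a process with this pid currently exists (POSIX only).

    Signal 0 runs the existence and permission checks of kill(2)
    without actually delivering a signal.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # errno ESRCH (3): "No such process".
        return False
    except PermissionError:
        # errno EPERM: the process exists but is owned by another user.
        return True
    return True


def prune_missing(tracked):
    """Remove entries whose pid is gone from a {pid: name} table.

    Hypothetical helper: returns the names of the entries that were dropped,
    so the caller can release any resources held on their behalf.
    """
    gone = [pid for pid in tracked if not pid_exists(pid)]
    return [tracked.pop(pid) for pid in gone]
```

A supervisor that ran such a check after a failed signal send could treat the daemon as exited instead of continuing to wait on it. Note that a pid can be reused by an unrelated process, so a check like this is only a hint, not a substitute for reaping the child.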


Version-Release number of selected component (if applicable):

# rpm -q qpidc qmf condor-qmf-plugins redhat-release
qpidc-0.5.752581-3.el5
qmf-0.5.752581-3.el5
condor-qmf-plugins-7.2.2-0.9.el5
redhat-release-5Server-5.3.0.3


How reproducible:

Unknown


Steps to Reproduce:
1. Run HA Schedd
2. Submit 200K jobs
3. Remove 200K jobs
*. condor_reconfig
*. pkill -TERM condor_master


Additional info:

The MasterLog shows the attempt to kill the schedd, followed by errors when the schedd can no longer be found:

4/12 11:07:44 ERROR: Child pid 5482 appears hung! Killing it hard.
4/12 12:33:32 Reconfiguring all running daemons.
4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process
4/12 12:33:32 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:33:32 Send_Signal: ERROR sending signal 1 (UPDATE_SCHEDD_AD) to pid 5482 (no longer exists)
4/12 12:33:32 ERROR: failed to send SIGHUP to pid 5482
4/12 12:33:32 Sent SIGHUP to STARTD (pid 3849)
4/12 12:36:08 Got SIGQUIT.  Performing fast shutdown.
4/12 12:36:08 Send_Signal error: kill(5482,3) failed: errno=3 No such process
4/12 12:36:08 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:36:08 Send_Signal: ERROR sending signal 3 (SIGQUIT) to pid 5482 (no longer exists)
4/12 12:36:08 ERROR: failed to send SIGQUIT to pid 5482
4/12 12:36:08 Sent SIGQUIT to STARTD (pid 3849)
4/12 12:36:08 The STARTD (pid 3849) exited with status 0
4/12 12:41:08 Timeout for fast shutdown has expired for SCHEDD.
4/12 12:41:08 Sent SIGKILL to SCHEDD (pid 5482) and all its children.
4/13 09:10:35 Got SIGTERM. Performing graceful shutdown.
4/13 09:10:35 Send_Signal error: kill(5482,15) failed: errno=3 No such process
4/13 09:10:35 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/13 09:10:35 Send_Signal: ERROR sending signal 15 (SIGTERM) to pid 5482 (no longer exists)
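
When triaging a longer MasterLog, the failed signal sends in the excerpt above can be tallied per pid. The following is a rough Python sketch, assuming the "Send_Signal error: kill(pid,sig) failed: errno=N" line format seen in this log; the regular expression and the failed_signals function are illustrative and not part of Condor.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process
KILL_FAILED = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) Send_Signal error: "
    r"kill\((?P<pid>\d+),(?P<sig>\d+)\) failed: errno=(?P<errno>\d+)"
)


def failed_signals(lines):
    """Map each pid to the (timestamp, signal) sends that failed with errno 3."""
    failures = defaultdict(list)
    for line in lines:
        m = KILL_FAILED.match(line)
        if m and int(m["errno"]) == 3:  # 3 is ESRCH ("No such process") on Linux
            failures[int(m["pid"])].append((m["stamp"], int(m["sig"])))
    return dict(failures)


# Lines taken from the MasterLog excerpt in this report.
SAMPLE = [
    "4/12 11:07:44 ERROR: Child pid 5482 appears hung! Killing it hard.",
    "4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process",
    "4/12 12:36:08 Send_Signal error: kill(5482,3) failed: errno=3 No such process",
    "4/12 12:36:08 Sent SIGQUIT to STARTD (pid 3849)",
    "4/13 09:10:35 Send_Signal error: kill(5482,15) failed: errno=3 No such process",
]

if __name__ == "__main__":
    for pid, sends in failed_signals(SAMPLE).items():
        print(pid, sends)
```

On the sample lines this reports three failed sends to pid 5482 (signals 1, 3 and 15), spanning from 4/12 12:33:32 to 4/13 09:10:35, which matches the window in which the master kept treating the schedd as running.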
Comment 1 Matthew Farrellee 2010-01-14 21:45:42 EST
No reproduction possible. Attempts made even when tracing the schedd. All works as expected.

7.4.2-0.3
