Description of problem:

The condor_schedd hung and then exited. The condor_master tried to kill the condor_schedd but apparently could not. The result is the master lost track of the schedd and never gave up the high availability schedd lock. The condor_master also never exited itself, presumably because it thought the schedd was still running. Somehow the schedd's reaper was not triggered in the master? The master does not notice errors about non-existent pids it is tracking.

Version-Release number of selected component (if applicable):

# rpm -q qpidc qmf condor-qmf-plugins redhat-release
qpidc-0.5.752581-3.el5
qmf-0.5.752581-3.el5
condor-qmf-plugins-7.2.2-0.9.el5
redhat-release-5Server-5.3.0.3

How reproducible:

Unknown

Steps to Reproduce:
1. Run HA Schedd
2. Submit 200K jobs
3. Remove 200K jobs
*. condor_reconfig
*. pkill -TERM condor_master

Additional info:

From MasterLog, shows attempt to kill schedd, then errors when the schedd cannot be found:

4/12 11:07:44 ERROR: Child pid 5482 appears hung! Killing it hard.
4/12 12:33:32 Reconfiguring all running daemons.
4/12 12:33:32 Send_Signal error: kill(5482,1) failed: errno=3 No such process
4/12 12:33:32 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:33:32 Send_Signal: ERROR sending signal 1 (UPDATE_SCHEDD_AD) to pid 5482 (no longer exists)
4/12 12:33:32 ERROR: failed to send SIGHUP to pid 5482
4/12 12:33:32 Sent SIGHUP to STARTD (pid 3849)
4/12 12:36:08 Got SIGQUIT. Performing fast shutdown.
4/12 12:36:08 Send_Signal error: kill(5482,3) failed: errno=3 No such process
4/12 12:36:08 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/12 12:36:08 Send_Signal: ERROR sending signal 3 (SIGQUIT) to pid 5482 (no longer exists)
4/12 12:36:08 ERROR: failed to send SIGQUIT to pid 5482
4/12 12:36:08 Sent SIGQUIT to STARTD (pid 3849)
4/12 12:36:08 The STARTD (pid 3849) exited with status 0
4/12 12:41:08 Timeout for fast shutdown has expired for SCHEDD.
4/12 12:41:08 Sent SIGKILL to SCHEDD (pid 5482) and all its children.
4/13 09:10:35 Got SIGTERM. Performing graceful shutdown.
4/13 09:10:35 Send_Signal error: kill(5482,15) failed: errno=3 No such process
4/13 09:10:35 attempt to connect to <10.16.44.233:46545> failed: Connection refused (connect errno = 111).
4/13 09:10:35 Send_Signal: ERROR sending signal 15 (SIGTERM) to pid 5482 (no longer exists)
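For anyone reading the log: errno=3 is ESRCH ("No such process"), which kill() returns when the target pid does not exist. The sketch below illustrates, in Python rather than in the master's own code, how a supervisor could treat that error as proof that a tracked child is gone and drop it from its table. The names pid_exists and prune_missing are hypothetical and not part of Condor; this is only an illustration of the check, not the actual or proposed fix.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        # Signal 0 performs the existence and permission checks only;
        # no signal is actually delivered to the target.
        os.kill(pid, 0)
    except ProcessLookupError:
        # errno 3 (ESRCH): the same "No such process" seen in the MasterLog.
        return False
    except PermissionError:
        # errno 1 (EPERM): the process exists but belongs to another user.
        return True
    return True


def prune_missing(tracked: dict) -> list:
    """Remove entries whose pid no longer exists; return the names removed.

    `tracked` is a hypothetical {daemon_name: pid} table standing in for
    whatever bookkeeping a supervisor keeps about its children.
    """
    gone = [name for name, pid in tracked.items() if not pid_exists(pid)]
    for name in gone:
        del tracked[name]
    return gone
```

A supervisor that ran a check like this whenever a signal failed with ESRCH would stop waiting on the missing pid. Note that a pid can be reused by an unrelated process, so a real implementation would need more than this check alone.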
Unable to reproduce. Attempts were made, including while tracing the schedd, and everything works as expected. 7.4.2-0.3