Red Hat Bugzilla – Bug 864637
'condor_restart -subsystem had' causes had and negotiator to shut down
Last modified: 2013-01-14 14:40:50 EST
Description of problem:
Sending a condor_restart to the had subsystem in an HACM setup seems to cause the had, replication, and negotiator daemons to shut down and not get restarted.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up HACM with wallaby
2. On the node running the negotiator, execute: 'condor_restart -subsystem had'
3. Notice that the had, replication, and negotiator daemons are stopped and not restarted
The replication daemon actually isn't going down, just the had and negotiator.
I tested this on condor-7.8.8-0.1; it still takes about 6 minutes until the negotiator and HAD start again.
# cat NegotiatorLog
01/09/13 10:38:37 **** condor_negotiator (condor_NEGOTIATOR) pid 24179 EXITING WITH STATUS 0
01/09/13 10:44:54 OpSysMajorVersion: 6
# cat HADLog
01/09/13 10:38:33 **** condor_had (condor_HAD) pid 24092 EXITING WITH STATUS 0
01/09/13 10:44:34 OpSysMajorVersion: 6
The original issue was that the negotiator and had went down and never came back up. Since the had and negotiator daemons are now restarting, things appear to be working as expected. I suspect the had takes ~6 minutes to restart because MASTER_HAD_BACKOFF_CONSTANT = 360 (6 minutes) in the HACentralManager configuration.
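As a rough sanity check of that suspicion, the gap between each daemon's exit line and its next log line can be compared against the 360-second backoff. This is only a sketch: the timestamps are copied from the log excerpts above, treating the first "OpSysMajorVersion" line as the restart time is an assumption, and the function and dictionary names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the HADLog and NegotiatorLog excerpts in this report.
# Assumption: the first line logged after the exit marks the daemon restart.
LOG_EVENTS = {
    "HAD": ("01/09/13 10:38:33", "01/09/13 10:44:34"),
    "NEGOTIATOR": ("01/09/13 10:38:37", "01/09/13 10:44:54"),
}

# Value quoted above for the HACentralManager configuration.
MASTER_HAD_BACKOFF_CONSTANT = 360

STAMP_FORMAT = "%m/%d/%y %H:%M:%S"


def restart_gap_seconds(exit_stamp, start_stamp):
    """Return the number of seconds between an exit line and the next log line."""
    exited = datetime.strptime(exit_stamp, STAMP_FORMAT)
    started = datetime.strptime(start_stamp, STAMP_FORMAT)
    return int((started - exited).total_seconds())


gaps = {name: restart_gap_seconds(*stamps) for name, stamps in LOG_EVENTS.items()}

for name, gap in gaps.items():
    extra = gap - MASTER_HAD_BACKOFF_CONSTANT
    print(f"{name}: down for {gap}s ({extra}s beyond the {MASTER_HAD_BACKOFF_CONSTANT}s backoff)")
```

Both gaps come out only slightly above 360 seconds, which is consistent with the backoff explanation but does not prove it.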
Unable to reproduce the original issue.