Created attachment 498455 [details]
MasterLog

Description of problem:
I set 4 machines as HAScheduler and HACentralManager, then tried to stop the condor service. All daemons stop correctly except condor_master, which never stops.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. Set HACentralManager and HAScheduler on all machines
# ccp --default-group -l
Group "Internal Default Group":
  Group ID: 1
  Name: Internal Default Group
  Features (priority: name):
    0: HAScheduler
    1: HACentralManager
    2: NodeAccess
    3: Master
    4: ExecuteNode
  Parameters:
    HA_LOCK_URL = file:/exports/virt
    TRANSFER_EXECUTABLE = False
    REPLICATION_LIST = host1:$(REPLICATION_PORT),host2:$(REPLICATION_PORT),host3:$(REPLICATION_PORT),host4:$(REPLICATION_PORT)
    ALLOW_READ = *
    HAD_LIST = host1:$(HAD_PORT),host2:$(HAD_PORT),host3:$(HAD_PORT),host4:$(HAD_PORT)
    SPOOL = /exports/virt
    START = True
    SCHEDD_NAME = ha-schedd@
    ALLOW_WRITE = *
    CONDOR_HOST = host1,host2,host3,host4

2. Try to stop condor
# service condor stop
Stopping Condor daemons: [ OK ]
Warning: condor_master may not have exited, start/restart may fail

3. Wait a while, then check whether condor_master is still running
# ps -eaf | grep condor | grep -v grep
condor 26531 1 0 09:10 ? 00:00:00 condor_master -f -pidfile /var/run/condor/condor_master.pid

Actual results:
condor_master never stops; it has to be killed.

Expected results:
condor stops correctly.

Additional info:
See the attachment for MasterLog.
I can reproduce with a single node with:

condor_configure_pool --default-group -l
Group "Internal Default Group":
  Group ID: 1
  Name: Internal Default Group
  Features (priority: name):
    0: Master
    1: NodeAccess
    2: HAScheduler
  Parameters:
    HA_LOCK_URL = file:///var/lib/condor/spool
    ALLOW_READ = *
    SPOOL = /var/lib/condor/spool
    SCHEDD_NAME = ha-schedd@
    ALLOW_WRITE = *
    CONDOR_HOST = $(IP_ADDRESS)
The master iterates through the daemon list looking for daemons that need to be shut down just before the master. If a daemon is running in an HA setup, such as the schedd, the master would erroneously conclude that there were still daemons to shut down even though that daemon had already exited. As a result, the master waited forever for a daemon that had already exited.

Fixed on BZ704104-master-haschedd-hang
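The failure mode described above can be illustrated with a toy model. This is only a sketch and not HTCondor source code: the `Daemon` class, its fields, and both `pending_*` functions are hypothetical names invented here, and the exact condition the real master evaluated is an assumption. It shows how counting an HA-managed daemon as still pending after it has exited keeps the master waiting indefinitely.

```python
from dataclasses import dataclass


@dataclass
class Daemon:
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    running: bool                     # True while the daemon process is alive
    is_ha: bool = False               # True if managed under an HA lock (e.g. the schedd)
    stop_before_master: bool = True   # must be shut down before the master exits


def pending_buggy(daemons):
    """Assumed pre-fix behaviour: an HA daemon counts as pending even after it exited."""
    return [d.name for d in daemons
            if d.stop_before_master and (d.running or d.is_ha)]


def pending_fixed(daemons):
    """Assumed post-fix behaviour: only daemons that are actually running count."""
    return [d.name for d in daemons
            if d.stop_before_master and d.running]


def master_can_exit(daemons, pending):
    """The master may exit only when nothing is left to wait for."""
    return not pending(daemons)


if __name__ == "__main__":
    # State after "service condor stop": every daemon has already exited.
    daemons = [
        Daemon("schedd", running=False, is_ha=True),
        Daemon("startd", running=False),
    ]
    print("buggy pending:", pending_buggy(daemons))
    print("fixed pending:", pending_fixed(daemons))
```

In the buggy variant the exited HA schedd stays in the pending list, so `master_can_exit` never becomes true, which matches the symptom that condor_master has to be killed.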
Tested on:
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ccp -l --default-group
Group "Internal Default Group":
  Group ID: 1
  Name: Internal Default Group
  Features (priority: name):
    0: HAScheduler
    1: Master
    2: NodeAccess
  Parameters:
    HA_LOCK_URL = file:///var/lib/condor/spool
    ALLOW_READ = *
    SPOOL = /var/lib/condor/spool
    DO_NOTHING = n
    SCHEDD_NAME = ha-schedd@
    ALLOW_WRITE = *
    CONDOR_HOST = $(IP_ADDRESS)

# service condor stop
Stopping Condor daemons: [ OK ]
#
# ps -eaf | grep condor | grep -v grep
#

>>> VERIFIED

root 5098 9015 0 15:02 pts/0 00:00:00 grep condor