| Summary: | condor cannot be stopped gracefully with HAScheduler set | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Lubos Trilety <ltrilety> |
| Component: | condor | Assignee: | Robert Rati <rrati> |
| Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | Development | CC: | iboverma, jneedle, ltoscano, matt |
| Target Milestone: | 2.0 | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | condor-7.6.1-0.6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-06-27 14:32:57 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

I can reproduce with a single node with:

condor_configure_pool --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
0: Master
1: NodeAccess
2: HAScheduler
Parameters:
HA_LOCK_URL = file:///var/lib/condor/spool
ALLOW_READ = *
SPOOL = /var/lib/condor/spool
SCHEDD_NAME = ha-schedd@
ALLOW_WRITE = *
CONDOR_HOST = $(IP_ADDRESS)

The master iterates through the daemon list looking for daemons that need to be shut down just before the master itself. If a daemon is running in an HA setup, such as the schedd, the master would erroneously conclude that there were still daemons to shut down, even though the daemon had already exited. As a result, the master would wait forever for a daemon that had already exited.

Fixed on BZ704104-master-haschedd-hang

Tested on:
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: I686-RedHat_5.6 $
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: I686-RedHat_6.1 $
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $
# ccp -l --default-group
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
0: HAScheduler
1: Master
2: NodeAccess
Parameters:
HA_LOCK_URL = file:///var/lib/condor/spool
ALLOW_READ = *
SPOOL = /var/lib/condor/spool
DO_NOTHING = n
SCHEDD_NAME = ha-schedd@
ALLOW_WRITE = *
CONDOR_HOST = $(IP_ADDRESS)
# service condor stop
Stopping Condor daemons: [ OK ]
#
# ps -eaf | grep condor | grep -v grep
#
>>> VERIFIED
root 5098 9015 0 15:02 pts/0 00:00:00 grep condor
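
The root cause described in the comments above (the master counting an already-exited HA daemon as still pending) can be illustrated with a small model. This is only a sketch under stated assumptions, not the condor_master source: the `Daemon` class, its fields, and both `pending_*` functions are hypothetical, and the report does not give the exact condition the real code used.

```python
from dataclasses import dataclass


@dataclass
class Daemon:
    # All fields are hypothetical; they only model the behaviour in the report.
    name: str
    running: bool                         # process is currently alive
    ha_managed: bool = False              # runs under an HA lock (e.g. the schedd)
    shutdown_before_master: bool = True   # must be gone before the master exits


def pending_buggy(daemons):
    """Assumed faulty check: an HA daemon is counted even after it has exited."""
    return [d.name for d in daemons
            if d.shutdown_before_master and (d.running or d.ha_managed)]


def pending_fixed(daemons):
    """Assumed corrected check: only daemons that are still alive hold up the master."""
    return [d.name for d in daemons
            if d.shutdown_before_master and d.running]


# State after "service condor stop": every daemon has already exited.
daemons = [
    Daemon("SCHEDD", running=False, ha_managed=True),
    Daemon("STARTD", running=False),
]

print(pending_buggy(daemons))  # ['SCHEDD'] -> the master would keep waiting
print(pending_fixed(daemons))  # []         -> the master is free to exit
```

With the faulty check the pending list never becomes empty, which matches the observed symptom of condor_master never exiting.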
Created attachment 498455
MasterLog

Description of problem:
I set 4 machines as HAScheduler and HACentralManager, then tried to stop the condor service. All daemons stop correctly except condor_master, which never stops.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. set HACentralManager and HAScheduler on all machines
# ccp --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
0: HAScheduler
1: HACentralManager
2: NodeAccess
3: Master
4: ExecuteNode
Parameters:
HA_LOCK_URL = file:/exports/virt
TRANSFER_EXECUTABLE = False
REPLICATION_LIST = host1:$(REPLICATION_PORT),host2:$(REPLICATION_PORT),host3:$(REPLICATION_PORT),host4:$(REPLICATION_PORT)
ALLOW_READ = *
HAD_LIST = host1:$(HAD_PORT),host2:$(HAD_PORT),host3:$(HAD_PORT),host4:$(HAD_PORT)
SPOOL = /exports/virt
START = True
SCHEDD_NAME = ha-schedd@
ALLOW_WRITE = *
CONDOR_HOST = host1,host2,host3,host4

2. try to stop condor
# service condor stop
Stopping Condor daemons: [ OK ]
Warning: condor_master may not have exited, start/restart may fail

3. wait a while, then check whether condor_master is still running
# ps -eaf | grep condor | grep -v grep
condor 26531 1 0 09:10 ? 00:00:00 condor_master -f -pidfile /var/run/condor/condor_master.pid

Actual results:
condor_master never stops; it has to be killed

Expected results:
condor stops correctly

Additional info:
see attachment for MasterLog
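
The "wait a while and check" step in the reproduction above could be automated with a small polling helper. The following is a minimal sketch for POSIX systems, not part of Condor or its test suite: the function names, the 60-second default timeout, and the 1-second interval are all assumptions chosen for illustration.

```python
import os
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True


def wait_for_exit(pid: int, timeout: float = 60.0, interval: float = 1.0,
                  alive=pid_alive) -> bool:
    """Poll until the process is gone; return True if it exited within the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not alive(pid):
            return True
        time.sleep(interval)
    return not alive(pid)  # one final check once the deadline has passed
```

One way to use it: read the PID from `/var/run/condor/condor_master.pid` (the pidfile path shown in the `ps` output above) before running `service condor stop`, then call `wait_for_exit(pid)`. A `False` result corresponds to the hang reported here.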