Bug 704104

Summary: condor cannot be stopped gracefully with HAScheduler set
Product: Red Hat Enterprise MRG
Reporter: Lubos Trilety <ltrilety>
Component: condor
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: medium
Docs Contact:
Priority: medium
Version: Development
CC: iboverma, jneedle, ltoscano, matt
Target Milestone: 2.0   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-7.6.1-0.6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-06-27 14:32:57 UTC
Type: ---
Attachments:
Description Flags
MasterLog none

Description Lubos Trilety 2011-05-12 07:45:59 UTC
Created attachment 498455
MasterLog

Description of problem:
I configured 4 machines with the HAScheduler and HACentralManager features, then tried to stop the condor service. All daemons stop correctly except condor_master, which never exits.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. set HACentralManager and HAScheduler on all machines
# ccp --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: HAScheduler
  1: HACentralManager
  2: NodeAccess
  3: Master
  4: ExecuteNode
Parameters:
  HA_LOCK_URL = file:/exports/virt
  TRANSFER_EXECUTABLE = False
  REPLICATION_LIST = host1:$(REPLICATION_PORT),host2:$(REPLICATION_PORT),host3:$(REPLICATION_PORT),host4:$(REPLICATION_PORT)
  ALLOW_READ = *
  HAD_LIST = host1:$(HAD_PORT),host2:$(HAD_PORT),host3:$(HAD_PORT),host4:$(HAD_PORT)
  SPOOL = /exports/virt
  START = True
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = host1,host2,host3,host4

2. try to stop condor
# service condor stop
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail

3. wait a while, then check whether condor_master is still running
# ps -eaf | grep condor | grep -v grep
condor   26531     1  0 09:10 ?        00:00:00 condor_master -f -pidfile /var/run/condor/condor_master.pid
  
Actual results:
condor_master never stops; it has to be killed

Expected results:
condor stops correctly

Additional info:
see attachment for MasterLog
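
Step 3 above ("wait a while, then check") can be automated. The following is only a rough sketch, not part of condor or its init script: it polls a PID until the process is gone or a timeout expires. The helper names are hypothetical, the pidfile path is the one shown in the ps output above, and it is an assumption that the pidfile holds the PID as a plain decimal number.

```python
import os
import time


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout=60.0, poll_interval=1.0, is_running=pid_is_running):
    """Poll until the process is gone; return False if it outlives the timeout."""
    deadline = time.monotonic() + timeout
    while is_running(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def read_pidfile(path="/var/run/condor/condor_master.pid"):
    """Read a PID from a pidfile (assumed to contain a single decimal number)."""
    with open(path) as f:
        return int(f.read().strip())
```

To use it, read the PID with read_pidfile() before running "service condor stop", then call wait_for_exit(pid) afterwards. A False result corresponds to the hang reported here.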

Comment 2 Robert Rati 2011-05-20 13:21:35 UTC
I can reproduce with a single node with:

condor_configure_pool --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: Master
  1: NodeAccess
  2: HAScheduler
Parameters:
  HA_LOCK_URL = file:///var/lib/condor/spool
  ALLOW_READ = *
  SPOOL = /var/lib/condor/spool
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = $(IP_ADDRESS)

Comment 3 Robert Rati 2011-05-20 17:48:55 UTC
The master iterates through the daemon list looking for daemons that need to be shut down just before the master itself. If a daemon such as the schedd was running in an HA setup, the master would erroneously conclude that there were still daemons to shut down, even though the daemon had already exited. As a result, the master waited forever for a daemon that had already exited.

Fixed on BZ704104-master-haschedd-hang
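
For illustration only, the failure mode described in this comment can be modelled roughly as below. This is not the condor_master source. The Daemon class, its fields, and both functions are hypothetical, and the exact condition the real code got wrong is not stated in this report; the sketch simply assumes the buggy check counted an HA-managed daemon as pending regardless of whether its process was still alive.

```python
from dataclasses import dataclass


@dataclass
class Daemon:
    name: str
    is_ha: bool = False  # hypothetical flag: daemon is managed under an HA lock
    pid: int = 0         # hypothetical convention: 0 means no running process


def pending_before_master_buggy(daemons):
    """Assumed buggy check: an HA daemon counts as pending even after it exited."""
    return [d.name for d in daemons if d.is_ha or d.pid != 0]


def pending_before_master_fixed(daemons):
    """Assumed fixed check: only daemons with a live process count as pending."""
    return [d.name for d in daemons if d.pid != 0]


# Every daemon has already exited, but the schedd is HA-managed.
daemons = [Daemon("SCHEDD", is_ha=True, pid=0), Daemon("COLLECTOR", pid=0)]

buggy = pending_before_master_buggy(daemons)
fixed = pending_before_master_fixed(daemons)
```

In this model the buggy check still reports the schedd as pending, so a master that waits for the list to become empty never exits. With the fixed check the list is empty and the master can shut down.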

Comment 4 Lubos Trilety 2011-05-25 13:04:17 UTC
Tested on:
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ccp -l --default-group
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: HAScheduler
  1: Master
  2: NodeAccess
Parameters:
  HA_LOCK_URL = file:///var/lib/condor/spool
  ALLOW_READ = *
  SPOOL = /var/lib/condor/spool
  DO_NOTHING = n
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = $(IP_ADDRESS)

# service condor stop
Stopping Condor daemons:                                   [  OK  ]
#

# ps -eaf | grep condor | grep -v grep
#

>>> VERIFIED
root      5098  9015  0 15:02 pts/0    00:00:00 grep condor