Bug 704104 - condor cannot be stopped gracefully with HAScheduler set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: 2.0
Assignee: Robert Rati
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-05-12 07:45 UTC by Lubos Trilety
Modified: 2011-06-27 14:32 UTC

Fixed In Version: condor-7.6.1-0.6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-27 14:32:57 UTC
Target Upstream Version:


Attachments
MasterLog (36.29 KB, text/plain)
2011-05-12 07:45 UTC, Lubos Trilety

Description Lubos Trilety 2011-05-12 07:45:59 UTC
Created attachment 498455
MasterLog

Description of problem:
I configured 4 machines with the HAScheduler and HACentralManager features, then tried to stop the condor service. All daemons stop correctly except condor_master, which never exits.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. set HACentralManager and HAScheduler on all machines
# ccp --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: HAScheduler
  1: HACentralManager
  2: NodeAccess
  3: Master
  4: ExecuteNode
Parameters:
  HA_LOCK_URL = file:/exports/virt
  TRANSFER_EXECUTABLE = False
  REPLICATION_LIST = host1:$(REPLICATION_PORT),host2:$(REPLICATION_PORT),host3:$(REPLICATION_PORT),host4:$(REPLICATION_PORT)
  ALLOW_READ = *
  HAD_LIST = host1:$(HAD_PORT),host2:$(HAD_PORT),host3:$(HAD_PORT),host4:$(HAD_PORT)
  SPOOL = /exports/virt
  START = True
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = host1,host2,host3,host4

2. try to stop condor
# service condor stop
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail

3. wait a while, then check whether condor_master is still running
# ps -eaf | grep condor | grep -v grep
condor   26531     1  0 09:10 ?        00:00:00 condor_master -f -pidfile /var/run/condor/condor_master.pid
  
Actual results:
condor_master never stops; it has to be killed

Expected results:
condor stops correctly, including condor_master

Additional info:
see attachment for MasterLog
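
The manual check in step 3 can be automated. The following is only a minimal sketch, assuming a POSIX system; the helper names pid_alive and wait_for_exit, the timeout, and the polling interval are illustrative and are not part of condor. It polls a PID until the process is gone or a timeout expires.

```python
import os
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until the process exits; return False if it is still alive at timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(interval)
    return not pid_alive(pid)
```

To use it for this bug, read the PID from /var/run/condor/condor_master.pid before running "service condor stop", then call wait_for_exit with that PID. A False result corresponds to the hang reported here.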

Comment 2 Robert Rati 2011-05-20 13:21:35 UTC
I can reproduce with a single node with:

condor_configure_pool --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: Master
  1: NodeAccess
  2: HAScheduler
Parameters:
  HA_LOCK_URL = file:///var/lib/condor/spool
  ALLOW_READ = *
  SPOOL = /var/lib/condor/spool
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = $(IP_ADDRESS)

Comment 3 Robert Rati 2011-05-20 17:48:55 UTC
The master iterates through the daemon list looking for daemons that need to be shut down just before the master. If a daemon was running in an HA setup, such as the schedd, the master would erroneously conclude that there were still daemons to shut down, even though the daemon had already exited. As a result, the master waited forever for a daemon that had already exited.

Fixed on BZ704104-master-haschedd-hang
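
The failure mode described in this comment can be illustrated with a small model. This is not the condor_master source and it does not claim to show the actual patch; the Daemon class, its fields, and both functions are hypothetical names chosen only to show how an already-exited HA daemon can keep a "still waiting" list from ever becoming empty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Daemon:
    name: str
    pid: Optional[int]   # None means no process is running for this daemon
    is_ha: bool = False  # hypothetical flag: daemon is under HA control


def pending_buggy(daemons: List[Daemon]) -> List[str]:
    """Assumed faulty check: an HA daemon always counts as pending, even after exit."""
    return [d.name for d in daemons if d.pid is not None or d.is_ha]


def pending_fixed(daemons: List[Daemon]) -> List[str]:
    """Only daemons that still have a live process count as pending."""
    return [d.name for d in daemons if d.pid is not None]


# State after every daemon has exited during shutdown.
daemons = [
    Daemon("schedd", pid=None, is_ha=True),
    Daemon("collector", pid=None),
]

print(pending_buggy(daemons))  # the exited HA schedd is still listed
print(pending_fixed(daemons))  # empty, so the master is free to exit
```

In this model the master may exit only when the pending list is empty, so the first check never lets it exit while the second does.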

Comment 4 Lubos Trilety 2011-05-25 13:04:17 UTC
Tested on:
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ccp -l --default-group
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: HAScheduler
  1: Master
  2: NodeAccess
Parameters:
  HA_LOCK_URL = file:///var/lib/condor/spool
  ALLOW_READ = *
  SPOOL = /var/lib/condor/spool
  DO_NOTHING = n
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = $(IP_ADDRESS)

# service condor stop
Stopping Condor daemons:                                   [  OK  ]
#

# ps -eaf | grep condor | grep -v grep
#

>>> VERIFIED
root      5098  9015  0 15:02 pts/0    00:00:00 grep condor

