Bug 704104

Summary:

condor cannot be stopped gracefully with HAScheduler set

Product:

Red Hat Enterprise MRG

Reporter:

Lubos Trilety <ltrilety>

Component:

condor

Assignee:

Robert Rati <rrati>

Status:

CLOSED ERRATA

QA Contact:

Lubos Trilety <ltrilety>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

Development

CC:

iboverma, jneedle, ltoscano, matt

Target Milestone:

2.0

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

condor-7.6.1-0.6

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2011-06-27 14:32:57 UTC

Attachments:

Description	Flags
MasterLog	none

Description Lubos Trilety 2011-05-12 07:45:59 UTC

Created attachment 498455
MasterLog

Description of problem:
I configured 4 machines with the HAScheduler and HACentralManager features, then tried to stop the condor service. All daemons stopped correctly except condor_master, which never exits.

Version-Release number of selected component (if applicable):
condor-7.6.1-0.4

How reproducible:
100%

Steps to Reproduce:
1. set HACentralManager and HAScheduler on all machines
# ccp --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: HAScheduler
  1: HACentralManager
  2: NodeAccess
  3: Master
  4: ExecuteNode
Parameters:
  HA_LOCK_URL = file:/exports/virt
  TRANSFER_EXECUTABLE = False
  REPLICATION_LIST = host1:$(REPLICATION_PORT),host2:$(REPLICATION_PORT),host3:$(REPLICATION_PORT),host4:$(REPLICATION_PORT)
  ALLOW_READ = *
  HAD_LIST = host1:$(HAD_PORT),host2:$(HAD_PORT),host3:$(HAD_PORT),host4:$(HAD_PORT)
  SPOOL = /exports/virt
  START = True
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = host1,host2,host3,host4

2. try to stop condor
# service condor stop
Stopping Condor daemons:                                   [  OK  ]
Warning: condor_master may not have exited, start/restart may fail

3. wait a while, then check whether condor_master is still running
# ps -eaf | grep condor | grep -v grep
condor   26531     1  0 09:10 ?        00:00:00 condor_master -f -pidfile /var/run/condor/condor_master.pid
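
The manual check in step 3 could also be scripted. The following is only a sketch: `wait_for_pid_exit` is a hypothetical helper, not part of condor, and the 60-second timeout in the usage comment is an arbitrary assumption. The pidfile path is the one shown in the `ps` output above.

```shell
# Wait until the process with the given PID exits, or give up after
# TIMEOUT seconds. Returns 0 if the process is gone, 1 on timeout.
# Hypothetical helper for illustration; not shipped with condor.
wait_for_pid_exit() {
  pid=$1
  timeout=$2
  elapsed=0
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Usage sketch (run as root so that kill -0 can probe the condor user's process):
#   pid=$(cat /var/run/condor/condor_master.pid)
#   service condor stop
#   wait_for_pid_exit "$pid" 60 || echo "condor_master (pid $pid) is still running"
```

Reading the PID before stopping the service matters, because the pidfile may be removed during shutdown.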
  
Actual results:
condor_master never stops; it has to be killed.

Expected results:
condor stops correctly, including condor_master.

Additional info:
see attachment for MasterLog

Comment 2 Robert Rati 2011-05-20 13:21:35 UTC

I can reproduce this on a single node with the following configuration:

condor_configure_pool --default-group -l
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: Master
  1: NodeAccess
  2: HAScheduler
Parameters:
  HA_LOCK_URL = file:///var/lib/condor/spool
  ALLOW_READ = *
  SPOOL = /var/lib/condor/spool
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = $(IP_ADDRESS)

Comment 3 Robert Rati 2011-05-20 17:48:55 UTC

The master iterates through the daemon list looking for daemons that need to be shut down just before the master itself. If a daemon is running in an HA setup, such as the schedd, the master would erroneously conclude that there were still daemons to shut down, even though the daemon had already exited. As a result, the master would wait forever for a daemon that had already exited.

Fixed on BZ704104-master-haschedd-hang
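
For readers unfamiliar with the master, the faulty check can be modelled roughly as below. This is an illustrative sketch only and not Condor source code: the `name:is_ha:is_running` record layout and the `pending_buggy` / `pending_fixed` functions are assumptions made up for the example, and the real condition in the master may differ.

```shell
# Illustrative model only; this is not Condor source code.
# One daemon per line as "name:is_ha:is_running" (hypothetical fields).
# Both daemons have already exited; the schedd is under HA control.
DAEMONS="schedd:1:0
collector:0:0"

# Buggy check: an HA daemon still counts as pending after it has exited.
pending_buggy() {
  count=0
  while IFS=: read -r name ha running; do
    if [ "$ha" -eq 1 ] || [ "$running" -eq 1 ]; then
      count=$((count + 1))
    fi
  done <<EOF
$DAEMONS
EOF
  echo "$count"
}

# Fixed check: only daemons that are still running count as pending.
pending_fixed() {
  count=0
  while IFS=: read -r name ha running; do
    if [ "$running" -eq 1 ]; then
      count=$((count + 1))
    fi
  done <<EOF
$DAEMONS
EOF
  echo "$count"
}
```

In this model the buggy count never reaches zero for an exited HA daemon, which corresponds to the master waiting indefinitely, while the fixed count drops to zero and lets the master exit.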

Comment 4 Lubos Trilety 2011-05-25 13:04:17 UTC

Tested on:
$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: I686-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el5 $
$CondorPlatform: X86_64-RedHat_5.6 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.1 May 23 2011 BuildID: RH-7.6.1-0.6.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

# ccp -l --default-group
Group "Internal Default Group":
Group ID: 1
Name: Internal Default Group
Features (priority: name):
  0: HAScheduler
  1: Master
  2: NodeAccess
Parameters:
  HA_LOCK_URL = file:///var/lib/condor/spool
  ALLOW_READ = *
  SPOOL = /var/lib/condor/spool
  DO_NOTHING = n
  SCHEDD_NAME = ha-schedd@
  ALLOW_WRITE = *
  CONDOR_HOST = $(IP_ADDRESS)

# service condor stop
Stopping Condor daemons:                                   [  OK  ]
#

# ps -eaf | grep condor | grep -v grep
#

>>> VERIFIED
root      5098  9015  0 15:02 pts/0    00:00:00 grep condor