Bug 844971

Summary: Central Manager High Availability - the negotiator doesn't run on any node
Product: Red Hat Enterprise MRG Reporter: Lubos Trilety <ltrilety>
Component: condor Assignee: Robert Rati <rrati>
Status: CLOSED WORKSFORME QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high Docs Contact:
Priority: medium    
Version: 2.2 CC: iboverma, matt, mkudlej, rrati, tstclair
Target Milestone: 2.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-10-10 15:00:23 UTC Type: Bug
Attachments: condor logs (Flags: none)

Description Lubos Trilety 2012-08-01 11:37:07 UTC
Created attachment 601716
condor logs

Description of problem:
If high availability of the central manager was configured using wallaby and the negotiator was stopped with the condor_off command after it had started, the HAD did not stop. This led to a situation in which no negotiator was running in the pool.

Version-Release number of selected component (if applicable):
condor-7.6.5-0.19.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Configure pool
$ wallaby add-group allnodes
Adding the following group: allnodes
Console Connection Established...

# Repeat the following command for all machines
$ wallaby add-node-memberships <node> allnodes
Console Connection Established...

$ wallaby add-params-to-group allnodes ALLOW_WRITE=* ALLOW_READ=* ALL_DEBUG=D_FULLDEBUG REPLICATION_LIST='<node1>:$(REPLICATION_PORT),<node2>:$(REPLICATION_PORT)' HAD_LIST='<node1>:$(HAD_PORT),<node2>:$(HAD_PORT)' HAD_UPDATE_INTERVAL=30 REPLICATION_INTERVAL=30 CONDOR_HOST='<node1>, <node2>'
Console Connection Established...

$ wallaby add-features-to-group allnodes NodeAccess Master HACentralManager
Console Connection Established...

$ wallaby activate
Console Connection Established...

2. Wait until the negotiator is running, then run "condor_off -subsystem negotiator" (optionally with -fast)

3. Check the condor daemons
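
The daemon check in step 3 could be scripted roughly as follows. This is only a sketch: the function name ha_daemon_state is made up for illustration, and it assumes the daemons show up in the process list as condor_had and condor_negotiator, optionally with a leading path. It matches on the full command line because "ps -o comm" truncates names to 15 characters on Linux, which would cut off condor_negotiator.

```shell
#!/bin/sh
# Hypothetical helper (not part of condor): reads process command lines on
# stdin, one per line (for example from "ps -e -o args="), and reports
# whether the HAD and negotiator daemons appear among them.
ha_daemon_state() {
    procs=$(cat)
    had=no
    negotiator=no
    # Match the daemon name as the basename of the first word on a line,
    # so that a line such as "grep condor_had" is not counted.
    if printf '%s\n' "$procs" | grep -Eq '^([^ ]*/)?condor_had( |$)'; then
        had=yes
    fi
    if printf '%s\n' "$procs" | grep -Eq '^([^ ]*/)?condor_negotiator( |$)'; then
        negotiator=yes
    fi
    echo "had=$had negotiator=$negotiator"
}
```

On a live node this could be run as "ps -e -o args= | ha_daemon_state". The state described in this report would show up on <node1> as had=yes negotiator=no.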
  
Actual results:
Only the negotiator stops; the HAD is still running on the master node (<node1>).

Expected results:
Both the HAD and the negotiator should stop.

Additional info:
After restarting condor on all machines, it works correctly.
After some checking, I found that it happens only on a RHEL 6 64-bit machine; the logs from that machine are in the attachment.

Comment 1 Robert Rati 2012-10-10 15:00:23 UTC
I've been unable to reproduce this issue with 3- and 4-node pools, including the original node that saw the problem.