Bug 766629

Summary: RH HA + HA Scheduler - 2 schedulers in pool after node with scheduler died
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor
Assignee: grid-maint-list <grid-maint-list>
Status: CLOSED WONTFIX
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Priority: low
Version: 2.1
CC: matt, trusnak, tstclair
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2016-05-26 20:23:33 UTC
Attachments:
configuration of cluster nodes

Description Martin Kudlej 2011-12-12 13:42:31 UTC
Created attachment 545715 [details]
configuration of cluster nodes

Description of problem:
There are two schedulers (one live and one dead) in the pool after killing the node that was running the scheduler.

Version-Release number of selected component (if applicable):
condor-7.6.5-0.8.el6

How reproducible:
100%

Steps to Reproduce:
1. Set up the HA scheduler with RH HA so that the scheduler runs on one node of the cluster.

2. Kill the node running the scheduler: virsh destroy (for a cluster with nodes in KVM virtualization), or
iptables --flush
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT DROP
for a cluster of physical machines.
In this case this was run on the _202_ node.

3. The cluster moves the service to another node.

4. Check how many schedulers are in the pool:

$ condor_status -schedd -l | grep -i machine
Machine = "_202_" <- this machine has been destroyed and should not be there
Machine = "_205_" <- this is the currently running scheduler

$ condor_q -name ha_schedd@


-- Schedd: ha_schedd@ : <_205_:52784>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
<waiting for timeout of query on dead _202_ node>
...
-- Failed to fetch ads from: <_202_:54520> : _202_
CEDAR:6001:Failed to connect to <_202_:54520>
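The check in step 4 can be scripted. A minimal sketch, assuming the `condor_status -schedd -l` output format shown above (one `Machine = "..."` attribute per schedd ad); the `check_single_schedd` helper name is mine, not part of condor:

```python
import subprocess

def live_schedd_machines(output):
    """Extract machine names from `Machine = "..."` lines in classad output."""
    machines = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith('Machine ='):
            # Line looks like: Machine = "hostname"
            machines.append(line.split('"')[1])
    return machines

def check_single_schedd():
    """Fail if more than one schedd ad (e.g. a dead one) is still in the pool."""
    out = subprocess.run(
        ["condor_status", "-schedd", "-l"],
        capture_output=True, text=True, check=True
    ).stdout
    machines = live_schedd_machines(out)
    if len(machines) != 1:
        raise RuntimeError("expected 1 schedd ad, found %d: %s"
                           % (len(machines), machines))
    return machines[0]
```

Against the output above this would report two ads (`_202_` and `_205_`) and fail, matching the observed behavior.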
  
Actual results:
A dead scheduler remains in the pool in the condor_status output, and condor_q also tries to query it.

Expected results:
There is just one live scheduler in the pool.

Additional info: This test case passes when the HA scheduler is configured without RH HA.
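For comparison, condor's own HA schedd setup (the configuration that passes this test case) uses a lock file on shared storage so that only one node runs the schedd at a time. A sketch of the relevant condor_config entries; the shared-filesystem path /share/spool and the timing values are assumed examples, not taken from the attached configuration:

```
# Let the condor_master manage schedd failover itself
MASTER_HA_LIST = SCHEDD
# Job queue and lock must live on storage shared by all HA nodes (example path)
SPOOL = /share/spool
HA_LOCK_URL = file:/share/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
# Fixed name so the schedd ad is the same regardless of which node runs it
SCHEDD_NAME = ha_schedd@
# Example lease timings: how long a lock is held and how often it is polled
HA_LOCK_HOLD_TIME = 300
HA_POLL_PERIOD = 60
```

With RH HA instead managing the service, this lock-based handover is bypassed, which may be why the dead node's schedd ad is never invalidated in the collector.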

Comment 1 Anne-Louise Tangring 2016-05-26 20:23:33 UTC
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.