Bug 474186

Summary: Additional FAQ on HA Schedd issue
Product: Red Hat Enterprise MRG
Reporter: William Henry <whenry>
Component: Grid_User_Guide
Assignee: Lana Brindley <lbrindle>
Status: CLOSED CURRENTRELEASE
QA Contact: Jeff Needle <jneedle>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.1
CC: mhideo
Target Milestone: 1.1.1
Keywords: Documentation
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-12-05 06:13:27 UTC Type: ---

Description William Henry 2008-12-02 17:45:57 UTC
Description of problem:
In a High Availability scheduler setup with two nodes (A and B), condor starts on
node A and brings up the schedd; condor is then started on node B. On node B,
the schedd continually attempts to start and exits with status 0. Won't failover fail to occur, since the schedd on node B never starts? Why is my schedd on node B continually failing?


The issue is that the naming is probably off, i.e. the two nodes have two different HA schedd names, so the second continually tries to come up but can't due to lock conflicts. There is really only one schedd for HA, e.g. ha-schedd@. What's happening is that B's master tries to launch a different schedd, e.g. ha_schedd@, but it can't because of the lock clash. When the same name is used for the 'one' HA schedd, the master on B figures it out, i.e. it sees that there is already a ha-schedd@ running. It will only start the failover schedd when the primary fails.

Config with the bouncing secondary:

A's config file entry for SCHEDD_NAME:
SCHEDD_NAME = ha-schedd@

B's config file entry for SCHEDD_NAME:
SCHEDD_NAME = ha_schedd@


Change so that both use the same name:
SCHEDD_NAME = ha-schedd@
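
The name mismatch above is easy to miss by eye (a hyphen versus an underscore). As a rough sketch only, and not part of Condor itself, the following Python compares the SCHEDD_NAME setting from the two nodes' config text. The function names `schedd_name` and `ha_names_match` are hypothetical, and the parser assumes simple `KEY = value` lines with `#` comments; it does not expand macros or follow included files.

```python
def schedd_name(config_text):
    """Return the last SCHEDD_NAME value found in the config text, or None.

    Assumption: settings are plain 'KEY = value' lines and a later
    assignment overrides an earlier one. Macros are not expanded.
    """
    name = None
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip().upper() == "SCHEDD_NAME":
            name = value.strip()
    return name


def ha_names_match(config_a, config_b):
    """True only if both configs set SCHEDD_NAME to the same value."""
    name_a = schedd_name(config_a)
    name_b = schedd_name(config_b)
    return name_a is not None and name_a == name_b


# The two entries from this report, plus the corrected entry for node B.
node_a = "SCHEDD_NAME = ha-schedd@"
node_b_broken = "SCHEDD_NAME = ha_schedd@"
node_b_fixed = "SCHEDD_NAME = ha-schedd@"

print(ha_names_match(node_a, node_b_broken))  # False
print(ha_names_match(node_a, node_b_fixed))   # True
```

In practice the two strings would be read from each node's local config file; the paths depend on the installation, so they are left out here.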


Also, this does not preclude a pool from having other schedulers on other nodes besides the HA schedd (ha-schedd@). So you can have HA (on two nodes) and other schedds elsewhere.

This comes from:
https://bugzilla.redhat.com/show_bug.cgi?id=469765

Comment 1 Lana Brindley 2008-12-05 06:13:27 UTC
	<qandaentry>
			<question>
				<para>
					I have a High Availability setup, but sometimes the <command>schedd</command> keeps trying to start and exits with <parameter>status 0</parameter>. Why is this happening?
				</para>
			</question>
			<answer>
				<para>
					In a High Availability scheduler setup with two nodes (Node A and Node B), Condor starts on Node A and brings up the <command>schedd</command> before it starts on Node B. On Node B, the <command>schedd</command> then continually attempts to start and exits with <parameter>status 0</parameter>.
				</para>
				<para>
					This can be caused by the two nodes using different HA <command>schedd</command> names. In this case, the <command>schedd</command> on Node B will continually try to start, but will not be able to because of lock conflicts.
				</para>
				<para>
					This problem can be solved by using the same name for the HA <command>schedd</command> on both nodes. The Condor master on Node B will then recognize that the HA <command>schedd</command> is already running, and will start its own only if the primary fails. Change the <command>SCHEDD_NAME</command> configuration entry on both nodes so that the name is identical.
				</para>
				<para>
					Note that this configuration still allows other schedulers to run on other nodes in the pool, in addition to the HA <command>schedd</command>. So you can have HA (on two nodes) and other <command>schedd</command>s elsewhere.
				</para>
			</answer>
		</qandaentry>

LKB