Bug 474186 - Additional FAQ on HA Schedd issue
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: Grid_User_Guide
Version: 1.1
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: 1.1.1
Assigned To: Lana Brindley
QA Contact: Jeff Needle
Keywords: Documentation
Reported: 2008-12-02 12:45 EST by William Henry
Modified: 2013-10-23 19:11 EDT (History)

Doc Type: Bug Fix
Last Closed: 2008-12-05 01:13:27 EST
Description William Henry 2008-12-02 12:45:57 EST
Description of problem:
In a High Availability Scheduler setup with two nodes (A and B), Condor starts on node A and brings up the schedd; Condor is then started on node B. On node B, the schedd continually attempts to start and exits with status 0. Failover will never occur, since the schedd on node B never starts. Why is the schedd on node B continually failing?


The issue is that the naming is off: the two nodes are configured with two different HA schedd names, so the second schedd continually tries to come up but cannot, due to lock conflicts. There is really only one schedd in an HA setup, e.g. ha-schedd@. What happens is that B's master tries to launch what it thinks is a different schedd, e.g. ha_schedd@, but the start fails because of the lock clash. With the same name used for the one HA schedd, the master on B figures it out, i.e. it sees that ha-schedd@ is already running, and it will only start its schedd for failover when the primary fails.

Config with the bouncing secondary:

A's config file entry for SCHEDD_NAME:
SCHEDD_NAME = ha-schedd@

B's config file entry for SCHEDD_NAME:
SCHEDD_NAME = ha_schedd@


Change so that both use the same name:
SCHEDD_NAME = ha-schedd@


Note that this does not preclude a pool from having other schedulers on nodes besides the HA schedd ha-schedd@. You can have HA (on two nodes) and other schedds elsewhere.
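For reference, a sketch of what a fuller HA schedd configuration might look like, identical on both nodes. The shared spool path here is hypothetical and must be on storage visible to both nodes; SCHEDD_NAME, MASTER_HA_LIST, HA_LOCK_URL, and VALID_SPOOL_FILES are the standard Condor high-availability job queue settings, and the lock file under the shared spool is what produces the lock clash described above when the names differ:

SCHEDD_NAME = ha-schedd@
MASTER_HA_LIST = SCHEDD
SPOOL = /shared/condor/spool
HA_LOCK_URL = file:/shared/condor/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock

With this in place, whichever master grabs the lock first runs the one ha-schedd@; the other master waits and takes over only when the lock is released or expires.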

This comes from:
https://bugzilla.redhat.com/show_bug.cgi?id=469765
Comment 1 Lana Brindley 2008-12-05 01:13:27 EST
	<qandaentry>
			<question>
				<para>
					I have a High Availability setup, but sometimes the <command>schedd</command> keeps trying to start but exits with <parameter>status 0</parameter>. Why is this happening?
				</para>
			</question>
			<answer>
				<para>
					In a High Availability Scheduler setup with two nodes (Node A and Node B), Condor starts on Node A and brings up the <command>schedd</command> before it starts on Node B. On Node B, the <command>schedd</command> continually attempts to start and exits with <parameter>status 0</parameter>.
				</para>
				<para>
					This can be caused by the two nodes using different HA <command>schedd</command> names. In this case, the <command>schedd</command> on Node B will continually try to start, but will not be able to because of lock conflicts.
				</para>
				<para>
					This problem can be solved by using the same name for the <command>schedd</command> on both nodes. This will make the <command>schedd</command> on Node B realize that one is already running, and it doesn&#39;t need to start. Change the <command>SCHEDD_NAME</command> configuration entry on both nodes so that the name is identical.
				</para>
				<para>
					Note that this configuration still allows other schedulers to run on other nodes besides the HA <command>schedd</command>. So you can have HA (on two nodes) and other <command>schedd</command>s elsewhere.
				</para>
			</answer>
		</qandaentry>

LKB
