Bug 474186 - Additional FAQ on HA Schedd issue
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: Grid_User_Guide
Version: 1.1
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: 1.1.1
Assigned To: Lana Brindley
QA Contact: Jeff Needle
Keywords: Documentation
Reported: 2008-12-02 12:45 EST by William Henry
Modified: 2013-10-23 19:11 EDT (History)

Doc Type: Bug Fix
Last Closed: 2008-12-05 01:13:27 EST
Description William Henry 2008-12-02 12:45:57 EST
Description of problem:
In a High Availability Scheduler setup with two nodes (A and B), Condor starts on node A and brings up the schedd; Condor is then started on node B. On node B, the schedd continually attempts to start and exits with status 0. Failover will never occur, since the schedd on node B never starts. Why is the schedd on node B continually failing?


The issue is that the naming is off: the two nodes are configured with two different HA schedd names, so the second schedd continually tries to come up but cannot, due to lock conflicts. There is really only one schedd in an HA setup, e.g. ha-schedd@. What happens is that B's master tries to launch what it thinks is a different schedd, e.g. ha_schedd@, but the start fails because of the lock clash. With the same name used for the one HA schedd, the master on B figures it out, i.e. it sees that ha-schedd@ is already running, and it will only start its schedd for failover when the primary fails.

Config with the bouncing secondary:

A's config file entry for SCHEDD_NAME:
SCHEDD_NAME = ha-schedd@

B's config file entry for SCHEDD_NAME:
SCHEDD_NAME = ha_schedd@


Change so that both use the same name:
SCHEDD_NAME = ha-schedd@


Note that this does not preclude a pool from having other schedulers on nodes besides the HA schedd ha-schedd@. You can have HA (on two nodes) and other schedds elsewhere.
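For reference, a sketch of what a fuller HA schedd configuration might look like, identical on both nodes. The shared spool path here is hypothetical and must be on storage visible to both nodes; SCHEDD_NAME, MASTER_HA_LIST, HA_LOCK_URL, and VALID_SPOOL_FILES are the standard Condor high-availability job queue settings, and the lock file under the shared spool is what produces the lock clash described above when the names differ:

SCHEDD_NAME = ha-schedd@
MASTER_HA_LIST = SCHEDD
SPOOL = /shared/condor/spool
HA_LOCK_URL = file:/shared/condor/spool
VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock

With this in place, whichever master grabs the lock first runs the one ha-schedd@; the other master waits and takes over only when the lock is released or expires.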

This comes from:
https://bugzilla.redhat.com/show_bug.cgi?id=469765
Comment 1 Lana Brindley 2008-12-05 01:13:27 EST
	<qandaentry>
			<question>
				<para>
					I have a High Availability setup, but sometimes the <command>schedd</command> keeps trying to start but exits with <parameter>status 0</parameter>. Why is this happening?
				</para>
			</question>
			<answer>
				<para>
					In a High Availability Scheduler setup with two nodes (Node A and Node B), Condor starts on Node A and brings up the <command>schedd</command> before it starts on Node B. On Node B, the <command>schedd</command> continually attempts to start and exits with <parameter>status 0</parameter>.
				</para>
				<para>
					This can be caused by the two nodes using different HA <command>schedd</command> names. In this case, the <command>schedd</command> on Node B will continually try to start, but will not be able to because of lock conflicts.
				</para>
				<para>
					This problem can be solved by using the same name for the <command>schedd</command> on both nodes. This will make the <command>schedd</command> on Node B realize that one is already running, and it doesn&#39;t need to start. Change the <command>SCHEDD_NAME</command> configuration entry on both nodes so that the name is identical.
				</para>
				<para>
					Note that this configuration still allows other schedulers to run on other nodes besides the HA <command>schedd</command>. So you can have HA (on two nodes) and other <command>schedd</command>s elsewhere.
				</para>
			</answer>
		</qandaentry>

LKB
