Bug 496227

Summary:	Remote Config: HA Schedd lock period too long
Product:	Red Hat Enterprise MRG	Reporter:	Matthew Farrellee <matt>
Component:	grid	Assignee:	Robert Rati <rrati>
Status:	CLOSED ERRATA	QA Contact:	Martin Kudlej <mkudlej>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	1.1.1	CC:	iboverma, lans.carstensen, lbrindle, mkudlej, tao
Target Milestone:	1.2
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Grid bug fix C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods. C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition. F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) R: Failover works more reliably HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-12-03 09:16:12 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	527551

Description Matthew Farrellee 2009-04-17 12:40:09 UTC

In an HA Schedd setup the condor_master acquires a lock allowing it to start a condor_schedd. Two parameters control that lock: 1) HA_LOCK_HOLD_TIME, which specifies how long until the lock is stale and can be acquired; and 2) HA_POLL_PERIOD, which specifies how often the lock is refreshed or checked for staleness.

The problem is those params do not have sane defaults.

HA_LOCK_HOLD_TIME defaults to 1 hour (3600 seconds) and HA_POLL_PERIOD to 5 minutes (300 seconds).

For the fail-over to work well, the HA_LOCK_HOLD_TIME should be less than the lease duration for most jobs - meaning the schedd will fail over before starters start giving up on jobs. Also, the HA_POLL_PERIOD should be a third of the HA_LOCK_HOLD_TIME, or maybe a bit shorter.

Saner defaults for these params would be HA_LOCK_HOLD_TIME = 300 seconds and HA_POLL_PERIOD = 60 seconds, and they could be pushed down even further if faster fail-over is required.
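
The two rules of thumb above can be written as a quick sanity check. This is only a sketch: check_ha_settings is a hypothetical helper, not part of Condor, and the 1200-second job lease duration is an assumed illustrative value rather than a figure from this report.

```python
def check_ha_settings(lock_hold_time, poll_period, job_lease_duration):
    """Return warnings for the rules of thumb described in this report.

    All values are in seconds. job_lease_duration is whatever lease the
    jobs in the pool typically use (an assumption supplied by the caller).
    """
    warnings = []
    # The schedd should fail over before starters give up on their jobs.
    if lock_hold_time >= job_lease_duration:
        warnings.append(
            "HA_LOCK_HOLD_TIME should be less than the job lease duration")
    # The poll period should be about a third of the hold time, or shorter.
    if poll_period * 3 > lock_hold_time:
        warnings.append(
            "HA_POLL_PERIOD should be at most a third of HA_LOCK_HOLD_TIME")
    return warnings


# Assumed lease of 1200 seconds, for illustration only.
old_defaults = check_ha_settings(3600, 300, job_lease_duration=1200)
new_defaults = check_ha_settings(300, 60, job_lease_duration=1200)
print(old_defaults)
print(new_defaults)
```

With that assumed lease, the old defaults trip the first rule (a one-hour hold time outlasts the lease), while the proposed 300/60 pair passes both checks.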

A danger here is that the lock is stored on a shared file system, and a split in the file system lasting longer than the HA_LOCK_HOLD_TIME could result in multiple schedds running, which would be bad. Also, clock skew between machines in the HA Schedd setup could cause inappropriate lock acquisition.
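
To illustrate the clock skew concern, here is a minimal sketch. It assumes staleness is judged by comparing a timestamp on the lock with the checking machine's local clock; that is an assumption made for illustration, not a description of how condor_master actually implements the check. lock_is_stale and all the numbers are hypothetical.

```python
def lock_is_stale(lock_timestamp, local_now, lock_hold_time):
    """True if the lock looks older than lock_hold_time to this machine."""
    return local_now - lock_timestamp > lock_hold_time


HA_LOCK_HOLD_TIME = 300   # seconds, the proposed default

refreshed_at = 1000       # holder refreshed the lock at this time
true_now = 1030           # 30 seconds later, the standby checks the lock
skew = 400                # standby's clock runs 400 seconds ahead (assumed)

# With synchronized clocks the lock is fresh, so the standby leaves it alone.
seen_without_skew = lock_is_stale(refreshed_at, true_now, HA_LOCK_HOLD_TIME)

# With the skewed clock the same lock looks 430 seconds old, so the standby
# would consider it stale and could start a second schedd.
seen_with_skew = lock_is_stale(refreshed_at, true_now + skew, HA_LOCK_HOLD_TIME)

print(seen_without_skew, seen_with_skew)
```

The shorter the hold time, the less skew it takes to cross the threshold, which is one reason to keep clocks synchronized when lowering HA_LOCK_HOLD_TIME.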

Comment 1 Robert Rati 2009-09-17 20:15:02 UTC

Fixed in:
condor-remote-configuration-1.0-17

Comment 3 Martin Kudlej 2009-10-22 13:49:44 UTC

I've tried it on condor-remote-configuration-server-1.0-14 on RHEL5.4 and
condor-remote-configuration-1.0-14 on RHEL4.8 (i386 x x86_64) and there weren't
any variables named HA_LOCK_HOLD_TIME or HA_POLL_PERIOD.
I've tried it on condor-remote-configuration(-server)-1.0-22 and there are HA_LOCK_HOLD_TIME = 300 and HA_POLL_PERIOD = 60. Is this enough to verify the bug or could you describe here any testing scenario please?

Comment 4 Robert Rati 2009-10-23 20:55:13 UTC

Initiate a failover (either by shutting down the Schedd or by killing it) and a new Schedd should start within 6 minutes.
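
One plausible reading of the 6-minute figure, sketched as arithmetic: the lock can take up to HA_LOCK_HOLD_TIME to go stale after its last refresh, and the standby may need up to one more HA_POLL_PERIOD to notice. The helper name is hypothetical, and the bound is an assumption about the worst case rather than a guarantee from Condor.

```python
def worst_case_failover_seconds(lock_hold_time, poll_period):
    """Upper bound on failover delay under the assumption stated above."""
    return lock_hold_time + poll_period


new_bound = worst_case_failover_seconds(300, 60)     # new defaults
old_bound = worst_case_failover_seconds(3600, 300)   # old defaults

print(new_bound / 60, "minutes with the new defaults")
print(old_bound / 60, "minutes with the old defaults")
```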

Comment 5 Martin Kudlej 2009-10-27 14:32:31 UTC

Testing scenario:
1st machine RHEL 5.4 i386/x86_64, 2nd machine RHEL 4.8 x86_64/i386.
HA SCHEDD configured via condor_configure_node from RHEL5.4.
Shut down condor on the 1st machine; in less than 6 minutes the schedd starts on the 2nd machine, so it works as expected (condor-7.4.1-0.2).

Tested it with condor-7.2.2-0.9 and it doesn't work.
Tested it with condor-7.4.1-0.2 and it works. -->VERIFIED

Comment 6 Irina Boverman 2009-10-28 18:07:16 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)

Comment 7 Lana Brindley 2009-11-05 03:45:05 UTC

Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
+Grid bug fix
+
+C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
+C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
+F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
+R: Failover works more reliably
+
+HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.

Comment 8 errata-xmlrpc 2009-12-03 09:16:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html