496227 – Remote Config: HA Schedd lock period too long

Summary: Remote Config: HA Schedd lock period too long

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	grid
Sub Component:
Version:	1.1.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	1.2
Target Release:	---
Assignee:	Robert Rati
QA Contact:	Martin Kudlej
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	527551

Reported:	2009-04-17 12:40 UTC by Matthew Farrellee
Modified:	2018-10-20 03:53 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Grid bug fix

C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
R: Failover works more reliably

HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.
Clone Of:
Environment:
Last Closed:	2009-12-03 09:16:12 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2009:1633	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging and Grid Version 1.2	2009-12-03 09:15:33 UTC

Description Matthew Farrellee 2009-04-17 12:40:09 UTC

In an HA Schedd setup the condor_master acquires a lock allowing it to start a condor_schedd. Two parameters control that lock: 1) HA_LOCK_HOLD_TIME, which specifies how long until the lock is stale and can be acquired; and 2) HA_POLL_PERIOD, which specifies how often the lock is refreshed or checked for staleness.

The problem is those params do not have sane defaults.

HA_LOCK_HOLD_TIME defaults to 1 hour (3600 seconds) and HA_POLL_PERIOD to 5 minutes (300 seconds).

For the fail-over to work well, the HA_LOCK_HOLD_TIME should be less than the lease duration for most jobs - meaning the schedd will fail over before starters start giving up on jobs. Also, the HA_POLL_PERIOD should be a third of the HA_LOCK_HOLD_TIME, or maybe a bit shorter.
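
The two rules of thumb above can be written down as a small check. This is only a sketch: the function name check_ha_timing and its job_lease_duration argument are hypothetical and not part of Condor, and the 1200-second lease used below is an illustrative value, not a documented default.

```python
def check_ha_timing(lock_hold_time, poll_period, job_lease_duration):
    """Return a list of warnings for HA Schedd timing values (all in seconds)."""
    warnings = []
    # Rule 1: the lock should go stale before starters give up on jobs.
    if lock_hold_time >= job_lease_duration:
        warnings.append("HA_LOCK_HOLD_TIME should be less than the job lease duration")
    # Rule 2: the poll period should be about a third of the hold time, or shorter.
    if poll_period * 3 > lock_hold_time:
        warnings.append("HA_POLL_PERIOD should be at most a third of HA_LOCK_HOLD_TIME")
    return warnings

# Illustrative lease of 1200 seconds (an assumption, not a documented default).
old_defaults = check_ha_timing(3600, 300, 1200)
new_defaults = check_ha_timing(300, 60, 1200)
```

Under that assumed lease, the old defaults trip the first rule while the proposed values pass both.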

Saner defaults for these params would be HA_LOCK_HOLD_TIME = 300 seconds and HA_POLL_PERIOD = 60 seconds; they could be pushed down even further if faster fail-over is required.

A danger here is that the lock is stored on a shared file system, and a split in the file system lasting longer than the HA_LOCK_HOLD_TIME could result in multiple schedds running, which would be bad. Also, clock skew between machines in the HA Schedd setup could cause inappropriate lock acquisition.
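
To illustrate the clock skew concern, here is a simplified model in which a machine judges staleness by comparing the lock's last refresh time against its own clock. The function lock_looks_stale is hypothetical, and this is an assumption about the general mechanism, not Condor's actual implementation.

```python
def lock_looks_stale(last_refresh, checker_now, lock_hold_time=300):
    """Simplified model: the lock is stale once its age exceeds the hold time."""
    return checker_now - last_refresh > lock_hold_time

# The holder refreshed the lock at t=1000 on its own clock, 10 real seconds ago.
last_refresh = 1000
real_now = 1010

# A checker whose clock agrees with the holder sees a fresh lock.
in_sync = lock_looks_stale(last_refresh, real_now)

# A checker whose clock runs 400 seconds fast sees the same lock as stale
# and would try to acquire it while the holder is still alive.
skewed = lock_looks_stale(last_refresh, real_now + 400)
```

In this model any skew larger than the hold time minus the refresh interval can make a healthy lock look stale, so shorter hold times leave less room for skew.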

Comment 1 Robert Rati 2009-09-17 20:15:02 UTC

Fixed in:
condor-remote-configuration-1.0-17

Comment 3 Martin Kudlej 2009-10-22 13:49:44 UTC

I've tried it on condor-remote-configuration-server-1.0-14 on RHEL5.4 and condor-remote-configuration-1.0-14 on RHEL4.8 (i386 x x86_64), and there weren't any variables named HA_LOCK_HOLD_TIME or HA_POLL_PERIOD.
I've tried it on condor-remote-configuration(-server)-1.0-22, and there HA_LOCK_HOLD_TIME = 300 and HA_POLL_PERIOD = 60 are set. Is this enough to verify the bug, or could you please describe a testing scenario here?

Comment 4 Robert Rati 2009-10-23 20:55:13 UTC

Initiate a failover (either by shutting down the Schedd or by killing it), and a new Schedd should start within 6 minutes.
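
The 6-minute figure is consistent with adding the two new defaults together. A rough sketch of that arithmetic, assuming the lock goes stale HA_LOCK_HOLD_TIME after its last refresh and the standby master notices at its next poll (the function name worst_case_failover is hypothetical):

```python
def worst_case_failover(lock_hold_time, poll_period):
    """Approximate upper bound, in seconds, on how long a takeover can take."""
    # The lock must first go stale, then the standby must poll and see it.
    return lock_hold_time + poll_period

new_bound = worst_case_failover(300, 60)     # new defaults
old_bound = worst_case_failover(3600, 300)   # old defaults

new_minutes = new_bound / 60
old_minutes = old_bound / 60
```

With the new defaults this gives 360 seconds (6 minutes); with the old ones it gives 3900 seconds (65 minutes).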

Comment 5 Martin Kudlej 2009-10-27 14:32:31 UTC

Testing scenario:
1st machine RHEL 5.4 i386/x86_64, 2nd machine RHEL 4.8 x86_64/i386.
HA SCHEDD configured via condor_configure_node from RHEL5.4.
Shut down condor on the 1st machine; within 6 minutes the schedd starts on the 2nd machine, so it works as expected (condor-7.4.1-0.2).

Tested it with condor-7.2.2-0.9 and it doesn't work.
Tested it with condor-7.4.1-0.2 and it works. -->VERIFIED

Comment 6 Irina Boverman 2009-10-28 18:07:16 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)

Comment 7 Lana Brindley 2009-11-05 03:45:05 UTC

Release note updated. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
+Grid bug fix
+
+C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
+C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
+F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
+R: Failover works more reliably
+
+HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.

Comment 8 errata-xmlrpc 2009-12-03 09:16:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html
