Bug 496227

Summary: Remote Config: HA Schedd lock period too long
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: grid
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.1.1
CC: iboverma, lans.carstensen, lbrindle, mkudlej, tao
Target Milestone: 1.2
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix

C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
R: Failover works more reliably

HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-03 09:16:12 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 527551    

Description Matthew Farrellee 2009-04-17 12:40:09 UTC
In a HA Schedd setup the condor_master acquires a lock allowing it to start a condor_schedd. That lock has two parameters that control it: 1) HA_LOCK_HOLD_TIME, which specifies how long until the lock is stale and can be acquired; and, 2) HA_POLL_PERIOD, which specifies how often the lock is refreshed or checked for staleness.

The problem is those params do not have sane defaults.

HA_LOCK_HOLD_TIME defaults to 1 hour (3600 seconds) and HA_POLL_PERIOD to 5 minutes (300 seconds).

For the fail-over to work well, the HA_LOCK_HOLD_TIME should be less than the lease duration for most jobs - meaning the schedd will fail over before starters start giving up on jobs. Also, the HA_POLL_PERIOD should be a third of the HA_LOCK_HOLD_TIME, or maybe a bit shorter.
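
The two rules of thumb above can be written as a quick sanity check. This is only a sketch: the function name is made up for illustration, and the job lease of 1200 seconds is an assumed example value, not something stated in this report.

```python
def check_ha_settings(lock_hold_time, poll_period, job_lease_duration):
    """Return warnings for the two rules of thumb (all values in seconds)."""
    warnings = []
    # Rule 1: the lock should go stale before starters give up on their jobs.
    if lock_hold_time >= job_lease_duration:
        warnings.append("HA_LOCK_HOLD_TIME is not shorter than the job lease")
    # Rule 2: the poll period should be at most a third of the hold time.
    if poll_period * 3 > lock_hold_time:
        warnings.append("HA_POLL_PERIOD is more than a third of HA_LOCK_HOLD_TIME")
    return warnings


ASSUMED_JOB_LEASE = 1200  # hypothetical lease duration, for illustration only

old_warnings = check_ha_settings(3600, 300, ASSUMED_JOB_LEASE)  # old defaults
new_warnings = check_ha_settings(300, 60, ASSUMED_JOB_LEASE)    # proposed defaults
print(old_warnings)
print(new_warnings)
```

Under that assumed lease, the old defaults trip the first rule and the proposed values pass both.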

More sane defaults for these params should be HA_LOCK_HOLD_TIME = 300 seconds and HA_POLL_PERIOD = 60 seconds, but could be pushed down even further if faster fail-over is required.

A danger here is that the lock is stored on a shared file system, and a split in the file system for a period longer than the HA_LOCK_HOLD_TIME could result in multiple schedds running, which would be bad. Also, clock skew between machines in the HA Schedd setup could cause inappropriate lock acquisition.
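
To illustrate the clock skew concern, here is a simplified model. It assumes (this is not confirmed in the report) that staleness is judged by comparing a timestamp written with the lock holder's clock against the checking machine's clock. Both function names are hypothetical.

```python
def lock_looks_stale(holder_refresh_time, checker_now, lock_hold_time):
    """True if the checking machine would consider the lock stale."""
    return checker_now - holder_refresh_time > lock_hold_time


def max_tolerable_skew(lock_hold_time, poll_period):
    """Skew the checker's clock can run ahead before a healthy lock,
    refreshed every poll_period seconds, can appear stale (simplified model)."""
    return lock_hold_time - poll_period


# Holder refreshed the lock at t=1000 on its own clock; 60 real seconds pass.
refresh, elapsed = 1000, 60
small_skew = lock_looks_stale(refresh, refresh + elapsed + 100, 300)
large_skew = lock_looks_stale(refresh, refresh + elapsed + 250, 300)
print(small_skew, large_skew)
print(max_tolerable_skew(300, 60), max_tolerable_skew(3600, 300))
```

In this model, shortening the hold time from 3600 to 300 seconds also shrinks the skew margin, so shorter settings make synchronized clocks more important.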

Comment 1 Robert Rati 2009-09-17 20:15:02 UTC
Fixed in:
condor-remote-configuration-1.0-17

Comment 3 Martin Kudlej 2009-10-22 13:49:44 UTC
I've tried it on condor-remote-configuration-server-1.0-14 on RHEL5.4 and
condor-remote-configuration-1.0-14 on RHEL4.8 (i386 x x86_64) and there weren't
any variables named HA_LOCK_HOLD_TIME or HA_POLL_PERIOD.
I've tried it on condor-remote-configuration(-server)-1.0-22 and there are HA_LOCK_HOLD_TIME = 300 and HA_POLL_PERIOD = 60. Is this enough to verify the bug or could you describe here any testing scenario please?

Comment 4 Robert Rati 2009-10-23 20:55:13 UTC
Initiate a failover (either by shutting down the Schedd, or killing it) and a new Schedd should start within 6 minutes
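
One way to read the 6 minute figure: in the worst case the lock has just been refreshed when the schedd dies, so it takes HA_LOCK_HOLD_TIME to go stale, plus up to one HA_POLL_PERIOD before the standby master next checks it. The sketch below does that arithmetic. The bound is an interpretation of the parameter descriptions in this report, not a documented guarantee, and the function name is made up.

```python
def worst_case_failover(lock_hold_time, poll_period):
    """Upper bound in seconds: time for the lock to go stale plus one poll."""
    return lock_hold_time + poll_period


new_bound = worst_case_failover(300, 60)    # new defaults
old_bound = worst_case_failover(3600, 300)  # previous defaults
print(new_bound / 60, "minutes with the new defaults")
print(old_bound / 60, "minutes with the old defaults")
```

With the new defaults the bound is 360 seconds, which matches the 6 minutes quoted above. The old defaults give 3900 seconds, or 65 minutes.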

Comment 5 Martin Kudlej 2009-10-27 14:32:31 UTC
Testing scenario:
1st machine RHEL 5.4 i386/x86_64, 2nd machine RHEL 4.8 x86_64/i386.
HA SCHEDD configured via condor_configure_node from RHEL5.4.
Shut down condor on 1st machine and within less than 6 minutes schedd starts on 2nd machine, so it works as we expected (condor-7.4.1-0.2). 

Tested it with condor-7.2.2-0.9 and it doesn't work.
Tested it with condor-7.4.1-0.2 and it works. -->VERIFIED

Comment 6 Irina Boverman 2009-10-28 18:07:16 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)

Comment 7 Lana Brindley 2009-11-05 03:45:05 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
+Grid bug fix
+
+C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
+C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
+F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
+R: Failover works more reliably
+
+HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.

Comment 8 errata-xmlrpc 2009-12-03 09:16:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html