Bug 496227 - Remote Config: HA Schedd lock period too long
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.1.1
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: 1.2
Target Release: ---
Assigned To: Robert Rati
QA Contact: Martin Kudlej
Depends On:
Blocks: 527551
Reported: 2009-04-17 08:40 EDT by Matthew Farrellee
Modified: 2010-10-23 05:03 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
R: Failover works more reliably
HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.
Last Closed: 2009-12-03 04:16:12 EST


Attachments: None

Description Matthew Farrellee 2009-04-17 08:40:09 EDT
In an HA Schedd setup the condor_master acquires a lock that allows it to start a condor_schedd. Two parameters control that lock: 1) HA_LOCK_HOLD_TIME, which specifies how long until the lock is considered stale and can be acquired by another condor_master; and 2) HA_POLL_PERIOD, which specifies how often the lock is refreshed or checked for staleness.

The problem is those params do not have sane defaults.

HA_LOCK_HOLD_TIME defaults to 1 hour (3600 seconds) and HA_POLL_PERIOD to 5 minutes (300 seconds).

For the fail-over to work well, the HA_LOCK_HOLD_TIME should be less than the lease duration for most jobs - meaning the schedd will fail over before starters start giving up on jobs. Also, the HA_POLL_PERIOD should be a third of the HA_LOCK_HOLD_TIME, or maybe a bit shorter.
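The rule of thumb above can be sketched as a small check. This is only an illustration: the function name `check_ha_settings` and the 1200-second job lease duration are assumptions made for the example, not values taken from this report, so substitute the lease duration your jobs actually use.

```python
def check_ha_settings(lock_hold_time, poll_period, job_lease_duration):
    """Return warnings for HA Schedd timing values (all in seconds).

    Encodes the two guidelines from the description:
      * HA_LOCK_HOLD_TIME should be less than the job lease duration
      * HA_POLL_PERIOD should be about a third of HA_LOCK_HOLD_TIME or shorter
    """
    warnings = []
    if lock_hold_time >= job_lease_duration:
        warnings.append(
            "HA_LOCK_HOLD_TIME should be less than the job lease duration"
        )
    if poll_period * 3 > lock_hold_time:
        warnings.append(
            "HA_POLL_PERIOD should be at most a third of HA_LOCK_HOLD_TIME"
        )
    return warnings


# Hypothetical lease duration, chosen only for illustration.
ASSUMED_JOB_LEASE_DURATION = 1200

# Old defaults: 3600 s hold time, 300 s poll period.
print(check_ha_settings(3600, 300, ASSUMED_JOB_LEASE_DURATION))
# Proposed defaults: 300 s hold time, 60 s poll period.
print(check_ha_settings(300, 60, ASSUMED_JOB_LEASE_DURATION))
```

Under the assumed lease duration, the old defaults trip the first check (the lock outlives the lease), while the proposed defaults pass both.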

Saner defaults for these params would be HA_LOCK_HOLD_TIME = 300 seconds and HA_POLL_PERIOD = 60 seconds, and they could be pushed down even further if faster fail-over is required.

A danger here is that the lock is stored on a shared file system, and a split in the file system for a period longer than the HA_LOCK_HOLD_TIME could result in multiple schedds running, which would be bad. Also, clock skew between machines in the HA Schedd setup could cause inappropriate lock acquisition.
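To make the clock-skew concern concrete, here is a toy model. It assumes, purely for illustration, that a standby judges staleness by comparing its own clock with the time of the last lock refresh; the actual condor_master logic is not described in this report and may differ. The function names are hypothetical.

```python
def lock_looks_stale(last_refresh, observer_now, lock_hold_time):
    """Toy staleness test: the lock is stale once it is older than the hold time."""
    return observer_now - last_refresh > lock_hold_time


def standby_sees_stale_lock(seconds_since_refresh, observer_skew, lock_hold_time):
    """Model a standby whose clock runs `observer_skew` seconds ahead of the holder.

    Times are measured on the lock holder's clock, with the last refresh at t = 0.
    """
    last_refresh = 0
    observer_now = seconds_since_refresh + observer_skew
    return lock_looks_stale(last_refresh, observer_now, lock_hold_time)


# The lock was refreshed 60 s ago, so the holder is still alive.
print(standby_sees_stale_lock(60, observer_skew=0, lock_hold_time=300))     # no skew
print(standby_sees_stale_lock(60, observer_skew=250, lock_hold_time=300))   # 250 s skew
print(standby_sees_stale_lock(60, observer_skew=250, lock_hold_time=3600))  # long hold time
```

In this model a 250-second skew is enough for the standby to take a live lock when the hold time is 300 seconds, but not when it is 3600 seconds. Shorter hold times give faster fail-over at the cost of less tolerance for skew, so clocks on the HA machines should be kept in sync.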
Comment 1 Robert Rati 2009-09-17 16:15:02 EDT
Fixed in:
condor-remote-configuration-1.0-17
Comment 3 Martin Kudlej 2009-10-22 09:49:44 EDT
I've tried it on condor-remote-configuration-server-1.0-14 on RHEL5.4 and
condor-remote-configuration-1.0-14 on RHEL4.8 (i386 x x86_64) and there weren't
any variables named HA_LOCK_HOLD_TIME or HA_POLL_PERIOD.
I've tried it on condor-remote-configuration(-server)-1.0-22 and there are HA_LOCK_HOLD_TIME = 300 and HA_POLL_PERIOD = 60. Is this enough to verify the bug, or could you describe a testing scenario here, please?
Comment 4 Robert Rati 2009-10-23 16:55:13 EDT
Initiate a failover (either by shutting down the Schedd, or killing it) and a new Schedd should start within 6 minutes
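The 6-minute figure is consistent with adding the two new defaults together. The sketch below assumes a simplified worst case in which the lock was refreshed just before the Schedd went away and the standby's poll lands just before the lock turns stale, so the standby waits one full hold time plus one full poll period. The function name is hypothetical.

```python
def worst_case_failover_seconds(lock_hold_time, poll_period):
    """Upper bound on fail-over delay in the simplified model described above."""
    return lock_hold_time + poll_period


new_defaults = worst_case_failover_seconds(300, 60)
old_defaults = worst_case_failover_seconds(3600, 300)

print(new_defaults, "seconds =", new_defaults / 60, "minutes")
print(old_defaults, "seconds =", old_defaults / 60, "minutes")
```

With the new defaults the bound is 360 seconds (6 minutes); with the old defaults it would have been 3900 seconds (65 minutes).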
Comment 5 Martin Kudlej 2009-10-27 10:32:31 EDT
Testing scenario:
1st machine RHEL 5.4 i386/x86_64, 2nd machine RHEL 4.8 x86_64/i386.
HA SCHEDD configured via condor_configure_node from RHEL5.4.
Shut down condor on 1st machine and within less than 6 minutes schedd starts on 2nd machine, so it works as we expected (condor-7.4.1-0.2). 

Tested it with condor-7.2.2-0.9 and it doesn't work.
Tested it with condor-7.4.1-0.2 and it works. -->VERIFIED
Comment 6 Irina Boverman 2009-10-28 14:07:16 EDT
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
Comment 7 Lana Brindley 2009-11-04 22:45:05 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
+Grid bug fix
+
+C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
+C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
+F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
+R: Failover works more reliably
+
+HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.
Comment 8 errata-xmlrpc 2009-12-03 04:16:12 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html
