In an HA Schedd setup, the condor_master acquires a lock that allows it to start a condor_schedd. Two parameters control that lock: 1) HA_LOCK_HOLD_TIME, which specifies how long until the lock is considered stale and can be acquired; and 2) HA_POLL_PERIOD, which specifies how often the lock is refreshed or checked for staleness. The problem is that these params do not have sane defaults: HA_LOCK_HOLD_TIME defaults to 1 hour (3600 seconds) and HA_POLL_PERIOD to 5 minutes (300 seconds). For fail-over to work well, HA_LOCK_HOLD_TIME should be less than the lease duration for most jobs, so that the schedd fails over before starters start giving up on jobs. Also, HA_POLL_PERIOD should be a third of HA_LOCK_HOLD_TIME, or maybe a bit shorter. Saner defaults would be HA_LOCK_HOLD_TIME = 300 seconds and HA_POLL_PERIOD = 60 seconds, and these could be pushed down even further if faster fail-over is required. A danger here is that the lock is stored on a shared file system, and a split in the file system lasting longer than HA_LOCK_HOLD_TIME could result in multiple schedds running, which would be bad. Also, clock skew between machines in the HA Schedd setup could cause inappropriate lock acquisition.
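The two rules of thumb above (hold time shorter than the job lease, poll period no more than a third of the hold time) can be sketched as a small check. This is only an illustration: the function name check_ha_lock_params and the 1200-second job lease are hypothetical values chosen for the example, not taken from Condor.

```python
def check_ha_lock_params(hold_time, poll_period, job_lease):
    """Return warnings for HA lock settings that break the rules of thumb.

    All arguments are in seconds. job_lease is the lease duration of a
    typical job; the caller must supply it (hypothetical input).
    """
    warnings = []
    # The lock should go stale before starters give up on their jobs.
    if hold_time >= job_lease:
        warnings.append(
            "HA_LOCK_HOLD_TIME should be less than the job lease duration"
        )
    # The poll period should be a third of the hold time, or shorter.
    if poll_period * 3 > hold_time:
        warnings.append(
            "HA_POLL_PERIOD should be at most a third of HA_LOCK_HOLD_TIME"
        )
    return warnings


ASSUMED_JOB_LEASE = 1200  # hypothetical lease, for illustration only

old_defaults = check_ha_lock_params(3600, 300, ASSUMED_JOB_LEASE)
proposed = check_ha_lock_params(300, 60, ASSUMED_JOB_LEASE)

print("old defaults:", old_defaults)
print("proposed:", proposed)
```

Under that assumed lease, the old defaults trip the first check (3600 seconds is longer than the lease) while the proposed 300/60 pair passes both.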
Fixed in: condor-remote-configuration-1.0-17
I've tried it on condor-remote-configuration-server-1.0-14 on RHEL5.4 and condor-remote-configuration-1.0-14 on RHEL4.8 (i386 x x86_64), and there weren't any variables named HA_LOCK_HOLD_TIME or HA_POLL_PERIOD. I've also tried it on condor-remote-configuration(-server)-1.0-22, and there HA_LOCK_HOLD_TIME = 300 and HA_POLL_PERIOD = 60 are set. Is this enough to verify the bug, or could you describe a testing scenario here, please?
Initiate a failover (either by shutting down the Schedd or by killing it), and a new Schedd should start within 6 minutes.
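The 6-minute figure is consistent with adding the two new defaults together. A rough sketch of that arithmetic, assuming the worst case is a lock refreshed just before the Schedd dies plus a standby that polls just before the lock goes stale (the function name is hypothetical, and this ignores clock skew and Schedd start-up time):

```python
def worst_case_failover_seconds(hold_time, poll_period):
    """Upper bound on the time before a standby can take the lock.

    Assumption: the lock must first go stale (hold_time), and the
    standby may then need up to one more poll to notice (poll_period).
    """
    return hold_time + poll_period


new_bound = worst_case_failover_seconds(300, 60)    # new defaults
old_bound = worst_case_failover_seconds(3600, 300)  # old defaults

print(f"new defaults: {new_bound} s ({new_bound / 60:.0f} min)")
print(f"old defaults: {old_bound} s ({old_bound / 60:.0f} min)")
```

Under the same assumption, the old defaults give a bound of 65 minutes, compared with 6 minutes for the new ones.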
Testing scenario: 1st machine RHEL 5.4 i386/x86_64, 2nd machine RHEL 4.8 x86_64/i386. HA Schedd configured via condor_configure_node from RHEL5.4. Shut down condor on the 1st machine, and in less than 6 minutes the schedd starts on the 2nd machine, so it works as expected (condor-7.4.1-0.2). Tested with condor-7.2.2-0.9 and it doesn't work. Tested with condor-7.4.1-0.2 and it works. -->VERIFIED
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,8 @@
-HA Schedd lock period has been shorten: HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required) (496227)
+Grid bug fix
+
+C: HA_LOCK_HOLD_TIME and HA_POLL_PERIOD do not have sensible default periods.
+C: A number of failover problems, such as two schedd's running simultaneously, or inappropriate lock acquisition.
+F: HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required)
+R: Failover works more reliably
+
+HA_LOCK_HOLD_TIME and HA_POLL_PERIOD had default values that could cause a range of problems with failover. HA Schedd lock period has been shortened. HA_LOCK_HOLD_TIME now defaults to 300 seconds, and HA_POLL_PERIOD to 60 seconds (these parameters could be changed to lower values if faster fail-over is required), and failover now works more reliably.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-1633.html