Description of problem:
The high availability condor_schedd setup relies on a condor_master process updating a timestamp on a file in a shared filesystem. If the delta between the current time and that timestamp exceeds a configured threshold, a secondary condor_master starts up an additional set of condor_schedd processes. Nothing is in place to fence or kill the original set of condor_schedd processes. Consider the following test:

1. Schedd HA node #1 is running with $(SPOOL) on an NFS hard mount.
2. The NFS server hangs for longer than the secondary condor_master's configured tolerance.
3. The secondary condor_master starts a duplicate set of condor_schedd daemons.
4. The NFS server recovers.
5. At this point there is no means to stop the original condor_schedd processes, nor the secondary ones. Both sets are concurrently (over)writing $(SPOOL)/job_queue.log.

Something needs to be added to fence the condor_schedd processes on the original node.

How reproducible:
100%. An event similar to the one above happened during our preventative maintenance period yesterday.

Steps to Reproduce:
See above.

Actual results:
See above.

Expected results:
Only one condor_schedd process running per configured $(SPOOL).
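For illustration, here is a minimal Python sketch of the take-over logic described above; it is not Condor source, and the file name, poll interval, and threshold are assumptions. The point is that the failover path contains no step that fences the schedd on the original node.

#!/usr/bin/env python
# Sketch only: illustrative names and values, not Condor's implementation.
import os, time, subprocess

LOCK_FILE = "/shared/spool/ha_lock"   # timestamp file on the shared filesystem (assumed path)
HA_LOCK_HOLD_TIME = 300               # seconds of staleness tolerated (assumed value)

def primary_heartbeat():
    """Primary master: touch the timestamp file periodically."""
    while True:
        os.utime(LOCK_FILE, None)     # blocks indefinitely if the NFS hard mount hangs
        time.sleep(60)

def secondary_watchdog():
    """Secondary master: take over when the timestamp goes stale."""
    while True:
        age = time.time() - os.stat(LOCK_FILE).st_mtime
        if age > HA_LOCK_HOLD_TIME:
            # Take-over: start a second condor_schedd against the same $(SPOOL).
            # Note there is no step here that fences or kills the schedd on the
            # original node -- the gap this report describes.
            subprocess.Popen(["condor_schedd"])
            return
        time.sleep(60)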
Locating the lock on the same mount as job_queue.log should avoid any overwriting. Theoretically (this needs verification), hard mount semantics may not prevent a race between Masters:

t0 - MasterOne obtains the lock
t1 - MasterTwo fails to obtain the lock
t2 - NFS server fails
t3 - MasterTwo tries to obtain the lock, MasterOne tries to update the lock, both block
t4 - NFS server returns, after the lock has expired
t5 - MasterTwo unblocks and obtains the lock
t6 - MasterOne updates the lock

At t6 both masters think they own the lock. Introducing an identifier that allows MasterOne to notice it has lost the lock would improve the situation. However, even then, between t5 and t6 there may be multiple copies of the managed daemon running. An acked fence event from MasterTwo to MasterOne, issued before MasterTwo starts the daemon, would address this. As an aside, fencing is not desirable when multiple managed daemons exist on a single node: an NFS failure affecting one daemon would trigger the fencing of all daemons.
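A rough sketch of the "identifier in the lock" idea, assuming a lock file on the shared mount that holds an "owner timestamp" pair (the names and on-disk format are made up for illustration, not Condor's actual mechanism):

#!/usr/bin/env python
# Sketch only: shows how a master could notice it lost the lock at renewal time.
import os, socket, time

LOCK_FILE = "/shared/spool/ha_lock"   # assumed path
LEASE = 300                           # seconds before the lock is considered expired (assumed)
MY_ID = "%s:%d" % (socket.gethostname(), os.getpid())

def read_lock():
    try:
        owner, stamp = open(LOCK_FILE).read().split()
        return owner, float(stamp)
    except (IOError, ValueError):
        return None, 0.0

def write_lock():
    tmp = LOCK_FILE + "." + MY_ID
    with open(tmp, "w") as f:
        f.write("%s %f" % (MY_ID, time.time()))
    os.rename(tmp, LOCK_FILE)         # rename is atomic on the NFS server

def try_acquire():
    # Note: the read-back check narrows but does not close the race described
    # above; two masters can still each see themselves as owner briefly.
    owner, stamp = read_lock()
    if owner is None or time.time() - stamp > LEASE:
        write_lock()
        owner, _ = read_lock()
    return owner == MY_ID

def renew():
    """Called periodically by the lock holder. Returns False if the lock was lost."""
    owner, _ = read_lock()
    if owner != MY_ID:
        return False                  # lost the lock while NFS was hung (t4/t5);
                                      # the caller must stop its condor_schedd now
    write_lock()
    return True

In this sketch the loss is only detected at the next renew() call, which is why the comment above still calls for an acked fence event from MasterTwo before it starts the daemon.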
Using Red Hat Cluster Suite to manage the HA schedd will address this issue, since the cluster manager can fence the failed node before relocating the service.