Bug 486487

Summary: Stale .schedd_address and .schedd_classad
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: condor
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: medium
Docs Contact:
Priority: low
Version: 1.1
CC: lans.carstensen, lbrindle, mkudlej
Target Milestone: 1.2
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix

C: Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster
C: When condor_q was run, it would fail to connect to the schedd, because it was checking the stale files first.
F: The log files are now stored in SPOOL instead of LOG
R: Multiple machines in a pool can now read the files, and stale files no longer cause a problem.

Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster. This caused condor_q to fail to connect to the schedd. The log files were moved from LOG to SPOOL, which allows multiple machines in a pool to read the files, and stale files no longer cause a problem.
Last Closed: 2009-12-03 09:19:27 UTC
Type: ---
Bug Blocks: 527551    

Description Matthew Farrellee 2009-02-19 23:21:05 UTC
Description of problem:

The Schedd writes a .schedd_address and a .schedd_classad file to the LOG directory. Tools on the host (condor_q, condor_submit, etc.) use these files to contact the Schedd; they look the Schedd up in the Collector only when the files aren't present.

If the Schedd crashes these files will not get removed. In a HA Scheduler setup a crashed Schedd on node-01 may be failed over to run on node-02. This leaves the stale files on node-01. When condor_q is run on node-01 it will fail to contact the Schedd on node-02 because it checks the stale files first.


Version-Release number of selected component (if applicable):

condor-7.2.2-0.1.el5 (and likely all before it)


Additional info:

Two approaches: 1) clean up the stale files; 2) make tools check the stale files and when the information in them fails, fall back to looking up the Schedd in the Collector

(1) is tricky because logical ownership of the files may be ambiguous
(2) may be slow because on the machine with the stale files extra steps will have to be taken to find the Schedd
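
Approach (2) could look roughly like the sketch below. It is an illustration in Python, not Condor code: `read_address_file`, `locate_schedd`, `can_connect`, and `collector_lookup` are hypothetical names, and the sketch assumes the address is the first non-empty line of the file.

```python
from pathlib import Path
from typing import Callable, Optional


def read_address_file(path: Path) -> Optional[str]:
    """Return the first non-empty line of the address file, or None if unusable."""
    try:
        for line in path.read_text().splitlines():
            if line.strip():
                return line.strip()
    except OSError:
        # Missing or unreadable file: treat it the same as "no local hint".
        pass
    return None


def locate_schedd(
    address_file: Path,
    can_connect: Callable[[str], bool],       # hypothetical connectivity probe
    collector_lookup: Callable[[], Optional[str]],  # hypothetical Collector query
) -> Optional[str]:
    """Prefer the local address file; fall back to the Collector if it is stale."""
    address = read_address_file(address_file)
    if address is not None and can_connect(address):
        return address
    # The file is missing or stale, so pay for the extra Collector round trip.
    return collector_lookup()
```

The cost noted in (2) shows up in the last line: on the node holding stale files, every tool invocation first makes a failed connection attempt and only then queries the Collector.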

Comment 1 Matthew Farrellee 2009-05-01 15:50:28 UTC
Resolved for 7.3.1-0.4

commit c33afd1e6de7c57ef8d5252643d9f860b23890f8
Author: Matthew Farrellee <matt>
Date:   Mon Apr 27 15:12:07 2009 -0500

    As part of moving SCHEDD_ADDRESS_FILE and SCHEDD_DAEMON_AD_FILE, update VALID_SPOOL_FILES so PREEN doesn't wipe them out

commit 84afeb8fc5837d79aa1b513b8bde9f77a233b192
Author: Matthew Farrellee <matt>
Date:   Mon Apr 27 10:47:45 2009 -0500

    Moved SCHEDD_ADDRESS_FILE and SCHEDD_DAEMON_AD_FILE from LOG to SPOOL
    
    These two files are dropped by the schedd and are used by local tools,
    and Quill, to locate the Schedd without contacting the Collector. Right
    now they default to -
    
    SCHEDD_ADDRESS_FILE  = $(LOG)/.schedd_address
    SCHEDD_DAEMON_AD_FILE = $(LOG)/.schedd_classad
    
    This is all well and good, except if you are in an HA setup. When you
    have fail-over of the schedd you'll get stale files on the failed schedd
    machine. From that point forward tools, e.g. condor_q/submit, on the
    failed schedd machine will not be able to contact the schedd. The tools
    consult the files and do not fall back to a collector lookup.
    
    Solutions? 1) make the tools fall back to a collector lookup, 2) don't
    use the files at all if you are in an HA schedd setup, 3) put the files
    in SPOOL instead of LOG
    
    (1) is work with little payoff at the moment
    (2) works, but requires separate configuration when in HA mode
    (3) avoids the work of (1), allows for a consistent config over (2), and
    may even benefit from letting multiple machines in a pool avoid the
    collector lookup
    
    Downsides of (3)? Well, the file has been in $(LOG) for a long time,
    along with all other ADDRESS_FILEs, but no one should be relying on that!
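
Why option (3) sidesteps the stale-file problem can be shown with a small simulation. This is a sketch only: it assumes, as the fix implies, that SPOOL is visible to both failover nodes while LOG is local to each node. The directory layout, addresses, and function names are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_address(directory: Path, address: str) -> None:
    """Mimic a schedd dropping its address file at startup."""
    (directory / ".schedd_address").write_text(address + "\n")


def read_address(directory: Path) -> str:
    """Mimic a local tool consulting the address file."""
    return (directory / ".schedd_address").read_text().strip()


def simulate_failover(root: Path) -> dict:
    """Return what a tool on node-01 reads after the schedd moves to node-02."""
    log_node01 = root / "node-01" / "log"      # local to node-01
    log_node02 = root / "node-02" / "log"      # local to node-02
    shared_spool = root / "shared" / "spool"   # assumed visible to both nodes
    for directory in (log_node01, log_node02, shared_spool):
        directory.mkdir(parents=True)

    old, new = "<192.0.2.1:9618>", "<192.0.2.2:9618>"  # made-up addresses

    # The schedd starts on node-01, then crashes without cleaning up.
    write_address(log_node01, old)
    write_address(shared_spool, old)

    # The schedd fails over to node-02 and writes its files again.
    write_address(log_node02, new)
    write_address(shared_spool, new)

    return {
        "LOG on node-01": read_address(log_node01),
        "shared SPOOL": read_address(shared_spool),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for location, address in simulate_failover(Path(tmp)).items():
            print(f"{location}: {address}")
```

In the simulation the copy under node-01's LOG still holds the old address, while the copy under the shared SPOOL was overwritten by the schedd on node-02, so a tool on node-01 reading from SPOOL gets the current address.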

Comment 2 Matthew Farrellee 2009-06-10 15:06:51 UTC
*** Bug 497854 has been marked as a duplicate of this bug. ***

Comment 4 Martin Kudlej 2009-10-22 13:34:27 UTC
I tried it on condor-7.2.2-9 on RHEL 5.4/4.8, on both i386 and x86_64, and it didn't work.
I tried it on condor-7.4.1-0.1 and it works --> VERIFIED
I used the test scenario described in the Description (condor_q on the node where condor_schedd crashed).

Comment 5 Irina Boverman 2009-10-29 14:29:39 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.

Comment 6 Lana Brindley 2009-11-09 01:21:33 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,9 @@
-please see bug summary.
+Grid bug fix
+
+C: Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster
+C: When condor_q was run, it would fail to connect to the schedd, because it was checking the stale files first.
+F: The log files are now stored in SPOOL instead of LOG
+R: Multiple machines in a pool can now read the files, and stale files no longer cause a problem.
+
+
+Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster. This caused condor_q to fail to connect to the schedd. The log files were moved from LOG to SPOOL, which allows multiple machines in a pool to read the files, and stale files no longer cause a problem.

Comment 8 errata-xmlrpc 2009-12-03 09:19:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html