Bug 486487 - Stale .schedd_address and .schedd_classad
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.2
Target Release: ---
Assigned To: Matthew Farrellee
QA Contact: Martin Kudlej
Duplicates: 497854
Depends On:
Blocks: 527551
Reported: 2009-02-19 18:21 EST by Matthew Farrellee
Modified: 2009-12-03 04:19 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster
C: When condor_q was run, it would fail to connect to the schedd, because it was checking the stale files first.
F: The log files are now stored in SPOOL instead of LOG
R: Multiple machines in a pool can now read the files, and stale files no longer cause a problem.

Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster. This caused condor_q to fail to connect to the schedd. The log files were moved from LOG to SPOOL, which allows multiple machines in a pool to read the files, and stale files no longer cause a problem.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-03 04:19:27 EST




External Trackers
Tracker: Red Hat Product Errata
ID: RHEA-2009:1633
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid Version 1.2
Last Updated: 2009-12-03 04:15:33 EST

Description Matthew Farrellee 2009-02-19 18:21:05 EST
Description of problem:

The Schedd writes .schedd_address and .schedd_classad files to the LOG directory. Tools on the host (condor_q, condor_submit, etc.) use these files to contact the Schedd instead of looking it up in the Collector; the Collector lookup happens only when the files aren't present.

If the Schedd crashes, these files are not removed. In an HA Scheduler setup, a crashed Schedd on node-01 may be failed over to run on node-02, which leaves the stale files on node-01. When condor_q is then run on node-01, it fails to contact the Schedd on node-02 because it checks the stale files first.


Version-Release number of selected component (if applicable):

condor-7.2.2-0.1.el5 (and likely all before it)


Additional info:

Two approaches: 1) clean up the stale files; 2) make tools check the files and, when the information in them fails, fall back to looking up the Schedd in the Collector

(1) is tricky because logical ownership of the files may be ambiguous
(2) may be slow because on the machine with the stale files extra steps will have to be taken to find the Schedd
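
For illustration only, approach (2) could look roughly like the sketch below. It is not Condor code: read_address_file, locate_schedd, can_connect, collector_lookup and the 192.0.2.x addresses are all hypothetical names and values, and it assumes the Schedd's address sits on the first line of the address file.

```python
import os
import tempfile
from typing import Callable, Optional


def read_address_file(path: str) -> Optional[str]:
    """Return the first line of the address file, or None if missing or empty."""
    try:
        with open(path) as f:
            first_line = f.readline().strip()
    except OSError:
        return None
    return first_line or None


def locate_schedd(
    address_file: str,
    can_connect: Callable[[str], bool],
    collector_lookup: Callable[[], Optional[str]],
) -> Optional[str]:
    """Try the cached address first; fall back to the Collector if it is stale."""
    cached = read_address_file(address_file)
    if cached is not None and can_connect(cached):
        return cached
    # Extra step taken only when the file is absent or its address is dead.
    return collector_lookup()


def demo() -> Optional[str]:
    """node-01 still holds a stale file; the Schedd now runs elsewhere."""
    stale = "<192.0.2.1:9618>"  # hypothetical address of the crashed Schedd
    live = "<192.0.2.2:9618>"   # hypothetical address after fail-over
    with tempfile.TemporaryDirectory() as log_dir:
        path = os.path.join(log_dir, ".schedd_address")
        with open(path, "w") as f:
            f.write(stale + "\n")
        return locate_schedd(
            path,
            can_connect=lambda addr: addr == live,  # stand-in for a real connection attempt
            collector_lookup=lambda: live,          # stand-in for a Collector query
        )


if __name__ == "__main__":
    print(demo())
```

The failed connection attempt before the Collector query is the extra cost that (2) mentions for machines holding stale files.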
Comment 1 Matthew Farrellee 2009-05-01 11:50:28 EDT
Resolved for 7.3.1-0.4

commit c33afd1e6de7c57ef8d5252643d9f860b23890f8
Author: Matthew Farrellee <matt@redhat.com>
Date:   Mon Apr 27 15:12:07 2009 -0500

    As part of moving SCHEDD_ADDRESS_FILE and SCHEDD_DAEMON_AD_FILE, update VALID_SPOOL_FILES so PREEN doesn't wipe them out

commit 84afeb8fc5837d79aa1b513b8bde9f77a233b192
Author: Matthew Farrellee <matt@redhat.com>
Date:   Mon Apr 27 10:47:45 2009 -0500

    Moved SCHEDD_ADDRESS_FILE and SCHEDD_DAEMON_AD_FILE from LOG to SPOOL
    
    These two files are dropped by the schedd and are used by local tools,
    and Quill, to locate the Schedd without contacting the Collector. Right
    now they default to -
    
    SCHEDD_ADDRESS_FILE  = $(LOG)/.schedd_address
    SCHEDD_DAEMON_AD_FILE = $(LOG)/.schedd_classad
    
    This is all well and good, except if you are in an HA setup. When you
    have fail-over of the schedd you'll get stale files on the failed schedd
    machine. From that point forward tools, e.g. condor_q/submit, on the
    failed schedd machine will not be able to contact the schedd. The tools
    consult the files and do not fall back to a collector lookup.
    
    Solutions? 1) make the tools fall back to a collector lookup, 2) don't
    use the files at all if you are in an HA schedd setup, 3) put the files
    in SPOOL instead of LOG
    
    (1) is work with little payoff at the moment
    (2) works, but requires separate configuration when in HA mode
    (3) avoids the work of (1), allows for a consistent config over (2), and
    may even benefit from letting multiple machines in a pool avoid the
    collector lookup
    
    Downsides of (3)? Well, the file has been in $(LOG) for a long time,
    along with all other ADDRESS_FILEs, but no one should be relying on that!
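
A toy simulation of why option (3) sidesteps the stale file, offered as a sketch rather than a description of Condor internals. It assumes that LOG is local to each node while SPOOL is shared between the HA nodes; the function names and the 192.0.2.x addresses are hypothetical.

```python
import os
import tempfile
from typing import Optional

ADDRESS_FILE_NAME = ".schedd_address"
NODE01_ADDR = "<192.0.2.1:9618>"  # hypothetical address of the Schedd on node-01
NODE02_ADDR = "<192.0.2.2:9618>"  # hypothetical address of the Schedd on node-02


def start_schedd(address_dir: str, address: str) -> None:
    """Mimic a Schedd dropping its address file at startup."""
    with open(os.path.join(address_dir, ADDRESS_FILE_NAME), "w") as f:
        f.write(address + "\n")


def tool_view(address_dir: str) -> Optional[str]:
    """Return the address a local tool would read, or None if no file exists."""
    try:
        with open(os.path.join(address_dir, ADDRESS_FILE_NAME)) as f:
            return f.readline().strip()
    except FileNotFoundError:
        return None


def simulate_failover(shared_spool: bool) -> Optional[str]:
    """Return what a tool on node-01 reads after the Schedd moves to node-02."""
    with tempfile.TemporaryDirectory() as root:
        dirs = {}
        for node in ("node-01", "node-02"):
            if shared_spool:
                # Assumption: both nodes see the same SPOOL directory.
                path = os.path.join(root, "spool")
            else:
                # Assumption: each node has its own local LOG directory.
                path = os.path.join(root, node, "log")
            os.makedirs(path, exist_ok=True)
            dirs[node] = path

        start_schedd(dirs["node-01"], NODE01_ADDR)
        # The Schedd on node-01 crashes here without removing its file,
        # and the HA setup restarts it on node-02.
        start_schedd(dirs["node-02"], NODE02_ADDR)
        return tool_view(dirs["node-01"])


if __name__ == "__main__":
    print("files in LOG:  ", simulate_failover(shared_spool=False))
    print("files in SPOOL:", simulate_failover(shared_spool=True))
```

With per-node directories the tool on node-01 keeps reading the old address; with a shared directory the restarted Schedd overwrites the file, so every node reads the current one.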
Comment 2 Matthew Farrellee 2009-06-10 11:06:51 EDT
*** Bug 497854 has been marked as a duplicate of this bug. ***
Comment 4 Martin Kudlej 2009-10-22 09:34:27 EDT
I tried it on condor-7.2.2-9 on RHEL 5.4/4.8 and i386/x86_64, and it didn't work.
I tried it on condor-7.4.1-0.1 and it works --> VERIFIED
I used the testing scenario described in the Description (condor_q on the node where condor_schedd crashed).
Comment 5 Irina Boverman 2009-10-29 10:29:39 EDT
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
please see bug summary.
Comment 6 Lana Brindley 2009-11-08 20:21:33 EST
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,9 @@
-please see bug summary.
+Grid bug fix
+
+C: Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster
+C: When condor_q was run, it would fail to connect to the schedd, because it was checking the stale files first.
+F: The log files are now stored in SPOOL instead of LOG
+R: Multiple machines in a pool can now read the files, and stale files no longer cause a problem.
+
+
+Stale .schedd_address and .schedd_classad files were being left on a host when the schedd failed in a high availability cluster. This caused condor_q to fail to connect to the schedd. The log files were moved from LOG to SPOOL, which allows multiple machines in a pool to read the files, and stale files no longer cause a problem.
Comment 8 errata-xmlrpc 2009-12-03 04:19:27 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html
