Bug 549389

Summary: condor_master -pidfile will stomp pidfile of running master
Product: Red Hat Enterprise MRG Reporter: Matthew Farrellee <matt>
Component: condor Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA QA Contact: Luigi Toscano <ltoscano>
Severity: medium Docs Contact:
Priority: low    
Version: 1.0 CC: ltoscano
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
When running "condor_master -pidfile /tmp/master.pid" twice in a row, "/tmp/master.pid" would contain the PID of the second condor_master, the one that exited immediately because it failed to get the 'InstanceLock'. With this update, "/tmp/master.pid" contains the PID of the first, still running condor_master.
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-14 15:57:41 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2009-12-21 14:58:14 UTC
If you run "condor_master -pidfile /tmp/master.pid" twice in a row, /tmp/master.pid will contain the PID of the second condor_master, the one that exited immediately because it failed to get the InstanceLock.

    To reproduce:

       1. Install and configure Condor so it can run.
       2. condor_master -pidfile /tmp/master.pid
       3. cat /tmp/master.pid - Note the PID.
       4. condor_master -pidfile /tmp/master.pid
       5. cat /tmp/master.pid - Note the PID.
       6. Check MasterLog. Note that the second instance exited immediately, but that its PID matches the PID from step 5. 

    Observed behavior: /tmp/master.pid contains the PID of the second, exited condor_master.

    Expected behavior: /tmp/master.pid contains the PID of the first, still running condor_master.
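
The check in steps 3 through 6 can be automated by asking whether the PID stored in the pidfile still belongs to a live process. The following is a minimal sketch in Python, not part of Condor; the helper names `pid_is_running` and `pidfile_is_stale` are hypothetical, and the sketch assumes the pidfile holds a single decimal PID.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        # Signal 0 performs the existence/permission check only;
        # no signal is actually delivered.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def pidfile_is_stale(path: str) -> bool:
    """Return True if the PID recorded in `path` is not a running process."""
    with open(path) as f:
        pid = int(f.read().strip())
    return not pid_is_running(pid)
```

With the bug present, calling `pidfile_is_stale("/tmp/master.pid")` after step 4 would be expected to return True, because the file names the second master, which has already exited. Note that a liveness check alone cannot tell whether a live PID is actually condor_master.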

Remarks:

    2009-May-26 13:55:21 by adesmet:
    #494 is a duplicate of this ticket. Contents duplicated below:

    Condor with --pidfile will write its pid before checking Instance lock

    Condor with --pidfile will write its pid before checking the Instance lock. This ends up overwriting the pidfile created by the original condor instance, which will lead to init scripts trying to kill the wrong pid when trying to shut down condor.

    I believe Condor should not write the pidfile until it knows that it is the one true instance.
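
The ordering proposed here (take the lock first, write the pidfile only on success) can be illustrated with a short Python sketch. This is not Condor's code: the function name `claim_instance` is hypothetical, and the sketch uses `flock` on Linux, which may differ from the locking mechanism behind Condor's FileLock.

```python
import fcntl
import os


def claim_instance(lock_path: str, pid_path: str):
    """Take the instance lock first; write the pidfile only on success.

    Returns the open lock file (keep a reference to it for the life of
    the process, since closing it releases the lock), or None if another
    instance already holds the lock.
    """
    lock_file = open(lock_path, "a")
    try:
        # Non-blocking exclusive lock: fails immediately with EAGAIN
        # (errno 11) if another open file description holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None  # Leave the existing pidfile untouched.
    with open(pid_path, "w") as f:
        f.write(f"{os.getpid()}\n")
    return lock_file
```

A second caller that cannot get the lock returns before it ever opens the pidfile, so the first instance's PID survives.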

    2009-Dec-21 08:33:00 by matt:
    To reproduce...

    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock condor_master -pidfile $PWD/PidFile
    $ cat PidFile
    11061
    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile -t -f
    ...
    12/21 09:26:59 ** PID = 11088
    ...
    12/21 09:26:59 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
    12/21 09:26:59 ERROR "Can't get lock on "/home/matt/Documents/Condor/CONDOR_SRC/src/condor_master.V6/InstanceLock"" at line 955 in file master.cpp
    $ cat PidFile
    11088
    $ ps 11061 11088
      PID TTY      STAT   TIME COMMAND
    11061 ?        Ss     0:00 ./condor_master -pidfile ...

    2009-Dec-21 08:41:48 by matt:
    Desired output...

    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile
    $ cat PidFile
    12311
    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile -t -f
    ...
    12/21 09:40:54 ** PID = 12338
    ...
    12/21 09:40:54 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
    12/21 09:40:54 ERROR "Can't get lock on "/home/matt/Documents/Condor/CONDOR_SRC/src/condor_master.V6/InstanceLock"" at line 955 in file master.cpp
    $ cat PidFile
    12311

    2009-Dec-21 08:55:09 by matt:
    FYI, a problem I'm not fixing is TRUNC_MASTER_LOG_ON_OPEN=TRUE will trash the MASTER_LOG before the master gets a chance to check the INSTANCE_LOCK.

Comment 1 Matthew Farrellee 2009-12-21 14:59:11 UTC
This is an issue through at least 7.4.1-0.7.1

Comment 2 Matthew Farrellee 2010-01-04 18:24:50 UTC
Fixed in 7.4.2-0.1

Comment 3 Luigi Toscano 2010-06-01 17:41:54 UTC
The new instance of condor_master does not overwrite the pidfile anymore if condor_master is already running. 

Verified on condor 7.4.3-0.16, RHEL 4.8/5.5, i386/x86_64.

Comment 4 Martin Prpič 2010-10-07 15:16:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When running "condor_master -pidfile /tmp/master.pid" twice in a row, "/tmp/master.pid" would contain the PID of the second condor_master, the one that exited immediately because it failed to get the 'InstanceLock'. With this update, "/tmp/master.pid" contains the PID of the first, still running condor_master.

Comment 6 errata-xmlrpc 2010-10-14 15:57:41 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html