Bug 549389 - condor_master -pidfile will stomp pidfile of running master
Summary: condor_master -pidfile will stomp pidfile of running master
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Luigi Toscano
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-12-21 14:58 UTC by Matthew Farrellee
Modified: 2010-10-14 15:57 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When running "condor_master -pidfile /tmp/master.pid" twice in a row, "/tmp/master.pid" would contain the PID of the second condor_master, the one that exited immediately because it failed to get the 'InstanceLock'. With this update, "/tmp/master.pid" contains the PID of the first, still running condor_master.
Clone Of:
Environment:
Last Closed: 2010-10-14 15:57:41 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2010:0773 | 0 | normal | SHIPPED_LIVE | Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 | 2010-10-14 15:56:44 UTC

Description Matthew Farrellee 2009-12-21 14:58:14 UTC
If you run "condor_master -pidfile /tmp/master.pid" twice in a row, /tmp/master.pid will contain the PID of the second condor_master, the one that exited immediately because it failed to get the InstanceLock.

    To reproduce:

       1. Install and configure Condor so it can run.
       2. condor_master -pidfile /tmp/master.pid
       3. cat /tmp/master.pid - Note the PID.
       4. condor_master -pidfile /tmp/master.pid
       5. cat /tmp/master.pid - Note the PID.
       6. Check MasterLog. Note that the second instance exited immediately, but that its PID matches the PID from step 5. 

    Observed behavior: /tmp/master.pid contains the PID of the second, exited condor_master.

    Expected behavior: /tmp/master.pid contains the PID of the first, still running condor_master.
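Steps 3 through 6 can also be checked mechanically. The following is a minimal sketch, not part of Condor: the helper names (read_pidfile, pid_is_running, pidfile_is_stale) are made up for illustration, and it assumes the pidfile holds a single decimal PID. A pidfile left behind by the second, exited condor_master would be reported as stale.

```python
import os


def read_pidfile(path):
    """Return the PID stored in a pidfile (assumed: one decimal number)."""
    with open(path) as f:
        return int(f.read().strip())


def pid_is_running(pid):
    """Report whether a process with this PID currently exists."""
    try:
        # Signal 0 delivers nothing; it only performs the existence
        # and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def pidfile_is_stale(path):
    """True when the pidfile names a process that is no longer running."""
    return not pid_is_running(read_pidfile(path))
```

For example, pidfile_is_stale("/tmp/master.pid") after step 4 would be expected to return True on an affected version, since the recorded PID belongs to the instance that already exited.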

Remarks:

    2009-May-26 13:55:21 by adesmet:
    #494 is a duplicate of this ticket. Contents duplicated below:

    Condor with --pidfile will write its pid before checking Instance lock

    Condor with --pidfile will write its pid before checking the instance lock. This ends up overwriting the pidfile created by the original condor instance, which will lead to init scripts trying to kill the wrong pid when trying to shut down condor.

    I believe Condor should not write the pidfile until it knows that it is the one true instance.
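The ordering suggested here (take the lock first, write the pidfile only on success) could look roughly like the sketch below. This is an illustration under assumptions, not Condor's code: claim_instance and its parameters are hypothetical, and fcntl.flock stands in for whatever Condor's FileLock does internally.

```python
import fcntl
import os


def claim_instance(lock_path, pid_path, pid=None):
    """Take an exclusive lock, then write the pidfile only if that worked.

    Returns the open lock file on success (keep it open to hold the lock),
    or None if another instance already holds the lock. In the None case
    the existing pidfile is left untouched.
    """
    pid = os.getpid() if pid is None else pid
    # Append mode creates the lock file if needed without truncating it.
    lock_file = open(lock_path, "a")
    try:
        # Non-blocking exclusive lock: fails at once if already held.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    # Only the lock holder gets this far, so only it writes the pidfile.
    with open(pid_path, "w") as f:
        f.write(f"{pid}\n")
    return lock_file
```

The pid parameter exists only so the behavior can be exercised with two different PIDs from one process; a real daemon would simply record its own PID.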

    2009-Dec-21 08:33:00 by matt:
    To reproduce...

    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile
    $ cat PidFile
    11061
    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile -t -f
    ...
    12/21 09:26:59 ** PID = 11088
    ...
    12/21 09:26:59 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
    12/21 09:26:59 ERROR "Can't get lock on "/home/matt/Documents/Condor/CONDOR_SRC/src/condor_master.V6/InstanceLock"" at line 955 in file master.cpp
    $ cat PidFile
    11088
    $ ps 11061 11088
      PID TTY      STAT   TIME COMMAND
    11061 ?        Ss     0:00 ./condor_master -pidfile ...

    2009-Dec-21 08:41:48 by matt:
    Desired output...

    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile
    $ cat PidFile
    12311
    $ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile -t -f
    ...
    12/21 09:40:54 ** PID = 12338
    ...
    12/21 09:40:54 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
    12/21 09:40:54 ERROR "Can't get lock on "/home/matt/Documents/Condor/CONDOR_SRC/src/condor_master.V6/InstanceLock"" at line 955 in file master.cpp
    $ cat PidFile
    12311

    2009-Dec-21 08:55:09 by matt:
    FYI, a problem I'm not fixing is that TRUNC_MASTER_LOG_ON_OPEN=TRUE will trash the MASTER_LOG before the master gets a chance to check the INSTANCE_LOCK.
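The log truncation problem has the same shape as the pidfile one: a destructive write happens before the lock check. A hedged sketch of the safer ordering follows. It is not a patch for Condor; open_log_if_sole_instance is a hypothetical helper, its truncate flag only mimics the role of TRUNC_MASTER_LOG_ON_OPEN, and fcntl.flock is an assumed stand-in for the real lock.

```python
import fcntl


def open_log_if_sole_instance(lock_path, log_path, truncate=False):
    """Open the log only after the instance lock is held.

    Returns (lock_file, log_file) on success, or (None, None) when another
    instance holds the lock, in which case the log is never opened and so
    cannot be truncated.
    """
    lock_file = open(lock_path, "a")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None, None
    # Truncation ("w") is safe here because this process owns the lock.
    log_file = open(log_path, "w" if truncate else "a")
    return lock_file, log_file
```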

Comment 1 Matthew Farrellee 2009-12-21 14:59:11 UTC
This is an issue through at least 7.4.1-0.7.1

Comment 2 Matthew Farrellee 2010-01-04 18:24:50 UTC
Fixed in 7.4.2-0.1

Comment 3 Luigi Toscano 2010-06-01 17:41:54 UTC
A new instance of condor_master no longer overwrites the pidfile when condor_master is already running.

Verified on condor 7.4.3-0.16, RHEL 4.8/5.5, i386/x86_64.

Comment 4 Martin Prpič 2010-10-07 15:16:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When running "condor_master -pidfile /tmp/master.pid" twice in a row, "/tmp/master.pid" would contain the PID of the second condor_master, the one that exited immediately because it failed to get the 'InstanceLock'. With this update, "/tmp/master.pid" contains the PID of the first, still running condor_master.

Comment 6 errata-xmlrpc 2010-10-14 15:57:41 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

