If you run "condor_master -pidfile /tmp/master.pid" twice in a row, /tmp/master.pid will contain the PID of the second condor_master, the one that exited immediately because it failed to get the InstanceLock.

To reproduce:
1. Install and configure Condor so it can run.
2. condor_master -pidfile /tmp/master.pid
3. cat /tmp/master.pid - Note the PID.
4. condor_master -pidfile /tmp/master.pid
5. cat /tmp/master.pid - Note the PID.
6. Check MasterLog. Note that the second instance exited immediately, but that its PID matches the PID from step 5.

Observed behavior: /tmp/master.pid contains the PID of the second, exited condor_master.
Expected behavior: /tmp/master.pid contains the PID of the first, still running condor_master.

Remarks:

2009-May-26 13:55:21 by adesmet:
#494 is a duplicate of this ticket. Contents duplicated below:

Condor with --pidfile will write its pid before checking Instance lock

This ends up writing over the pidfile created by the original condor instance. This will lead to init scripts trying to kill the wrong pid when trying to shut down condor. I believe Condor should not write the pidfile until it knows that it is the one true instance.

2009-Dec-21 08:33:00 by matt:
To reproduce...

$ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock condor_master -pidfile $PWD/PidFile
$ cat PidFile
11061
$ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile -t -f
...
12/21 09:26:59 ** PID = 11088
...
12/21 09:26:59 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
12/21 09:26:59 ERROR "Can't get lock on "/home/matt/Documents/Condor/CONDOR_SRC/src/condor_master.V6/InstanceLock"" at line 955 in file master.cpp
$ cat PidFile
11088
$ ps 11061 11088
PID TTY STAT TIME COMMAND
11061 ? Ss 0:00 ./condor_master -pidfile ...

2009-Dec-21 08:41:48 by matt:
Desired output...
$ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile
$ cat PidFile
12311
$ _CONDOR_LOG=$PWD _CONDOR_MASTER_INSTANCE_LOCK=$PWD/InstanceLock ./condor_master -pidfile $PWD/PidFile -t -f
...
12/21 09:40:54 ** PID = 12338
...
12/21 09:40:54 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
12/21 09:40:54 ERROR "Can't get lock on "/home/matt/Documents/Condor/CONDOR_SRC/src/condor_master.V6/InstanceLock"" at line 955 in file master.cpp
$ cat PidFile
12311

2009-Dec-21 08:55:09 by matt:
FYI, a problem I'm not fixing is that TRUNC_MASTER_LOG_ON_OPEN=TRUE will trash the MASTER_LOG before the master gets a chance to check the INSTANCE_LOCK.
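
For illustration only, the ordering requested in the remarks above (take the instance lock first, write the pidfile only once the lock is held) can be sketched in Python. This is not the condor_master code, which is C++ and uses its own FileLock class; the function name start_instance and the use of flock() are assumptions made for this sketch.

```python
import errno
import fcntl
import os


def start_instance(lock_path, pid_path):
    """Hypothetical helper: return the lock fd if we are the only instance.

    Returns None, leaving the pidfile untouched, when another holder
    already has the lock.
    """
    lock_fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock; fails at once if already held.
        fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(lock_fd)
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            # Same condition as "errno 11" in the log above: someone
            # else is the running instance, so do not touch the pidfile.
            return None
        raise
    # Only now is it safe to record our PID.
    with open(pid_path, "w") as pid_file:
        pid_file.write(f"{os.getpid()}\n")
    # The caller must keep lock_fd open for the life of the process.
    return lock_fd
```

With this ordering, a second start attempt fails at the lock step and exits before the pidfile is opened for writing, so the file keeps the PID of the first instance.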
This is an issue through at least 7.4.1-0.7.1
Fixed in 7.4.2-0.1
The new instance of condor_master does not overwrite the pidfile anymore if condor_master is already running. Verified on condor 7.4.3-0.16, RHEL 4.8/5.5, i386/x86_64.
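
A check along these lines can be scripted. The Python sketch below is a hypothetical helper, not part of Condor or of the verification actually performed; it reads a pidfile and reports whether the recorded PID refers to a live process by sending signal 0. The function name pidfile_process_alive is an assumption.

```python
import os


def pidfile_process_alive(pid_path):
    """Hypothetical helper: True if the PID in pid_path is a live process."""
    try:
        with open(pid_path) as pid_file:
            pid = int(pid_file.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pidfile.
        return False
    if pid <= 0:
        # Non-positive values address process groups, not one process.
        return False
    try:
        # Signal 0 performs the existence check without delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Before the fix, such a check would report False after a second start attempt, because the pidfile held the PID of the instance that had already exited.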
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When running "condor_master -pidfile /tmp/master.pid" twice in a row, "/tmp/master.pid" would contain the PID of the second condor_master, the one that exited immediately because it failed to get the 'InstanceLock'. With this update, "/tmp/master.pid" contains the PID of the first, still running condor_master.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html