Bug 471902

Summary: condor_master abandons pidfile on -restart
Product: Red Hat Enterprise MRG
Component: grid
Version: 1.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Milestone: 1.1
Hardware: All
OS: Linux
Reporter: Matthew Farrellee <matt>
Assignee: Matthew Farrellee <matt>
QA Contact: Kim van der Riet <kim.vdriet>
CC: dan, pmackinn
Doc Type: Bug Fix
Last Closed: 2009-02-04 16:05:39 UTC

Description Matthew Farrellee 2008-11-17 16:28:17 UTC
Description of problem:

The master appears to forget about its pidfile when it is restarted with condor_restart -master.


Version-Release number of selected component (if applicable):

7.2.0-0.1 and before


How reproducible:

Always


Steps to Reproduce:
1. run condor_master via init script
2. run condor_restart -master
3. notice that the pidfile remains but ps auxwww | grep condor_master no longer does
4. when the master is shut down, the pidfile is not cleaned up
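
Step 4 leaves behind a pidfile that names a process which no longer exists. A minimal sketch of how such an abandoned pidfile could be detected, assuming a POSIX system; the helper name pidfile_is_stale is made up for illustration and is not part of Condor or its init script:

```python
import os


def pidfile_is_stale(path):
    """Return True if `path` exists but does not name a running process."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False  # no pidfile at all, so nothing is stale
    except ValueError:
        return True   # unparsable contents cannot name a live process
    if pid <= 0:
        return True   # 0 and negative values address process groups, not a daemon
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return True   # no such process: the pidfile was left behind
    except PermissionError:
        return False  # the process exists but belongs to another user
    return False
```

On a machine set up as in this report, the check would be called as pidfile_is_stale("/var/lib/condor/condor_master.pid") after stopping the service. It cannot tell whether a live PID really belongs to condor_master, since PIDs are reused.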

Comment 1 Dan Bradley 2008-11-20 22:25:45 UTC
This has been fixed for Condor 7.2.0.  --Dan

Comment 2 Matthew Farrellee 2008-11-20 22:31:31 UTC
Thanks Dan.

The fix will appear in 7.2.0-0.3

Comment 4 Pete MacKinnon 2008-12-11 18:12:42 UTC
[root@north-13 ~]# ll /var/lib/condor/condor_master.pid 
ls: /var/lib/condor/condor_master.pid: No such file or directory
[root@north-13 ~]# service condor start
Starting Condor daemons:                                   [  OK  ]
[root@north-13 ~]# ll /var/lib/condor/condor_master.pid 
-rw-r--r-- 1 condor condor 6 Dec 11 13:10 /var/lib/condor/condor_master.pid
[root@north-13 ~]# condor_restart -master
Sent "Restart" command to local master
[root@north-13 ~]# ll /var/lib/condor/condor_master.pid 
-rw-r--r-- 1 condor condor 6 Dec 11 13:10 /var/lib/condor/condor_master.pid
[root@north-13 ~]# ps auxwww | grep condor_master
condor   13555  0.0  0.0 113596  7108 ?        Ssl  13:10   0:00 condor_master -pidfile /var/lib/condor/condor_master.pid
root     13806  0.0  0.0  61172   728 pts/1    S+   13:11   0:00 grep condor_master
[root@north-13 ~]# service condor stop
Stopping Condor daemons:                                   [  OK  ]
[root@north-13 ~]# ps auxwww | grep condor_master
root     13824  0.0  0.0  61172   728 pts/1    S+   13:12   0:00 grep condor_master
[root@north-13 ~]# ll /var/lib/condor/condor_master.pid 
ls: /var/lib/condor/condor_master.pid: No such file or directory
[root@north-13 ~]#

Comment 5 Pete MacKinnon 2008-12-11 18:39:26 UTC
The pidfile-abandoning issue comes about when the master restarts itself.

The master monitors the on-disk executables of the daemons in DAEMON_LIST. When one has been modified, the master re-execs that daemon. It even does this for itself. In the past it did not keep its argv when re-execing, so the new image lost the -pidfile argument and did not clean up the pidfile. So there is still more testing to do on 471902.
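
The argv point above can be illustrated with a toy program that re-execs itself once, either passing its original arguments along or dropping them. This is only a sketch under assumptions: it is not Condor code, the script name toy_master.py and the environment variables REEXECED and KEEP_ARGV are invented for the demonstration, and it needs a POSIX system.

```python
import os
import subprocess
import sys
import tempfile

# Toy "master": on first start it re-execs itself once, with or without argv.
SCRIPT = '''\
import os, sys
if os.environ.get("REEXECED") == "1":
    # Second process image: report which arguments survived the exec.
    print(" ".join(sys.argv[1:]))
else:
    os.environ["REEXECED"] = "1"
    keep = os.environ.get("KEEP_ARGV") == "1"
    argv = [sys.executable, sys.argv[0]] + (sys.argv[1:] if keep else [])
    os.execv(sys.executable, argv)
'''


def args_after_reexec(args, keep_argv):
    """Run the toy master and return the arguments seen after its re-exec."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_master.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        env = dict(os.environ, KEEP_ARGV="1" if keep_argv else "0")
        env.pop("REEXECED", None)  # make sure the first image does the exec
        result = subprocess.run(
            [sys.executable, path] + list(args),
            env=env, capture_output=True, text=True, check=True,
        )
    return result.stdout.split()


args = ["-pidfile", "/var/lib/condor/condor_master.pid"]
print("argv kept:   ", args_after_reexec(args, keep_argv=True))
print("argv dropped:", args_after_reexec(args, keep_argv=False))
```

When argv is dropped, the second image has no way to know which pidfile it owns, which matches the behaviour described in this comment.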

Comment 6 Pete MacKinnon 2008-12-11 22:14:42 UTC
Everything seems OK. The PID file gets cleaned up after a self-restart followed by a shutdown sometime later. However, the PID never changes, since the master is restarted with execv.

"Like all of the exec functions, execv replaces the calling process image with a new process image. This has the effect of running a new program with the process ID of the calling process. Note that a new process is not started; the new process image simply overlays the original process image. The execv function is most commonly used to overlay a process image that has been created by a call to the fork function."
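
The quoted behaviour is easy to check independently of Condor. The following sketch, assuming a POSIX system, starts a child that prints its PID, replaces itself with os.execv, and prints its PID again from the new image; the function name pids_across_exec is made up for this example.

```python
import subprocess
import sys

# Code that runs in the new process image, after the exec.
AFTER_EXEC = "import os; print(os.getpid())"

# Code that runs first: report the PID, then replace the process image.
# flush=True matters, because unflushed output would be lost across the exec.
BEFORE_EXEC = (
    "import os, sys\n"
    "print(os.getpid(), flush=True)\n"
    f"os.execv(sys.executable, [sys.executable, '-c', {AFTER_EXEC!r}])\n"
)


def pids_across_exec():
    """Return (pid_before, pid_after) for a child that execv's itself."""
    out = subprocess.run(
        [sys.executable, "-c", BEFORE_EXEC],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return int(out[0]), int(out[1])


before, after = pids_across_exec()
print("same PID across execv:", before == after)
```

This is why the contents of condor_master.pid stay at 17079 in the transcript below even though the master restarted itself.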

[root@north-13 condor]# cat condor_master.pid
17079
[root@north-13 condor]# !ps
ps -ef | grep condor_master
condor   17079     1  0 16:25 ?        00:00:00 condor_master -pidfile /var/lib/condor/condor_master.pid
[root@north-13 condor]# grep modified log/MasterLog 
12/11 16:31:00 /usr/sbin/condor_master was modified, restarting /usr/sbin/condor_master.
[root@north-13 condor]# !ps
ps -ef | grep condor_master
condor   17079     1  0 16:25 ?        00:00:00 condor_master -pidfile /var/lib/condor/condor_master.pid
[root@north-13 condor]# !cat
cat condor_master.pid
17079
[root@north-13 condor]# service condor stop
Stopping Condor daemons:                                   [  OK  ]
[root@north-13 condor]# ll
total 56
-rw-r--r-- 1 root   root    400 Nov 14 11:43 condor_config.local
-rw-r--r-- 1 root   root    106 Nov 24 18:07 condor_config.overrides
-rw-r--r-- 1 root   root     61 Nov 21 23:07 condor_config.overrides~
drwxr-xr-x 2 condor condor 4096 Dec 11 16:33 execute
drwxr-xr-x 2 root   root   4096 Dec  4 15:21 feature_configs
drwxr-xr-x 2 condor condor 4096 Dec 11 16:43 log
drwxr-xr-x 2 condor condor 4096 Dec 10 09:29 spool

Comment 8 errata-xmlrpc 2009-02-04 16:05:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html