Bug 703426 - dangerous (stale) pid handling in init scripts
Summary: dangerous (stale) pid handling in init scripts
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: initscripts
Version: 5.6
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Lukáš Nykrýn
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-05-10 10:35 UTC by Bernd Schubert
Modified: 2013-03-11 13:42 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-11 12:38:49 UTC
Target Upstream Version:


Attachments (Terms of Use)
Improved start stop functions (15.61 KB, application/octet-stream)
2013-03-11 13:24 UTC, Bernd Schubert
no flags Details

Description Bernd Schubert 2011-05-10 10:35:25 UTC
Description of problem:

While I'm just improving our FhGFS init scripts, I also checked what your functions in /etc/rc.d/init.d/functions are doing if the pid is provided. And in fact it seems, those RedHat functions are rather broken when it comes to possible stale pids in /var/run.
Given by the default max pid limit of 32768, running into stale pids might easily happen.

Please look into your /etc/rc.d/init.d/functions:

__pids_var_run() {
        local base=${1##*/}
        local pid_file=${2:-/var/run/$base.pid}

        pid=
        if [ -f "$pid_file" ] ; then
                local line p
                read line < "$pid_file"
                for p in $line ; do
                        [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && pid="$pid $p"
                done
                if [ -n "$pid" ]; then
                        return 0
                fi
                return 1 # "Program is dead and /var/run pid file exists"
        fi
        return 3 # "Program is not running"
}


As you can see, it does not check if $base matches the name in /proc/$pid/

(as for example given by /proc/$pid/{exe,stat,status}).


Now in killproc():

        __pids_var_run "$1" "$pid_file"
        if [ -z "$pid_file" -a -z "$pid" ]; then
                pid="$(__pids_pidof "$1")"
        fi

So if  __pids_var_run() finds a pid in /var/run/$PIDFILE and sees that pid is in use, killproc will happily use that pid without even further checking if this pid really belongs to the right daemon! But as I said before, it might easily happen another process takes that PID and then this process would be killed!

The same applies to daemon() and status(), daemon() would not continue to start the daemon if there is a stale pid file and another process has taken that pid and status() would simply return "ok" for a non-running pid.


Version-Release number of selected component (if applicable):

5.6

How reproducible:

1) start a $daemon and let it write to /var/pid/some_pid_file

2) kill -9 $daemon

3) start other threads (just something so that many PIDs are taken, e.g. compilization or NFSD with thousands of threads). Do that until something is running with the same PID as in 1) (/var/pid/some_pid_file)

4) Run status/start/stop of our daemon, the scripts will do the wrong action.

Comment 1 Bernd Schubert 2011-05-10 11:18:10 UTC
Same issue under RHEL6

Comment 2 Lukáš Nykrýn 2013-03-11 12:38:49 UTC
I don't think that we can make proper fix for this in RHEL5. Best solution is track processes according to their cgroup not by their PID. This will be used in RHEL7.

Comment 3 Bernd Schubert 2013-03-11 13:24:45 UTC
Created attachment 708369 [details]
Improved start stop functions

Have you ever looked how much CPU time cgroups take? Enabling it on HPC or storage server systems is entirely wrong. 
For fhgfs we are shipping a modified version of your scripts. I'm going to attach it here. It checks if the pid matches the daemon name. At least that is an improvement to your version and not exactly "CANTFIX".

If you want to have a version in diff format - just run 'diff' on your own against the RHEL5 version please, I don't have a running RHEL5 available.

Comment 4 Lukáš Nykrýn 2013-03-11 13:42:23 UTC
(In reply to comment #3)
> Created attachment 708369 [details]
> Improved start stop functions
> 
> Have you ever looked how much CPU time cgroups take? Enabling it on HPC or
> storage server systems is entirely wrong. 

You don't need kernel compiled with cgroups controllers support for tracking proccesses, you need only CONFIG_CGROUPS=y allowed. This shouldn't be performance sensitive (for more info see http://0pointer.de/blog/projects/cgroups-vs-cgroups.html)

> For fhgfs we are shipping a modified version of your scripts. I'm going to
> attach it here. It checks if the pid matches the daemon name. At least that
> is an improvement to your version and not exactly "CANTFIX".
> 
> If you want to have a version in diff format - just run 'diff' on your own
> against the RHEL5 version please, I don't have a running RHEL5 available.

Yes current behavior can be improved, but I don't think that there is a solution which will completely solve this issue.


Note You need to log in before you can comment on or make changes to this bug.