Red Hat Bugzilla – Bug 835838
Process does not exist, but "/bin/kill -0" and "kill -0" return 0
Last modified: 2012-06-30 13:57:26 EDT
Created attachment 594724
Description of problem:
A process with PID 6156 has been forcefully terminated. The process is not in the process list.
However, "kill -0" insists that the process exists and does so as long as a SIGTERM has not been sent.
See bash session attached.
Version-Release number of selected component (if applicable):
Not reproduced yet.
No invisible process
The kill command is a bit tricky. It exists in two separate upstream sources (procps and util-linux) and it's currently disabled in the procps build to avoid conflicts. As it's shipped with util-linux only, I'm going to change the component to util-linux.
As you can see from your strace output, there is no issue with kill(1).
It seems that the 6156 in your example is a thread ID (TID), and a TID
can be passed to the kill(2) syscall.
Threads are not included in the "ls /proc" output (readdir() does not return them)
to avoid performance problems on systems with a huge number
of threads, but you can address a thread directly (e.g. ls /proc/<TID>).
For example, gnome-terminal with four threads:
$ ps -eLf | grep 2554
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
kzak      2554  1994  2554  0    4 Jun27 ?        00:00:02 gnome-terminal
kzak      2554  1994  2557  0    4 Jun27 ?        00:00:00 gnome-terminal
kzak      2554  1994  2558  0    4 Jun27 ?        00:00:00 gnome-terminal
kzak      2554  1994  2561  0    4 Jun27 ?        00:00:00 gnome-terminal
The important column is LWP (thread ID).
$ ls /proc/2554/task/
2554 2557 2558 2561
Let's play with thread 2557:
$ cat /proc/2554/task/2557/comm
$ ls /proc | grep 2557 # <<< nothing !
but the task is accessible from the top-level /proc directory if
the full path is specified:
$ cat /proc/2557/comm
... and now kill:
$ strace -e kill kill -0 2557
kill(2557, SIG_0) = 0
+++ exited with 0 +++
Success. Not a bug from my point of view.
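The same observation can be reproduced without gnome-terminal. The following is only a minimal sketch, assuming Linux with /proc mounted and Python 3.8 or later; the function name `probe_thread_visibility` and the result keys are hypothetical and not part of any tool mentioned in this bug. It starts one helper thread, then checks whether that thread's TID shows up in the /proc listing, whether /proc/<TID> can be addressed directly, and whether signal 0 is accepted for it.

```python
import os
import threading


def probe_thread_visibility():
    """Start a helper thread and report how its TID looks to /proc and kill(2).

    Hypothetical helper for illustration only.
    """
    result = {}
    ready = threading.Event()
    done = threading.Event()

    def worker():
        # On Linux this is the kernel thread ID (the LWP column in ps -eLf).
        result["tid"] = threading.get_native_id()
        ready.set()
        done.wait()  # stay alive while the main thread probes

    helper = threading.Thread(target=worker)
    helper.start()
    ready.wait()
    tid = result["tid"]
    try:
        # Equivalent of: ls /proc | grep <TID>
        try:
            entries = os.listdir("/proc")
        except OSError:
            entries = None  # no /proc here (assumption: not a Linux system)
        result["listed_in_proc"] = None if entries is None else str(tid) in entries

        # Equivalent of: ls /proc/<TID>
        result["direct_path_exists"] = os.path.isdir(f"/proc/{tid}")

        # Equivalent of: kill -0 <TID>
        try:
            os.kill(tid, 0)
            result["kill0_ok"] = True
        except ProcessLookupError:
            result["kill0_ok"] = False
        except PermissionError:
            result["kill0_ok"] = True  # the ID exists but belongs to someone else
    finally:
        done.set()
        helper.join()
    return result


if __name__ == "__main__":
    print(probe_thread_visibility())
```

If the behaviour described above holds on the system under test, one would expect the TID to be missing from the listing while the direct path and the signal 0 check both succeed. The result is left unasserted here because it depends on the kernel.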
> It seems that the 6156 in your example is a thread ID (TID), and a TID can be passed to the kill(2) syscall.
Forehead slap! I didn't think of this at all. Good grief. Maybe I'm too old school.
The problem occurred because one of our scripts was of the opinion that the PID an earlier instance had written to a pidfile/lockfile was still valid and it just refused to continue. It seems that the space of PIDs is becoming somewhat constrained now that it is shared with TIDs. Time to move PID/TID to a sparsely populated 128-bit space. Maybe.
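For what it's worth, a pidfile check could tell a process apart from a thread by reading the Tgid field of /proc/<id>/status, which equals the ID only for a thread-group leader. This is a rough sketch, not what our script actually does: the names `tgid_from_status` and `pidfile_owner_alive` and the `proc_root` parameter are made up for illustration, and it assumes Linux with /proc mounted.

```python
import os


def tgid_from_status(status_text):
    """Return the Tgid value from /proc/<id>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Tgid:"):
            return int(line.split()[1])
    return None


def pidfile_owner_alive(pid, proc_root="/proc"):
    """True only if `pid` names a live process (thread-group leader).

    A bare thread ID yields False because its Tgid differs from the ID itself.
    `proc_root` is a hypothetical parameter, overridable for testing.
    """
    status_path = os.path.join(proc_root, str(pid), "status")
    try:
        with open(status_path) as fh:
            tgid = tgid_from_status(fh.read())
    except (FileNotFoundError, NotADirectoryError, ProcessLookupError):
        # No such task, or it vanished while we were reading.
        return False
    return tgid == pid


if __name__ == "__main__":
    # The current process is its own thread-group leader.
    print(pidfile_owner_alive(os.getpid()))
```

This only rules out the thread case. It does not detect a PID that has been reused by an unrelated process, so a real script would probably also want to compare something like the command name or start time.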
On the other hand, there is still a problem somewhere.
- The strace shows a call to "kill", but:
- The kill(2) manpage doesn't mention threads at all.
- There is a specially designed tgkill(2) to signal threads.
Its manpage says: "(By contrast, kill(2) can only be used to send a
signal to a process (i.e., thread group) as a whole, and the signal will be
delivered to an arbitrary thread within that process.)"
So either the kill(2) and tgkill(2) manpages are wrong or incomplete, or there is an implementation problem with the kill syscall.