Red Hat Bugzilla – Bug 63370
strace -f -p pid causes pid to hang after a child exits
Last modified: 2007-04-18 12:41:57 EDT
Description of Problem:
strace -f -p pid, when run on the process ID of a process that forks, will cause
pid and its children to hang irrecoverably (except with kill -9, as far as I can
tell) when the child process exits, at least in some easily reproducible cases.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. ps aux | grep sendmail to get sendmail's pid
2. strace -f -p pid where pid is sendmail's pid
3. telnet localhost smtp
You will now find that sendmail and the strace are both hung. You can get out
of the strace with CTRL-Z and kill -9. You will have to kill -9 all the
sendmail processes, which show up in ps with a status of T.
Or you can reproduce this more simply with my demo program, attached as a file.
This demo program, in a loop, forks a child which prints a message, waits one
second, prints another message, and exits. The parent waits for the child to
exit with waitpid. If you run the program normally, you will observe that there
are no zombie instances of it hanging around. In other words, it is a
Run this program in the background. strace -f -p pid where pid is its primary
process ID. You can observe the hanging behavior right away.
The traced program and strace both hang.
You should be able to exit strace with CTRL-C and have the traced
program return to its normal status.
This problem is not reproducible on a RedHat 7.2 system with multiple updates
from up2date applied.
This problem does not occur with strace-4.3-2 from RedHat 7.2
I listed the severity of this problem as HIGH because someone could cripple
their system unknowingly, or at least wind up having to forcibly kill system
processes, by running strace on some system daemons. If you disagree, I'm sure
you will change the severity.
This problem has existed since skipjack beta-1, but I only reported it now
because I wanted to boil it down to a simple example and rule out other
possible causes of the problem. I am now running skipjack beta-2 (installed
from scratch) with all current updates available from up2date applied.
Created attachment 53691 [details]
source code for simple example program
*** This bug has been marked as a duplicate of 62591 ***
I have reopened this bug, changed it to a RedHat 7.3 bug (rather than a skipjack
bug), and set the severity to normal instead of high. Although this bug was
marked as a duplicate of bug 62591, it really isn't. The problem reported there
does indeed appear to have been fixed by strace-4.4-4, but the problem reported
here has not been entirely fixed, though the nature of the problem has changed.
Before, I had marked this as high severity because it was possible to
unknowingly disable the system with this problem. That is no longer possible
though as the nature of what goes wrong has changed. For this reason, I have
dropped the severity back down to normal.
The behavior now is that one or more of the traced processes may have their
status changed to T, but a kill -CONT fixes the problem and lets things continue
where they left off. For example, again run my C program in the background and
run strace -f -p pid where pid is the primary process assigned to the job as
returned by jobs -l.
It is now possible to get out of strace with CTRL-C. When you do, one of the
child processes will be left with 'T' as its status. If you kill -CONT that
process, its parent gets left with 'T' as its status. If you kill -CONT that
process, everything is back to normal. A kill -CONT to the job from the shell
from which it was started (propagating the signal to the process group) should
I think the specifics are that whatever process strace is printing information
about at the time that it is interrupted remains STOPped, at least if the
process is in certain states.
Let me give a more exact recipe for reproducing the problem.
Start my program in the background:
$ ./a.out &
Now, in another window, type strace -f -p 4790, where 4790 is the process ID
of the background job.
You should see plenty of output that more or less alternates between 4790 and each
new child process. Wait until a child process enters nanosleep(). As soon as
it does, hit CTRL-C on the strace process.
Do ps lpid where pid is the child process that was in nanosleep. You'll see
that it has T as its status. Send kill -CONT to it. Everything may return to
normal, or the parent process may be stopped. Run strace again. This time,
CTRL-C when the parent is in nanosleep. The same thing happens -- the parent
process ends up STOPped. Hit return in the window with the shell that
originally started the program. You'll see
+ Stopped ./a.out
Run kill -CONT %, and everything again returns to normal.
I do not believe that interrupting strace should leave the child process stopped.
*** Bug 64560 has been marked as a duplicate of this bug. ***
This problem still exists in limbo. Is anyone investigating?
I'm going to post a new bug for the one small aspect of this that is still
around. Then maybe this one should be closed again.
Since I reopened this bug and nothing happened, I am taking the liberty of
closing it again. I have reported the one remaining problem in bug 71166.