Description of Problem:
strace -f -p pid, when run on the process ID of a process that forks, causes pid and its children to hang irrecoverably (except with kill -9, as far as I can tell) when the child process exits, at least in some easily reproducible cases.

Version-Release number of selected component (if applicable):
strace-4.4-3
glibc-2.2.5-32
kernel-2.4.18-0.20

How Reproducible:
always

Steps to Reproduce:
With sendmail:
1. ps aux | grep sendmail to get sendmail's pid
2. strace -f -p pid, where pid is sendmail's pid
3. telnet localhost smtp
4. QUIT

You will now find that sendmail and the strace are both hung. You can get out of the strace with CTRL-Z and kill -9. You will have to kill -9 all the sendmail processes, which show up in ps with a status of T.

Or you can reproduce this more simply with my demo program, attached as a file. This demo program, in a loop, forks a child which prints a message, waits one second, prints another message, and exits. The parent waits for the child to exit with waitpid. If you run the program normally, you will observe that there are no zombie instances of it hanging around. In other words, it is a well-behaved process.

Run this program in the background, then run strace -f -p pid, where pid is its primary process ID. You can observe the hanging behavior right away.

Actual Results:
The traced program and strace both hang.

Expected Results:
You should be able to exit strace with CTRL-C, and the traced program should return to its normal status.

Additional Information:
This problem is not reproducible on a RedHat 7.2 system with multiple updates from up2date applied. This problem does not occur with strace-4.3-2 from RedHat 7.2.

I listed the severity of this problem as HIGH because someone could cripple their system unknowingly, or at least wind up having to forcibly kill system processes, by running strace on some system daemons. If you disagree, I'm sure you will change the severity.
This problem has existed since skipjack beta-1, but I only reported it now because I wanted to boil it down to a simple example and rule out other possible causes of the problem. I am now running skipjack beta-2 (installed from scratch) with all current updates available from up2date applied.
Created attachment 53691: source code for simple example program
*** This bug has been marked as a duplicate of 62591 ***
I have reopened this bug, changed it to a RedHat 7.3 bug (rather than a skipjack bug), and set the severity to normal instead of high. Although this bug was marked as a duplicate of bug 62591, it really isn't. The problem reported there does indeed appear to have been fixed by strace-4.4-4, but the problem reported here has not been entirely fixed, though the nature of the problem has changed.

Before, I had marked this as high severity because it was possible to unknowingly disable the system with this problem. That is no longer possible, since the nature of what goes wrong has changed. For this reason, I have dropped the severity back down to normal.

The behavior now is that one or more of the traced processes may have their status changed to T, but a kill -CONT fixes the problem and lets things continue where they left off. For example, again run my C program in the background and run strace -f -p pid, where pid is the primary process assigned to the job as returned by jobs -l. It is now possible to get out of strace with CTRL-C. When you do, one of the child processes will be left with 'T' as its status. If you kill -CONT that process, its parent gets left with 'T' as its status. If you kill -CONT the parent as well, everything is back to normal. A kill -CONT to the job from the shell from which it was started (propagating the signal to the process group) should also work.

I think the specifics are that whatever process strace is printing information about at the time that it is interrupted remains STOPped, at least if the process is in certain states.

Let me give a more exact recipe for reproducing the problem. Start my program in the background:

  $ ./a.out &
  [1] 4790

Now, in another window, type strace -f -p 4790. You should see plenty of output that more or less alternates between 4790 and each new child process. Wait until a child process enters nanosleep(). As soon as it does, hit CTRL-C on the strace process.
Do ps lpid where pid is the child process that was in nanosleep. You'll see that it has T as its status. Send kill -CONT to it. Everything may return to normal, or the parent process may be stopped.

Run strace again. This time, hit CTRL-C when the parent is in nanosleep. The same thing happens: the parent process ends up STOPped. Hit return in the window with the shell that originally started the program. You'll see

  [1]+  Stopped                 ./a.out

Run kill -CONT %, and everything again returns to normal.

I do not believe that interrupting strace should leave the child process stopped.
*** Bug 64560 has been marked as a duplicate of this bug. ***
This problem still exists in limbo. Is anyone investigating?
I'm going to post a new bug for the one small aspect of this that is still around. Then maybe this one should be closed again.
Since I reopened this bug and nothing happened, I am taking the liberty of closing it again. I have reported the one remaining problem in bug 71166.