Bug 63370 - strace -f -p pid causes pid to hang after a child exits
Status: CLOSED CURRENTRELEASE
Product: Red Hat Linux
Classification: Retired
Component: strace
Version: 7.3
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Jakub Jelinek
QA Contact: Brian Brock
Duplicates: 64560
Reported: 2002-04-12 20:22 EDT by Jay Berkenbilt
Modified: 2007-04-18 12:41 EDT (History)
Doc Type: Bug Fix
Last Closed: 2002-08-09 10:50:20 EDT

Attachments
source code for simple example program (535 bytes, text/plain)
2002-04-12 20:26 EDT, Jay Berkenbilt
Description Jay Berkenbilt 2002-04-12 20:22:50 EDT
Description of Problem:

strace -f -p pid, when run on the process ID of a process that forks, will cause
pid and its children to hang irrecoverably (except with kill -9, as far as I can
tell) when the child process exits, at least in some easily reproducible cases.

Version-Release number of selected component (if applicable):

strace-4.4-3
glibc-2.2.5-32
kernel-2.4.18-0.20

How Reproducible:

always

Steps to Reproduce:
with sendmail:
1. ps aux | grep sendmail   to get sendmail's pid
2. strace -f -p pid         where pid is sendmail's pid
3. telnet localhost smtp
4. QUIT

You will now find that sendmail and strace are both hung.  You can get out
of the strace with CTRL-Z and kill -9.  You will have to kill -9 all the
sendmail processes, which show up in ps with a status of T.

Or you can reproduce this more simply with my demo program, attached as a file.
This demo program, in a loop, forks a child which prints a message, waits one
second, prints another message, and exits.  The parent waits for the child to
exit with waitpid.  If you run the program normally, you will observe that there
are no zombie instances of it hanging around.  In other words, it is a
well-behaved process.

Run this program in the background.  Then run strace -f -p pid, where pid is
its primary process ID.  You can observe the hanging behavior right away.

Actual Results:

The traced program and strace both hang.

Expected Results:

You should be able to exit strace with CTRL-C, and the traced program should
return to its normal status.

Additional Information:
	
This problem is not reproducible on a Red Hat 7.2 system with multiple updates
from up2date applied.

This problem does not occur with strace-4.3-2 from Red Hat 7.2.

I listed the severity of this problem as HIGH because someone could cripple
their system unknowingly, or at least wind up having to forcibly kill system
processes, by running strace on some system daemons.  If you disagree, I'm sure
you will change the severity.

This problem has existed since skipjack beta-1, but I only reported it now
because I wanted to boil it down to a simplistic example and rule out other
possible causes of the problem.  I am now running skipjack beta-2 (installed
from scratch) with all current updates available from up2date applied.
Comment 1 Jay Berkenbilt 2002-04-12 20:26:04 EDT
Created attachment 53691 [details]
source code for simple example program
Comment 2 Bill Nottingham 2002-04-14 23:26:59 EDT

*** This bug has been marked as a duplicate of 62591 ***
Comment 3 Jay Berkenbilt 2002-05-16 12:44:07 EDT
I have reopened this bug, changed it to a Red Hat 7.3 bug (rather than a skipjack
bug), and set the severity to normal instead of high.  Although this bug was
marked as a duplicate of bug 62591, it really isn't.  The problem reported there
does indeed appear to have been fixed by strace-4.4-4, but the problem reported
here has not been entirely fixed, though the nature of the problem has changed.
Before, I had marked this as high severity because it was possible to
unknowingly disable the system with this problem.  That is no longer possible,
because the nature of what goes wrong has changed.  For this reason, I have
dropped the severity back down to normal.

The behavior now is that one or more of the traced processes may have their
status changed to T, but a kill -CONT fixes the problem and lets things continue
where they left off.  For example, again run my C program in the background and
run strace -f -p pid where pid is the primary process assigned to the job as
returned by jobs -l.

It is now possible to get out of strace with CTRL-C.  When you do, one of the
child processes will be left with 'T' as its status.  If you kill -CONT that
process, its parent gets left with 'T' as its status.  If you kill -CONT that
process, everything is back to normal.  A kill -CONT to the job from the shell
from which it was started (propagating the signal to the process group) should
also work.

I think the specifics are that whatever process strace is printing information
about at the time that it is interrupted remains STOPped, at least if the
process is in certain states.

Let me give a more exact recipe for reproducing the problem.

Start my program in the background:

$ ./a.out &
[1] 4790

Now, in another window, type strace -f -p 4790.

You should see plenty of output that more or less alternates between 4790 and each
new child process.  Wait until a child process enters nanosleep().  As soon as
it does, hit CTRL-C on the strace process.

Run ps lpid, where pid is the child process that was in nanosleep.  You'll see
that it has T as its status.  Send kill -CONT to it.  Everything may return to
normal, or the parent process may be stopped.  Run strace again.  This time,
CTRL-C when the parent is in nanosleep.  The same thing happens -- the parent
process ends up STOPped.  Hit return in the window with the shell that
originally started the program. You'll see

[1]+  Stopped           ./a.out

Run kill -CONT %, and everything again returns to normal.

I do not believe that interrupting strace should leave the child process stopped.

Comment 4 Trond Eivind Glomsrød 2002-05-23 10:58:29 EDT
*** Bug 64560 has been marked as a duplicate of this bug. ***
Comment 5 Jay Berkenbilt 2002-08-05 21:49:55 EDT
This problem still exists in limbo.  Is anyone investigating?
Comment 6 Jay Berkenbilt 2002-08-09 10:50:15 EDT
I'm going to post a new bug for the one small aspect of this that is still
around.  Then maybe this one should be closed again.
Comment 7 Jay Berkenbilt 2002-08-09 11:06:33 EDT
Since I reopened this bug and nothing happened, I am taking the liberty of
closing it again.  I have reported the one remaining problem in bug 71166.
