Bug 63370

Summary: strace -f -p pid causes pid to hang after a child exits
Product: [Retired] Red Hat Linux
Reporter: Jay Berkenbilt <ejb>
Component: strace
Assignee: Jakub Jelinek <jakub>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.3
CC: bbaetz, ejb, geoffrey, jwm
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2002-08-09 14:50:20 UTC
Type: ---
Attachments:
  Description: source code for simple example program
  Flags: none

Description Jay Berkenbilt 2002-04-13 00:22:50 UTC
Description of Problem:

strace -f -p pid, when run on the process ID of a process that forks, will cause
pid and its children to hang irrecoverably (except with kill -9, as far as I can
tell) when the child process exits, at least in some easily reproducible cases.

Version-Release number of selected component (if applicable):

strace-4.4-3
glibc-2.2.5-32
kernel-2.4.18-0.20

How Reproducible:

always

Steps to Reproduce:
with sendmail:
1. ps aux | grep sendmail   to get sendmail's pid
2. strace -f -p pid         where pid is sendmail's pid
3. telnet localhost smtp
4. QUIT

You will now find that sendmail and strace are both hung.  You can get out
of strace with CTRL-Z and kill -9.  You will then have to kill -9 all the
sendmail processes, which show up in ps with a status of T.

Or you can reproduce this more simply with my demo program, attached as a file.
This demo program, in a loop, forks a child which prints a message, waits one
second, prints another message, and exits.  The parent waits for the child to
exit with waitpid.  If you run the program normally, you will observe that there
are no zombie instances of it hanging around.  In other words, it is a
well-behaved process.

Run this program in the background.  Then run strace -f -p pid, where pid is
its primary process ID.  You can observe the hanging behavior right away.
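
The attachment itself is not reproduced on this page.  As a rough sketch of
what the description above implies, a program along these lines forks a child
that prints a message, sleeps one second, prints again, and exits, while the
parent reaps it with waitpid.  The names fork_and_wait_once and run_demo, the
message text, and the iteration parameter are assumptions made for
illustration; they are not taken from the attached source.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork one child that prints, sleeps one second,
 * prints again, and exits.  The parent reaps it with waitpid().
 * Returns 0 if the child exited normally with status 0, -1 otherwise. */
static int fork_and_wait_once(void)
{
    fflush(stdout);                 /* do not duplicate buffered output */

    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return -1;
    }

    if (child == 0) {
        /* Child: message, one-second pause, message, exit. */
        printf("child %ld: starting\n", (long)getpid());
        fflush(stdout);
        sleep(1);
        printf("child %ld: exiting\n", (long)getpid());
        fflush(stdout);
        _exit(0);
    }

    /* Parent: wait for this child so that no zombie is left behind. */
    int status;
    pid_t reaped;
    do {
        reaped = waitpid(child, &status, 0);
    } while (reaped < 0 && errno == EINTR);

    if (reaped != child) {
        perror("waitpid");
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Hypothetical entry point: repeat the fork/wait cycle.
 * iterations <= 0 means loop forever, as the report describes. */
int run_demo(int iterations)
{
    int done = 0;
    while (iterations <= 0 || done < iterations) {
        if (fork_and_wait_once() != 0)
            return -1;
        if (iterations > 0)
            done++;
    }
    return 0;
}
```

Calling run_demo(0) from main gives the endless loop described above.  Built
and started in the background, that is the process whose pid would be passed
to strace -f -p.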

Actual Results:

The traced program and strace both hang

Expected Results:

You should be able to exit strace with CTRL-C and have it exit and the traced
program return to its normal status.

Additional Information:
	
This problem is not reproducible on a RedHat 7.2 system with multiple updates
from up2date applied.

This problem does not occur with strace-4.3-2 from RedHat 7.2

I listed the severity of this problem as HIGH because someone could cripple
their system unknowingly, or at least wind up having to forcibly kill system
processes, by running strace on some system daemons.  If you disagree, I'm sure
you will change the severity.

This problem has existed since skipjack beta-1, but I only reported it now
because I wanted to boil it down to a simplistic example and rule out other
possible causes of the problem.  I am now running skipjack beta-2 (installed
from scratch) with all current updates available from up2date applied.

Comment 1 Jay Berkenbilt 2002-04-13 00:26:04 UTC
Created attachment 53691 [details]
source code for simple example program

Comment 2 Bill Nottingham 2002-04-15 03:26:59 UTC

*** This bug has been marked as a duplicate of 62591 ***

Comment 3 Jay Berkenbilt 2002-05-16 16:44:07 UTC
I have reopened this bug, changed it to a RedHat 7.3 bug (rather than a skipjack
bug), and set the severity to normal instead of high.  Although this bug was
marked as a duplicate of bug 62591, it really isn't.  The problem reported there
does indeed appear to have been fixed by strace-4.4-4, but the problem reported
here has not been entirely fixed, though the nature of the problem has changed.
Before, I had marked this as high severity because it was possible to
unknowingly disable the system with this problem.  That is no longer possible
though as the nature of what goes wrong has changed.  For this reason, I have
dropped the severity back down to normal.

The behavior now is that one or more of the traced processes may have their
status changed to T, but a kill -CONT fixes the problem and lets things continue
where they left off.  For example, again run my C program in the background and
run strace -f -p pid where pid is the primary process assigned to the job as
returned by jobs -l.

It is now possible to get out of strace with CTRL-C.  When you do, one of the
child processes will be left with 'T' as its status.  If you kill -CONT that
process, its parent gets left with 'T' as its status.  If you kill -CONT that
process, everything is back to normal.  A kill -CONT to the job from the shell
from which it was started (propagating the signal to the process group) should
also work.

I think the specifics are that whatever process strace is printing information
about at the time that it is interrupted remains STOPped, at least if the
process is in certain states.

Let me give a more exact recipe for reproducing the problem.

Start my program in the background:

$ ./a.out &
[1] 4790

Now, in another window, type strace -f -p 4790.

You should see plenty of output that more or less alternates between 4790 and
each new child process.  Wait until a child process enters nanosleep().  As soon
as it does, hit CTRL-C on the strace process.

Do ps lpid where pid is the child process that was in nanosleep.  You'll see
that it has T as its status.  Send kill -CONT to it.  Everything may return to
normal, or the parent process may be stopped.  Run strace again.  This time,
CTRL-C when the parent is in nanosleep.  The same thing happens -- the parent
process ends up STOPped.  Hit return in the window with the shell that
originally started the program. You'll see

[1]+  Stopped           ./a.out

Run kill -CONT %, and everything again returns to normal.

I do not believe that interrupting strace should leave the child process stopped.



Comment 4 Trond Eivind Glomsrød 2002-05-23 14:58:29 UTC
*** Bug 64560 has been marked as a duplicate of this bug. ***

Comment 5 Jay Berkenbilt 2002-08-06 01:49:55 UTC
This problem still exists in limbo.  Is anyone investigating?

Comment 6 Jay Berkenbilt 2002-08-09 14:50:15 UTC
I'm going to post a new bug for the one small aspect of this that is still
around.  Then maybe this one should be closed again.

Comment 7 Jay Berkenbilt 2002-08-09 15:06:33 UTC
Since I reopened this bug and nothing happened, I am taking the liberty of
closing it again.  I have reported the one remaining problem in bug 71166.