Bug 218410

Summary:	non-main task's waitpid exited status lost when tracing
Product:	Red Hat Enterprise Linux 5	Reporter:	Andrew Cagney <cagney>
Component:	frysk	Assignee:	Andrew Cagney <cagney>
Status:	CLOSED ERRATA	QA Contact:	Len DiMaggio <ldimaggi>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5.0	CC:	kasal, mcvet, mjw, npremji, pmuldoon, rmoseley, roland, scox, timoore
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHEA-2007-0592	Doc Type:	Bug Fix
Last Closed:	2007-11-07 18:05:47 UTC	Type:	---
Bug Blocks:	173278

Description Andrew Cagney 2006-12-05 05:46:28 UTC

The non-main task's exited waitpid status gets lost.  In the failing trace
below, the exiting status (WIFSTOPPED/EXIT) is seen, but the final exited
status never follows.

1998.1999: received signal 7 (Bus error)
1998.1999: exit this thread
5-Dec-06 12:23:35 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@1dcc60,sig=Sig_CHLD} execute

5-Dec-06 12:23:35 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 1999 status 0x6057f WIFSTOPPED/EXIT 5 (Trace/breakpoint
trap)

5-Dec-06 12:23:35 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)

5-Dec-06 12:23:35 AM frysk.proc.LinuxHost$PollWaitOnSigChld$5 getTask
FINE: {TaskId,1999} exitEvent

5-Dec-06 12:23:35 AM frysk.proc.Host get
FINE: {frysk.proc.LinuxHost@13a0c0,state=running} get TaskId

5-Dec-06 12:23:35 AM frysk.proc.TaskState$Running handleTerminatingEvent
FINE: {frysk.proc.LinuxTask@20be70,pid=1998,tid=1999,state=running}
handleTerminatingEvent

5-Dec-06 12:23:35 AM frysk.proc.LinuxTask sendContinue
FINE: {frysk.proc.LinuxTask@20be70,pid=1998,tid=1999,state=running} sendContinue

5-Dec-06 12:23:35 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@1dcc60,sig=Sig_CHLD} execute

5-Dec-06 12:23:35 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)

5-Dec-06 12:23:40 AM frysk.event.EventLoop$2$Timeout execute
FINE:
{{frysk.event.EventLoop$2$Timeout@28ded8,timeMillis=1165296220425,periodMillis=0},expiredfalse}
execute

Contrast this with a working trace:

11831.11832: received signal 7 (Bus error)
11831.11832: exit this thread
5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@2d5880,sig=Sig_CHLD} execute

5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 11832 status 0x6057f WIFSTOPPED/EXIT 5
(Trace/breakpoint trap)

5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)

5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld$5 getTask
FINE: {TaskId,11832} exitEvent

5-Dec-06 12:32:56 AM frysk.proc.Host get
FINE: {frysk.proc.LinuxHost@2176c0,state=running} get TaskId

5-Dec-06 12:32:56 AM frysk.proc.TaskState$Running handleTerminatingEvent
FINE: {frysk.proc.LinuxTask@21baf0,pid=11831,tid=11832,state=running}
handleTerminatingEvent

5-Dec-06 12:32:56 AM frysk.proc.LinuxTask sendContinue
FINE: {frysk.proc.LinuxTask@21baf0,pid=11831,tid=11832,state=running} sendContinue

5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@2d5880,sig=Sig_CHLD} execute

5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 11832 status 0x0 WIFEXITED 0 (exit status)

5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)

5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld$5 getTask
FINE: {TaskId,11832} terminated
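
For anyone decoding the traces by hand: the status word 0x6057f reported in
both traces matches the usual Linux layout for a ptrace event stop, with 0x7f
in the low byte (stopped), the stop signal 5 (SIGTRAP) in the next byte, and
the event code 6 (PTRACE_EVENT_EXIT) above that.  The Python sketch below only
illustrates that layout; decode_status is a hypothetical helper, not part of
frysk, and the bit layout and constants are assumed from the Linux headers.

```python
# Decode a raw Linux wait status by hand.
# Assumed layout: low 7 bits = terminating signal, 0x7f in the low byte =
# stopped, next byte = exit code or stop signal, bits 16+ = ptrace event.
PTRACE_EVENT_EXIT = 6  # value assumed from <linux/ptrace.h>
SIGTRAP = 5


def decode_status(status):
    """Return a small dict describing a raw wait status (hypothetical helper)."""
    if (status & 0xFF) == 0x7F:
        return {
            "kind": "stopped",
            "signal": (status >> 8) & 0xFF,
            "ptrace_event": status >> 16,
        }
    if (status & 0x7F) == 0:
        return {"kind": "exited", "exit_code": (status >> 8) & 0xFF}
    # Anything else is treated as death by signal (continued is not handled).
    return {"kind": "signaled", "signal": status & 0x7F}


print(decode_status(0x6057F))  # the exiting stop seen in both traces
print(decode_status(0x0))      # the exited status seen only in the working trace
```

Under that reading, the failing trace stops after the first kind of status and
never reports the second.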

Comment 1 Andrew Cagney 2006-12-05 05:48:11 UTC

Frysk bug: http://sourceware.org/bugzilla/show_bug.cgi?id=3486

Comment 3 Andrew Cagney 2006-12-05 14:41:19 UTC

(In reply to comment #2)
> appears to show only WNOHANG calls.
> that is racy.  after SIGCHLD, some short period may pass before wait succeeds.
> your guarantee is that a blocking wait will block a very short time, not that a
> WNOHANG wait will succeed immediately.

Que?

Comment 5 Andrew Cagney 2007-03-23 21:36:51 UTC

Was POSIX documentation explaining SIGCHLD and its quirks with waitpid ever located?

Is the assumption that SIGCHLD is always posted after the wait status has been
recorded (i.e., SIGIO-like behavior) wrong?

Does:

-> SIGCHLD remain pending while waitpid events are pending, allowing one waitpid
read per signal to work?

-> SIGCHLD get withdrawn when all waitpid events have been consumed, allowing
more efficient draining of waitpid events?

Testing shows that at least the second isn't true; the first, given that the
signal is not a counting signal, likely isn't either.
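
The remark that the signal is not counting can be illustrated outside frysk.
The sketch below (Python, assuming Linux and Python 3.8 or later) raises
SIGCHLD by hand several times while it is blocked, then counts how many times
the handler runs once it is unblocked.  count_deliveries is a hypothetical
name, and raising the signal manually is a stand-in for exiting children, so
this shows only the coalescing of a standard signal, not the kernel's waitpid
bookkeeping.

```python
import signal


def count_deliveries(sig=signal.SIGCHLD, raises=3):
    """Raise `sig` repeatedly while blocked; count handler runs after unblocking."""
    delivered = []
    old_handler = signal.signal(sig, lambda signum, frame: delivered.append(signum))
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        for _ in range(raises):
            signal.raise_signal(sig)  # stays pending because sig is blocked
        assert sig in signal.sigpending()
    finally:
        # Restoring the old mask lets whatever is pending be delivered.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        if old_handler is not None:
            signal.signal(sig, old_handler)
    return len(delivered)


print(count_deliveries())
```

If the count comes back lower than the number of raises, a handler that
performs exactly one waitpid per SIGCHLD can leave wait events unread.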

Comment 6 Andrew Cagney 2007-04-04 19:55:53 UTC

A rewrite of frysk's event loop to use a blocking waitpid call will prevent the
occasional hangs seen when monitoring a process.  The new code is currently
being tested upstream.

Tests are included in frysk's testsuite.
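
The difference between polling and blocking can be sketched as follows.  This
is Python rather than frysk's Java/CNI code, and it is not the WaitEventLoop
implementation; wait_for_child and its parameters are hypothetical.  A child
exits after a short delay; a WNOHANG poll issued straight away may find
nothing, whereas a blocking waitpid returns once the status is actually
available.

```python
import os
import time


def wait_for_child(exit_code=7, delay=0.2):
    """Fork a child that exits after `delay` seconds, then reap it.

    Returns (poll_was_empty, exited_normally, exit_status).
    """
    pid = os.fork()
    if pid == 0:
        # Child: linger briefly, then exit without running parent cleanup.
        time.sleep(delay)
        os._exit(exit_code)

    # Non-blocking poll: returns (0, 0) if no status is available yet.
    polled_pid, status = os.waitpid(pid, os.WNOHANG)
    if polled_pid == 0:
        # Blocking wait: returns only once the child's status is recorded.
        _, status = os.waitpid(pid, 0)
    return polled_pid == 0, os.WIFEXITED(status), os.WEXITSTATUS(status)


print(wait_for_child())
```

Whether the first poll comes back empty depends on timing, which is the point:
only the blocking call is guaranteed to see the status.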

Comment 7 RHEL Program Management 2007-04-04 20:06:15 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 8 Andrew Cagney 2007-04-09 18:13:00 UTC

Fixes committed upstream.  Note that two tests, testCloneThenKillAttached and
testDeleteAttached, have been enabled in the testsuite and are now expected to
pass.

Index: frysk-core/frysk/proc/ChangeLog
2007-04-09  Andrew Cagney  <cagney>

        * TestProcTasksObserver.java (testCloneThenKillAttached)
        (testDeleteAttached): Remove brokenIfUtraceXXX due to 3486.
        * Manager.java (usePoll): Set to false, enable WaitEventLoop.

Index: frysk-imports/frysk/sys/ChangeLog
2007-04-09  Andrew Cagney  <cagney>

        * cni/Wait.cxx (log): Add "logger" parameter, update calls.
        (waitForEvent): Delete.
        (waitAll): Use "log".  Replace loop calling waitForEvent with
        multiple waitpid calls.

Comment 13 errata-xmlrpc 2007-11-07 18:05:47 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2007-0592.html