The non-main task's exited waitpid status gets lost. In the below, the exiting status is seen but nothing further.

1998.1999: received signal 7 (Bus error)
1998.1999: exit this thread
5-Dec-06 12:23:35 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@1dcc60,sig=Sig_CHLD} execute
5-Dec-06 12:23:35 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 1999 status 0x6057f WIFSTOPPED/EXIT 5 (Trace/breakpoint trap)
5-Dec-06 12:23:35 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)
5-Dec-06 12:23:35 AM frysk.proc.LinuxHost$PollWaitOnSigChld$5 getTask
FINE: {TaskId,1999} exitEvent
5-Dec-06 12:23:35 AM frysk.proc.Host get
FINE: {frysk.proc.LinuxHost@13a0c0,state=running} get TaskId
5-Dec-06 12:23:35 AM frysk.proc.TaskState$Running handleTerminatingEvent
FINE: {frysk.proc.LinuxTask@20be70,pid=1998,tid=1999,state=running} handleTerminatingEvent
5-Dec-06 12:23:35 AM frysk.proc.LinuxTask sendContinue
FINE: {frysk.proc.LinuxTask@20be70,pid=1998,tid=1999,state=running} sendContinue
5-Dec-06 12:23:35 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@1dcc60,sig=Sig_CHLD} execute
5-Dec-06 12:23:35 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)
5-Dec-06 12:23:40 AM frysk.event.EventLoop$2$Timeout execute
FINE: {{frysk.event.EventLoop$2$Timeout@28ded8,timeMillis=1165296220425,periodMillis=0},expiredfalse} execute

Contrast this with a working trace:

11831.11832: received signal 7 (Bus error)
11831.11832: exit this thread
5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@2d5880,sig=Sig_CHLD} execute
5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 11832 status 0x6057f WIFSTOPPED/EXIT 5 (Trace/breakpoint trap)
5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)
5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld$5 getTask
FINE: {TaskId,11832} exitEvent
5-Dec-06 12:32:56 AM frysk.proc.Host get
FINE: {frysk.proc.LinuxHost@2176c0,state=running} get TaskId
5-Dec-06 12:32:56 AM frysk.proc.TaskState$Running handleTerminatingEvent
FINE: {frysk.proc.LinuxTask@21baf0,pid=11831,tid=11832,state=running} handleTerminatingEvent
5-Dec-06 12:32:56 AM frysk.proc.LinuxTask sendContinue
FINE: {frysk.proc.LinuxTask@21baf0,pid=11831,tid=11832,state=running} sendContinue
5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld execute
FINE: {frysk.proc.LinuxHost$PollWaitOnSigChld@2d5880,sig=Sig_CHLD} execute
5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 11832 status 0x0 WIFEXITED 0 (exit status)
5-Dec-06 12:32:56 AM frysk.sys.Wait waitAllNoHang
FINE: frysk.sys.Wait pid 0 errno 0 (Success)
5-Dec-06 12:32:56 AM frysk.proc.LinuxHost$PollWaitOnSigChld$5 getTask
FINE: {TaskId,11832} terminated
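For readers decoding the status words in the traces, here is a minimal sketch of how a Linux wait status such as 0x6057f splits into its fields. It assumes the usual Linux encoding (low byte 0x7f means stopped, the next byte is the stop signal, and under ptrace the bits above that carry the event code, where 6 is PTRACE_EVENT_EXIT). The function name and the returned dictionary layout are hypothetical; frysk's own decoding in Wait.cxx is not shown in this report.

```python
def decode_wait_status(status):
    """Split a raw Linux wait status into its fields (illustrative helper)."""
    if status == 0xFFFF:
        return {"kind": "continued"}
    low = status & 0xFF
    if low == 0x7F:
        # Stopped: bits 8-15 hold the stop signal, bits 16+ the ptrace event.
        return {
            "kind": "stopped",
            "signal": (status >> 8) & 0xFF,
            "ptrace_event": status >> 16,
        }
    if (status & 0x7F) == 0:
        # Normal exit: bits 8-15 hold the exit code.
        return {"kind": "exited", "exit_code": (status >> 8) & 0xFF}
    # Killed by a signal: low 7 bits hold the signal, bit 7 the core-dump flag.
    return {
        "kind": "signaled",
        "signal": status & 0x7F,
        "core_dumped": bool(status & 0x80),
    }


if __name__ == "__main__":
    print(decode_wait_status(0x6057F))
    print(decode_wait_status(0x0))
```

Under these assumptions, 0x6057f is a stop with signal 5 (SIGTRAP) and event 6, matching the "WIFSTOPPED/EXIT 5" entry, and 0x0 is the plain exit that only the working trace goes on to receive.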
Frysk bug: http://sourceware.org/bugzilla/show_bug.cgi?id=3486
(In reply to comment #2)
> appears to show only WNOHANG calls.
> that is racy. after SIGCHLD, some short period may pass before wait succeeds.
> your guarantee is that a blocking wait will block a very short time, not that a
> WNOHANG wait will succeed immediately.

Que?
Was POSIX documentation explaining SIGCHLD and its quirks with waitpid ever located? Is the assumption that SIGCHLD is always posted after the wait status has been recorded (i.e., SIGIO behavior) wrong? Does:
-> SIGCHLD remain pending while waitpid events are pending, allowing one waitpid read per signal to work?
-> SIGCHLD get withdrawn when all waitpid events have been consumed, allowing more efficient draining of waitpid events?
Testing shows that at least the second isn't true and the first, given that the signal is not counting, likely isn't either.
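The "not counting" point can be checked with a small experiment. The sketch below is not frysk code; it assumes Linux and Python 3 (os.waitid with WNOWAIT), and the function name is hypothetical. It blocks SIGCHLD, lets several children exit, then unblocks the signal and counts how often the handler runs before draining the statuses with WNOHANG.

```python
import os
import signal


def demo_coalesced_sigchld(n_children=3):
    """Return (handler_calls, children_reaped) after n_children exit
    while SIGCHLD is blocked. Illustrative only."""
    calls = []
    old_handler = signal.signal(
        signal.SIGCHLD, lambda signo, frame: calls.append(signo))
    # Hold SIGCHLD back so every child exits while the signal is blocked.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pids = []
        for _ in range(n_children):
            pid = os.fork()
            if pid == 0:
                os._exit(0)  # child: exit immediately
            pids.append(pid)
        # Wait until each child has exited, leaving its status unconsumed.
        for pid in pids:
            os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
        # Unblock: whatever SIGCHLD is pending gets delivered now.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        # Drain every available status without blocking.
        reaped = 0
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # no children left at all
            if pid == 0:
                break  # children exist, but none has a status ready
            if pid in pids:
                reaped += 1
        return len(calls), reaped
    finally:
        signal.signal(
            signal.SIGCHLD,
            old_handler if old_handler is not None else signal.SIG_DFL)


if __name__ == "__main__":
    print(demo_coalesced_sigchld(3))
```

Because SIGCHLD is a standard (non-queued) signal, the handler should run fewer times than there are exited children, so a design that does one waitpid read per signal would leave statuses unread.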
Rewriting frysk's event loop to use a blocking waitpid call will prevent the occasional hangs seen when monitoring a process. The new code is currently being tested upstream. Tests are included in frysk's testsuite.
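As a rough illustration of the blocking approach (not frysk's WaitEventLoop, whose code is not shown here), the sketch below blocks in waitpid until every expected child has reported, instead of polling with WNOHANG after each SIGCHLD. The helper names spawn and blocking_wait_loop are hypothetical.

```python
import os
import time


def spawn(exit_code, delay):
    """Fork a child that sleeps for `delay` seconds, then exits."""
    pid = os.fork()
    if pid == 0:
        time.sleep(delay)
        os._exit(exit_code)
    return pid


def blocking_wait_loop(expected):
    """Block in waitpid until every pid in `expected` has reported.

    Returns {pid: exit_code}, with a negative signal number for a child
    that was killed by a signal.
    """
    results = {}
    pending = set(expected)
    while pending:
        try:
            pid, status = os.waitpid(-1, 0)  # blocks until a status exists
        except ChildProcessError:
            break  # no children remain
        if pid not in pending:
            continue
        if os.WIFEXITED(status):
            results[pid] = os.WEXITSTATUS(status)
        elif os.WIFSIGNALED(status):
            results[pid] = -os.WTERMSIG(status)
        pending.discard(pid)
    return results


if __name__ == "__main__":
    a = spawn(3, 0.05)
    b = spawn(0, 0.0)
    print(blocking_wait_loop([a, b]))
```

With a blocking call there is no window between the signal arriving and the status becoming readable, which is the race described in the quoted reply above.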
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Fixes committed upstream. Note that two tests, testCloneThenKillAttached and testDeleteAttached, have been enabled in the testsuite and are now expected to pass.

Index: frysk-core/frysk/proc/ChangeLog
2007-04-09 Andrew Cagney <cagney>
* TestProcTasksObserver.java (testCloneThenKillAttached)
(testDeleteAttached): Remove brokenIfUtraceXXX due to 3486.
* Manager.java (usePoll): Set to false, enable WaitEventLoop.

Index: frysk-imports/frysk/sys/ChangeLog
2007-04-09 Andrew Cagney <cagney>
* cni/Wait.cxx (log): Add "logger" parameter, update calls.
(waitForEvent): Delete.
(waitAll): Use "log". Replace loop calling waitForEvent with multiple waitpid calls.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2007-0592.html