Bug 166669

Summary: [RHEL3 U5] waitpid() returns unexpected ECHILD
Product: Red Hat Enterprise Linux 3
Reporter: Issue Tracker <tao>
Component: kernel
Assignee: Ingo Molnar <mingo>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: lwang, maeno.masaki, pcormier, petrides, riek, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0144
Doc Type: Bug Fix
Last Closed: 2006-05-08 11:59:47 UTC
Attachments:
- test case for the bug (flags: none)
- A simple test case from IT 72214 (flags: none)

Description Issue Tracker 2005-08-24 14:48:39 UTC
Escalated to Bugzilla from IssueTracker

Comment 3 Jatin Nansi 2005-08-24 14:52:02 UTC
Created attachment 118063 [details]
test case for the bug

Comment 8 Wendy Cheng 2005-09-13 16:26:22 UTC
Created attachment 118759 [details]
A simple test case from IT 72214

1) Compile with "gcc -O2 -o killipf killipf.c -lpthread"
2) Run with "PASS=0; while ./killipf; do let PASS=++PASS; echo $PASS; done"
3) Program dies occasionally with "waitpid failed!: No child processes" 
4) With the Fujitsu patch, the program loops correctly forever with output:
child pid:xxxxx
Status returned:9
PASS : received expected signal 9

Comment 12 Ernie Petrides 2005-10-08 02:11:35 UTC
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.5.EL).


Comment 19 Masaki MAENO 2005-12-01 09:59:58 UTC
Please provide the beta kernel (kernel-2.4.21-37.EL --> kernel-2.4.21-37.5.EL)
that includes the patch.

We are encountering what looks like the same problem and would like to try the
patch.


Comment 20 Masaki MAENO 2005-12-05 01:21:45 UTC
I found kernel-2.4.21-37.6.EL on ftp.pbone.net.
We will try kernel-2.4.21-37.6.EL.src.rpm.

ftp://ftp.pbone.net/mirror/people-redhat.com/zaitcev/171129/kernel-2.4.21-37.6.EL.bz171129.1.src.rpm

Comment 23 Masaki MAENO 2006-01-31 04:14:34 UTC
$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <errno.h>
#include <unistd.h>
 
int main(int argc, char *argv[])
{
    int i, rc;
 
    signal(SIGCHLD, SIG_IGN);
 
    for (i=0; i<100; i++) {
        //rc = system("/bin/zcat a.gz > b.txt");
        rc = system("ls -a");
        fprintf(stderr, "system result[%3d] = %x errno=[%3d]", i, rc, errno);
        perror("system");
        errno = 0;
    }
 
    return 0;
}

$ gcc -o test test.c
$ ./test 2>&1 | tee test.txt
rt_sigaction(SIGINT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0xbfffcad8) = 13409
.
..
ignore.txt
ignore2
ignore2.c
ignore4
ignore4.c
ignore_st.log
waitpid(13409, 0xbfffcad4, 0)           = -1 ECHILD (No child processes)
rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(2, "system result[ 97] = ffffffff er"..., 41system result[ 97] = ffffffff errno=[ 10]) = 41
write(2, "system: No child processes\n", 27system: No child processes
) = 27
rt_sigaction(SIGINT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0xbfffcad8) = 13410
waitpid(13410, .
..
ignore.txt
ignore2
ignore2.c
ignore4
ignore4.c
ignore_st.log
[WIFEXITED(s) && WEXITSTATUS(s) == 0], 0) = 13410
rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
write(2, "system result[ 98] = 0 errno=[  "..., 34system result[ 98] = 0 errno=[  0]) = 34
write(2, "system: Success\n", 16system: Success
)       = 16
rt_sigaction(SIGINT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0xbfffcad8) = 13411
waitpid(13411, .
..
ignore.txt
ignore2
ignore2.c
ignore4
ignore4.c
ignore_st.log
[WIFEXITED(s) && WEXITSTATUS(s) == 0], 0) = 13411
rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
write(2, "system result[ 99] = 0 errno=[  "..., 34system result[ 99] = 0 errno=[  0]) = 34
write(2, "system: Success\n", 16system: Success
)       = 16
exit_group(0)                           = ?
$

If waitpid() is called before the child process has run to completion,
waitpid() succeeds and system() returns 0.
Otherwise waitpid() fails with errno set to ECHILD, and system() returns -1.

This contradicts the Single UNIX Specification:
http://www.opengroup.org/onlinepubs/009695399/functions/waitpid.html
======
If the calling process has SA_NOCLDWAIT set or has SIGCHLD set to SIG_IGN, and 
the process has no unwaited-for children that were transformed into zombie 
processes, the calling thread shall block until all of the children of the 
process containing the calling thread terminate, and wait() and waitpid() shall 
fail and set errno to [ECHILD]. 
======

- With kernel 2.6 the result is always ECHILD. It conforms to the Single UNIX
  Specification.
- With kernel 2.4 the result is always 0, because this behavior is not
  implemented there. It doesn't conform to the specification.
- Only with the RHEL3 kernel is the result inconsistent: sometimes 0 and
  sometimes ECHILD. The RHEL3 kernel is an imperfect backport.


Comment 29 Red Hat Bugzilla 2006-03-15 16:29:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html


Comment 33 Masaki MAENO 2006-03-17 07:58:38 UTC
> You may reopen this bug report if the solution does not work for you.
We confirmed that the problem still reproduces on a freshly installed RHEL3 U7
(kernel-2.4.21-40.ELsmp).

(a) Only with the RHEL3 kernel is the result inconsistent: sometimes 0 and
    sometimes ECHILD. The RHEL3 kernel is an imperfect backport from kernel 2.6.
(b) With kernel 2.6 the result is always ECHILD. It conforms to the Single UNIX
    Specification.
(c) With kernel 2.4 the result is always 0, because this behavior is not
    implemented there. It doesn't conform to the specification.

We are convinced that the problem is in the kernel's wait() implementation.
Please have a Red Hat engineer confirm this by running test.c (comment #23, or
below) on RHEL3 U7 (inconsistent result) and RHEL4 U3 (always ECHILD).

$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <errno.h>
#include <unistd.h>
 
int main(int argc, char *argv[])
{
    int i, rc;
 
    signal(SIGCHLD, SIG_IGN);
 
    for (i=0; i<100; i++) {
        //rc = system("/bin/zcat a.gz > b.txt");
        rc = system("ls -a");
        fprintf(stderr, "system result[%3d] = %x errno=[%3d]", i, rc, errno);
        perror("system");
        errno = 0;
    }
 
    return 0;
}

$ gcc -o test test.c

$ ./test 2>&1 | tee test01.txt
$ ./test 2>&1 | tee test02.txt
$ ./test 2>&1 | tee test03.txt
$ ./test 2>&1 | tee test04.txt
    .....
 ^
 |
simultaneous multiple execution (about 10 - 10000)


Comment 34 Masaki MAENO 2006-03-17 08:04:46 UTC
Please reopen this issue, and read comments #23 and #33 carefully to understand the problem.

Comment 37 Tim Burke 2006-05-08 11:59:47 UTC
The codepath in which this bug occurs is known to be a very sensitive and
fragile section of code.  Given where the RHEL3 product is in its lifecycle we
are not receptive to making changes in this space. The risk of introducing
regressions, or incompatibility issues is too high.  Consequently, closing this
issue as wontfix.

This problem does not exist in RHEL4.