Bug 168775

Summary: wait() and waitpid() return inconsistencies under high load
Product: Red Hat Enterprise Linux 4
Reporter: Ionut Leonte <ileonte>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4.0
CC: drepper, mihai, mingo
Hardware: i686
OS: Linux
Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Last Closed: 2006-03-07 20:08:28 UTC
Bug Blocks: 168429
Attachments:
Description: vfork() a child which fork()'s another
Flags: none

Description Ionut Leonte 2005-09-20 09:59:19 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
On certain occasions the wait() and waitpid() calls erroneously fail with ECHILD (No child processes)
even though a child process was successfully (v)fork()-ed and is still running.

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-11.EL

How reproducible:
Always

Steps to Reproduce:
1. compile the attached program:
        gcc -W -Wall -g3 -O0 -o test-bin forktest.c -lpthread

2. create a script called 'test-thread' with the following contents:
        <-------------------- CUT HERE ----------------------->
        #!/bin/bash
        for i in `seq 1000`; do ./test-bin 1 ; done
        <--------------------- END CUT ----------------------->

3. create a second script called 'test-main' with the following contents:
        <-------------------- CUT HERE ----------------------->
        #!/bin/bash

        for i in `seq 20`; do
            ./test-thread &
        done

        read ABCD

        killall -9 test-thread
        killall -9 test-bin
        <--------------------- END CUT ----------------------->

4. execute the following command:
        ./test-main | grep ">"


Actual Results:  .......................................................................
>>>>>>>>>>>>>>>>>>>>>>>>>> run_level1() waitpid surprise: No child processes
>>>>>>>>>>>>>>>>>>>>>>>>>> kill( [PID_OF_CHILD], 0 ): 0 Success
...................... (repeated) .....................................


Expected Results:  The output of 'test-main' should not contain any lines with '>' characters...

Additional info:

1. The call to kill() always succeeds, which confirms that the process exists and is running.

2. The problem seems to occur only under high load. The rate at which processes are spawned also seems to matter: the slower the rate, the harder it is to reproduce the problem.

The previous kernel version (kernel-smp-2.6.9-5.EL) does not seem to be affected by this issue.
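
For readers without access to the attachment, the check described above can be sketched roughly as follows. This is NOT forktest.c: the function name wait_check_once() is made up for illustration, and it uses a single plain fork() instead of the vfork()/fork() chain and threads in the attached program. It only shows the pattern of calling waitpid() and, on ECHILD, probing the child with kill(pid, 0).

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from forktest.c.
 * Forks one short-lived child and waits for it.
 * Returns  0 if waitpid() reaped the child,
 *          1 if waitpid() failed with ECHILD while kill(pid, 0) still
 *            found the process (the symptom described in this report),
 *         -1 on any other failure.
 */
int wait_check_once(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: stay alive for about 1 ms, then exit. */
        struct timespec ts = { 0, 1000000L };
        nanosleep(&ts, NULL);
        _exit(0);
    }

    /* Parent: wait for this specific child, retrying on EINTR. */
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);

    if (r == pid)
        return 0;

    int wait_errno = errno;
    printf(">>> waitpid surprise: %s\n", strerror(wait_errno));
    if (wait_errno != ECHILD)
        return -1;

    /* Signal 0 sends nothing; it only checks that the pid exists. */
    int k = kill(pid, 0);
    int kill_errno = (k == 0) ? 0 : errno;
    printf(">>> kill(%ld, 0): %d %s\n", (long)pid, k, strerror(kill_errno));
    return (k == 0) ? 1 : -1;
}
```

Calling wait_check_once() in a loop from main() and counting the non-zero returns gives a rough equivalent of grepping for ">" in the output. Note that ECHILD is the expected result if the caller has set SIGCHLD to SIG_IGN, so the sketch assumes the default SIGCHLD disposition.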

Comment 1 Ionut Leonte 2005-09-20 10:02:51 UTC
Created attachment 119018 [details]
vfork() a child which fork()'s another

note the (sometimes) incorrect behaviour of waitpid() in run_level1()

Comment 4 Jason Baron 2005-10-10 19:10:03 UTC
Thanks for the test case. This looks a lot like bug 166454. In fact, I'm going
to proactively dup it. We can un-dup it later if I'm wrong.

*** This bug has been marked as a duplicate of 166454 ***

Comment 5 Mihai Maties 2005-10-11 11:21:42 UTC
We did some more testing (at BitDefender) using the Fedora Development kernels
and found that the last kernel version we tried did not have this bug.
Unfortunately, I do not remember the precise version of the FC Devel kernel
we used, but I can give you a hint: it was released around the time we
submitted this bug.

Comment 6 Jason Baron 2005-10-12 17:13:44 UTC
Can you please try -22.3 at: http://people.redhat.com/~jbaron/rhel4/ thanks.

Comment 7 Mihai Maties 2005-10-13 13:32:20 UTC
I can confirm that the bug is gone in the -22.3 release. 

Comment 9 Red Hat Bugzilla 2006-03-07 20:08:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html