Bug 168775 - wait() and waitpid() return inconsistencies under high load
Summary: wait() and waitpid() return inconsistencies under high load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168429
Reported: 2005-09-20 09:59 UTC by Ionut Leonte
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 20:08:28 UTC
Target Upstream Version:
Embargoed:


Attachments
vfork() a child which fork()'s another (3.01 KB, text/plain)
2005-09-20 10:02 UTC, Ionut Leonte


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:808 0 normal SHIPPED_LIVE Important: kernel security update 2005-10-27 04:00:00 UTC
Red Hat Product Errata RHSA-2006:0132 0 qe-ready SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 2006-03-09 16:31:00 UTC

Description Ionut Leonte 2005-09-20 09:59:19 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
On certain occasions the wait() and waitpid() calls erroneously fail with ECHILD (No child processes)
even though a child process was successfully (v)fork()-ed and is running.




Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-11.EL

How reproducible:
Always

Steps to Reproduce:
1. compile the attached program:
        gcc -W -Wall -g3 -O0 -o test-bin forktest.c -lpthread

2. create a script called 'test-thread' with the following contents:
        <-------------------- CUT HERE ----------------------->
        #!/bin/bash
        for i in `seq 1000`; do ./test-bin 1 ; done
        <--------------------- END CUT ----------------------->

3. create a second script called 'test-main' with the following contents:
        <-------------------- CUT HERE ----------------------->
        #!/bin/bash

        for i in `seq 20`; do
            ./test-thread &
        done

        read ABCD

        killall -9 test-thread
        killall -9 test-bin
        <--------------------- END CUT ----------------------->

4. execute the following command:
        ./test-main | grep ">"


Actual Results:  .......................................................................
>>>>>>>>>>>>>>>>>>>>>>>>>> run_level1() waitpid surprise: No child processes
>>>>>>>>>>>>>>>>>>>>>>>>>> kill( [PID_OF_CHILD], 0 ): 0 Success
...................... (repeated) .....................................


Expected Results:  The output of 'test-main' should not contain any lines with '>' characters...

Additional info:

1. the call to kill() is always successful, thus confirming that the process exists and is running

2. the problem seems to occur only under high load. The rate at which processes are spawned also seems to be an important factor: the slower the rate, the harder the problem is to reproduce

The previous kernel version (kernel-smp-2.6.9-5.EL) does not seem to be affected by this issue.
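
For readers without access to the attachment, the check described above (waitpid() failing with ECHILD, followed by a kill(pid, 0) probe) can be sketched roughly as follows. This is not the attached forktest.c: the helper names wait_for_child() and run_child() are hypothetical, the message format only imitates the output shown under "Actual Results", and the sketch uses a plain fork() rather than the vfork()/fork() nesting of the real test. It defines helpers only, so a caller supplies its own main().

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the attached forktest.c.
 * Waits for `pid`, retrying when interrupted by a signal.
 * Returns 0 and fills *status on success. On failure it prints a
 * diagnostic, probes the PID with kill(pid, 0) if the error was
 * ECHILD, restores errno from waitpid() and returns -1. */
static int wait_for_child(pid_t pid, int *status)
{
    pid_t r;

    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == pid)
        return 0;

    int saved = errno;
    fprintf(stderr, ">>> waitpid surprise: %s\n", strerror(saved));

    if (saved == ECHILD) {
        /* Signal 0 delivers nothing; it only checks that the PID exists
         * and that we are allowed to signal it. */
        errno = 0;
        int k = kill(pid, 0);
        fprintf(stderr, ">>> kill( %ld, 0 ): %d %s\n",
                (long)pid, k, strerror(errno));
    }

    errno = saved;
    return -1;
}

/* Hypothetical driver: forks a child that exits immediately with
 * `exit_code` and returns the exit status seen by the parent,
 * or -1 if fork() or the wait failed. */
static int run_child(int exit_code)
{
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(exit_code);

    int status;
    if (wait_for_child(pid, &status) != 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Assuming SIGCHLD has its default disposition, run_child(7) is expected to return 7 on a kernel that behaves correctly. The ECHILD branch is the one the report describes firing unexpectedly under load. It can be exercised deliberately by passing a PID that is not a child of the caller, such as the caller's own PID.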

Comment 1 Ionut Leonte 2005-09-20 10:02:51 UTC
Created attachment 119018
vfork() a child which fork()'s another

note the (sometimes) incorrect behaviour of waitpid() in run_level1()

Comment 4 Jason Baron 2005-10-10 19:10:03 UTC
thanks for the test case. this looks a lot like: bug 166454. In fact i'm going
to proactively dup it. We can un-dup it later, if i'm wrong.

*** This bug has been marked as a duplicate of 166454 ***

Comment 5 Mihai Maties 2005-10-11 11:21:42 UTC
We did some more testing (at BitDefender) using the Fedora Development kernels
and found that the last kernel version we tried did not have this bug.
Unfortunately I do not remember the precise version of the FC Devel kernel
we used, but I can give you a hint: it was released around the time we
submitted this bug.

Comment 6 Jason Baron 2005-10-12 17:13:44 UTC
Can you please try -22.3 at: http://people.redhat.com/~jbaron/rhel4/ thanks.

Comment 7 Mihai Maties 2005-10-13 13:32:20 UTC
I can confirm that the bug is gone in the -22.3 release. 

Comment 9 Red Hat Bugzilla 2006-03-07 20:08:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html


