Bug 8266

Summary: File descriptor loss: bash command substitution
Product: [Retired] Red Hat Linux
Reporter: Harold Knudsen <hkk>
Component: bash
Assignee: Michael K. Johnson <johnsonm>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium    
Version: 6.0
CC: linuxcub
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2001-10-26 06:12:24 UTC
Attachments:
Description                                                 Flags
System info and Addendum#1                                  none
Possible work-around for bug in bash (or libc or ??).       none
Explanation of my previously attached patch                 none
More information indicating the underlying (actual) bug     none

Description Harold Knudsen 2000-01-07 16:43:27 UTC
A repeatable, but relatively rare, error occurs on my system (Dell 410
Workstation, Pentium III, 450 MHz, TAG #QAQK) when executing a bash
command substitution, e.g., LN=$(/bin/echo "abc").  The error, reported
from bash (subst.c:2533) is: `Can't reopen pipe to command substitution
(fd 4): No child processes'.
The following scripts have been and are being used to gather statistics
on the frequency of occurrence of this error.
------------------------------------------------------------
#! /bin/sh
# nnx - script that generates/detects bash command substitution error
# $1 = error count file
COUNT=0
while true
do
   C1=0
   until [ $C1 -gt 999 ]
   do
      LN=$(/bin/echo "abc") # The command substitution
      if [ -z "$LN" ] # Empty string returned on command substitution error
      then
        echo "$COUNT$C1" >> "$1"
        echo 2
      fi
      C1=$[$C1 + 1]
   done
   COUNT=$[$COUNT + 1]
done

---------------------------------------------------------------------
#! /bin/sh
# nxxd - driver for nxx
# $1 = error_count file
# runs until terminated with ^C
while true
do
  nnx $1
done

---------------------------------------------------------------------
Typical use is: `nxxd error_count_file &'
When run under kernel-2.2.5-15, the average command substitution error
frequency was found to be 1 in 119383 (the average of the counts in the
error_count_file, 22 samples).
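
The averaging step can be sketched as follows. This is only an
illustration, not the method actually used for the figures above: the
helper name average_counts is hypothetical, and it assumes the
error_count_file holds one integer count per line.

```shell
#!/bin/sh
# average_counts FILE - print the sample count and the integer mean of
# the counts in FILE (hypothetical helper, not part of nnx or nxxd).
# Assumes one non-negative integer per line; blank lines are skipped.
average_counts() {
    awk 'NF { sum += $1; n++ }
         END {
             if (n > 0) {
                 printf "%d samples, 1 error in %d\n", n, sum / n
             } else {
                 print "no samples"
             }
         }' "$1"
}
```

Typical use would be: `average_counts error_count_file'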

Variations on the experiment:
1. I have also installed kernel-2.2.12-20 (from Red Hat 6.1) to see if
   the problem exists there.  It does, and with increased frequency (1 in
   52169, on 21 samples).
2. I have explored possible timing sensitivities by placing a delay loop
   in the nxx script, and rerunning it under kernel-2.2.12-20.  See
   ADDENDUM#1, below, for the code (d_nxx).  The frequency of error is
   reduced about ten-fold (from 1 in 52169 to 1 in 587466).
   Increasing the delay count (changing 19 to 49 in `until [ $j -gt 19 ]')
   appears to reduce the error frequency even further.
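
For readers without the attachment, a delay loop of the kind described
in item 2 might look like the sketch below.  It is not the d_nxx code
from ADDENDUM#1: the function name delay is hypothetical, only the
`until [ $j -gt 19 ]' test is taken from the text, and where the delay
sits relative to the command substitution is an assumption.

```shell
#!/bin/sh
# delay LIMIT - busy-wait by counting j from 0 until it exceeds LIMIT.
# Hypothetical helper; the real d_nxx may be structured differently.
delay() {
    j=0
    until [ "$j" -gt "$1" ]
    do
        j=$((j + 1))
    done
}

LN=$(/bin/echo "abc")   # the command substitution under test
delay 19                # 19 is the limit quoted above; 49 lengthens the delay
```

The loop runs no external commands, so it adds time between
substitutions without creating additional child processes.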
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CONJECTURES and CONCLUSION:
1. A software race (compiler induced?) exists in the control of
   /proc/self/fd.  This race is probably not in bash and is likely to be
   in the kernel.
2. The fact that the added delay (item 2. above) decreases the frequency
   of the error makes hardware induced failure less likely.

It seems important to test this on another machine running the same
software (something I don't have easy access to).

The problem, if not limited to my system, is serious--most bash (sh)
scripts use command substitution and correct system operation
depends on the correct operation of many scripts.

Please let me know if I can supply other information in helping to
solve this problem.
Thank you,
Harold Knudsen, Emeritus Professor, Computer Science,
University of New Mexico

Comment 1 Harold Knudsen 2000-01-07 16:48:59 UTC
Created attachment 49 [details]
System info and Addendum#1

Comment 2 Damien Miller 2000-08-11 02:59:24 UTC
Here are some more data points:

Both are on RH6.2 with all errata updates

Kernel 2.2.16-3. P-III 700, 128 MB RAM
bash-1.14.7-22: 1 failure in 6,283,000 substitutions

Kernel 2.2.16-3 rebuilt with advanced routing enabled (but not used). Celeron
400, 128 MB RAM
bash-1.14.7-22: 22 failures in 3,600,000 substitutions
bash2-2.03-8: 0 failures in 4,640,000 tests

---------

This bug is *very* annoying when it occurs on long, unattended software builds.
It drives me near insane when an overnight build stalls during a kernel build or
an autoconf run is messed up, resulting in miscompiled software.


Comment 3 Erling Jacobsen 2001-10-24 22:37:05 UTC
Created attachment 34962 [details]
Possible work-around for bug in bash (or libc or ??).

Comment 4 Erling Jacobsen 2001-10-25 09:52:44 UTC
Created attachment 35021 [details]
Explanation of my previously attached patch

Comment 5 Erling Jacobsen 2001-10-25 11:29:18 UTC
Created attachment 35041 [details]
More information indicating the underlying (actual) bug

Comment 6 paulh 2001-10-26 06:12:19 UTC
See bug 14781. I had similar problems with my nightly compile. However, this
error has disappeared since Red Hat 7.0.



Comment 7 Phil Knirsch 2002-07-23 08:43:24 UTC
*** Bug 12184 has been marked as a duplicate of this bug. ***