A repeatable, but relatively rare, error occurs on my system (Dell 410 Workstation, Pentium III, 450 MHz, TAG #QAQK) when executing a bash command substitution, e.g., LN=$(/bin/echo "abc"). The error, reported from bash (subst.c:2533), is:

`Can't reopen pipe to command substitution (fd 4): No child processes'

The following scripts have been and are being used to gather statistics on the frequency of occurrence of this error.

------------------------------------------------------------
#! /bin/sh
# nnx - script that generates/detects bash command substitution error
# $1 = error count file
COUNT=0
while true
do
    C1=0
    until [ $C1 -gt 999 ]
    do
        LN=$(/bin/echo "abc")    # The command substitution
        if [ -z "$LN" ]          # Empty string returned on command substitution error
        then
            echo "$COUNT$C1" >> $1
            echo 2
        fi
        C1=$[$C1 + 1]
    done
    COUNT=$[$COUNT + 1]
done
---------------------------------------------------------------------
#! /bin/sh
# nxxd - driver for nxx
# $1 = error_count file
# runs until terminated with ^C
while true
do
    nnx $1
done
---------------------------------------------------------------------

Typical use is: `nxxd error_count_file &'

When run under kernel-2.2.5-15, the average command substitution error frequency was found to be 1 in 119383 (the average of the counts in the error_count_file, 22 samples).

Variations on the experiment:

1. I have also installed kernel-2.2.12-20 (from RedHat 6.1) to see if the problem exists there. It does, and with increased frequency (1 in 52169, on 21 samples).

2. I have explored possible timing sensitivities by placing a delay loop in the nxx script and rerunning it under kernel-2.2.12-20. See ADDENDUM#1, below, for the code (d_nxx). The frequency of error is reduced about ten-fold (from 1 in 52169 to 1 in 587466). Increasing the delay count (changing 19 to 49 in `until [ $j -gt 19 ]') appears to reduce the error frequency even further.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CONJECTURES and CONCLUSION:

1. A software race (compiler induced?) exists in the control of /proc/self/fd. This race is probably not in bash and is likely to be in the kernel.

2. The fact that the added delay (item 2 above) decreases the frequency of the error makes hardware-induced failure less likely.

It seems important to test this on another machine running the same software (something I don't have easy access to). The problem, if not limited to my system, is serious: most bash (sh) scripts use command substitution, and correct system operation depends on the correct operation of many scripts.

Please let me know if I can supply other information to help solve this problem.

Thank you,
Harold Knudsen,
Emeritus Professor, Computer Science, University of New Mexico
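
For anyone repeating the experiment, here is a minimal sketch of how the "1 in N" average could be computed from the error count file. It assumes the file holds one integer count per line, as written by the `echo ... >> $1` line in the script above. The function name average_counts is hypothetical and is not part of the original scripts.

```sh
#!/bin/sh
# average_counts FILE - hypothetical helper, not part of the original scripts.
# Prints the number of samples in FILE and the mean of the recorded counts.
# Assumes one integer count per line; blank lines are skipped.
average_counts() {
    awk 'NF { sum += $1; n++ }
         END {
             if (n > 0) printf "%d samples, mean 1 in %d\n", n, sum / n
             else       print "no samples"
         }' "$1"
}
```

Typical use would be `average_counts error_count_file' after sourcing the file that defines the function. The mean is truncated to an integer.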
Created attachment 49 [details] System info and Addendum#1
Here are some more data points. Both are on RH6.2 with all errata updates.

Kernel 2.2.16-3. P-III 700, 128 MB RAM
bash-1.14.7-22: 1 failure in 6,283,000 substitutions

Kernel 2.2.16-3 rebuilt with advanced routing enabled (but not used). Celeron 400, 128 MB RAM
bash-1.14.7-22: 22 failures in 3,600,000 substitutions
bash2-2.03-8: 0 failures in 4,640,000 tests

---------

This bug is *very* annoying when it occurs on long, unattended software builds. It drives me near insane when an overnight build stalls during a kernel build, or an autoconf run is messed up, resulting in miscompiled software.
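
To put these data points in the same "1 in N" form used in the original report, a small sketch along the following lines may help. The helper name one_in is hypothetical. It only divides the total number of substitutions by the number of failures, truncating to an integer.

```sh
#!/bin/sh
# one_in FAILURES TOTAL - hypothetical helper for comparing failure rates.
# Prints the rate as "1 in N", or a note when no failures were seen.
one_in() {
    awk -v f="$1" -v t="$2" 'BEGIN {
        if (f > 0) printf "1 in %d\n", t / f
        else       print "no failures observed"
    }'
}

one_in 1 6283000     # P-III 700, bash-1.14.7-22
one_in 22 3600000    # Celeron 400, bash-1.14.7-22
one_in 0 4640000     # bash2-2.03-8
```

With the figures above, the Celeron result works out to about 1 in 163636.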
Created attachment 34962 [details] Possible work-around for bug in bash (or libc or ??).
Created attachment 35021 [details] Explanation of my previously attached patch
Created attachment 35041 [details] More information indicating the underlying (actual) bug
See bug 14781. I had similar problems with my nightly compile. However, this error has disappeared since RedHat 7.0.
*** Bug 12184 has been marked as a duplicate of this bug. ***