Bug 212111 - make-3.81-1.1 isn't parallel build safe
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: make
Version: 6
Hardware: All Linux
Priority: medium, Severity: high
Assigned To: Petr Machata
QA Contact: Brian Brock
Whiteboard: bzcl34nup
Depends On:
Blocks: 211290 418441
Reported: 2006-10-25 00:46 EDT by H.J. Lu
Modified: 2015-05-04 21:32 EDT

Fixed In Version: F-8
Doc Type: Bug Fix
Last Closed: 2008-04-04 09:28:14 EDT

Attachments
A patch (1015 bytes, patch), 2006-10-26 14:10 EDT, H.J. Lu
An updated patch (1.17 KB, text/x-patch), 2006-10-26 16:05 EDT, H.J. Lu
A new patch (5.49 KB, patch), 2006-10-27 09:09 EDT, H.J. Lu
Fix for this problem. (515 bytes, patch), 2007-09-24 09:43 EDT, Petr Machata

Description H.J. Lu 2006-10-25 00:46:33 EDT
When I was building glibc 2.5 from CVS with "make -j4" on dual processor
machines, I got

make[3]: *** No rule to make target
`/export/build/gnu/glibc-nptl-local/build-x86_64-linux/iconv/charmap.o', needed
by `others'.  Stop.
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory `/export/gnu/src/glibc/libc/iconv'
make[2]: *** [iconv/others] Error 2
make[2]: Leaving directory `/export/gnu/src/glibc/libc'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/export/build/gnu/glibc-nptl-local/build-x86_64-linux'

"make -j1" worked fine. make-3.80-10.2 from FC5 has no problem with "make -j4".
It happens on both x86 and x86-64.
Comment 1 Petr Machata 2006-10-25 10:52:43 EDT
Hmm, so it's reproducible, right?
Unfortunately 3.81-1.1 seems to work fine for me so far; I tried it on
several x86-based machines and it just works.
Comment 2 H.J. Lu 2006-10-25 12:12:54 EDT
I have seen this problem on 2-processor x86, x86-64 and ia64 machines when
I use "make -j4" to build glibc 2.5. How did you build glibc? Can you
show me your /proc/cpuinfo?
Comment 3 H.J. Lu 2006-10-25 14:49:07 EDT
I even saw it with "make -j2" on a single processor x86 machine.
Comment 4 H.J. Lu 2006-10-25 16:44:25 EDT
The command I used is

make -jN PARALLELMFLAGS="-jN"

where N == 2 x NUM_OF_CPUs.
Comment 5 H.J. Lu 2006-10-25 20:39:12 EDT
To reproduce it, after glibc build is done, in glibc build directory:

[hjl@gnu-25 build-x86_64-linux]$ rm -rf iconv
[hjl@gnu-25 build-x86_64-linux]$ make  -j4 PARALLELMFLAGS=-j4 > make.log
make[2]: warning: -jN forced in submake: disabling jobserver mode.
make[2]: warning: -jN forced in submake: disabling jobserver mode.
mkdir /export/build/gnu/glibc-nptl-local/build-x86_64-linux/iconv
gconv_open.c: In function ‘__gconv_open’:
gconv_open.c:59: warning: ‘ptr’ may be used uninitialized in this function
gconv_open.c: In function ‘__gconv_open’:
gconv_open.c:59: warning: ‘ptr’ may be used uninitialized in this function
make[2]: warning: -jN forced in submake: disabling jobserver mode.
[the same warning repeated 54 more times]
No rule to make target
`/export/build/gnu/glibc-nptl-local/build-x86_64-linux/iconv/charmap.o', needed
by `others'
make[1]: *** [iconv/others] Aborted (core dumped)
make: *** [all] Error 2
[hjl@gnu-25 build-x86_64-linux]$
Comment 6 H.J. Lu 2006-10-26 14:10:34 EDT
Created attachment 139496
A patch

The problem is that when start_job_command closes job_fds, it doesn't set
them to -1. Then the same fd is returned by opendir. Later it is used
for a pipe again. From there, everything goes downhill.
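
The descriptor reuse described here can be illustrated with a small sketch. This is not code from make: the function name reused_fd_after_close is made up for the illustration, and it relies only on the POSIX rule that pipe() and open-style calls hand out the lowest unused descriptor numbers. In a single-threaded process, once both ends of a pipe are closed, the next opendir should therefore land on one of the old numbers while the job_fds array still holds them.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <unistd.h>

/* Hypothetical helper, not part of make: close a freshly created pipe,
   open a directory stream, and report whether the stream's descriptor
   reuses one of the pipe's old numbers.
   Returns 1 if a number was reused, 0 if not, -1 on setup failure. */
int reused_fd_after_close(void)
{
    int job_fds[2];
    if (pipe(job_fds) != 0)
        return -1;

    int old_rd = job_fds[0];
    int old_wr = job_fds[1];
    close(job_fds[0]);
    close(job_fds[1]);
    /* job_fds[] still holds the stale numbers: nothing reset them to -1. */

    DIR *dir = opendir(".");
    if (dir == NULL)
        return -1;

    int fd = dirfd(dir);
    int reused = (fd == old_rd || fd == old_wr);
    closedir(dir);
    return reused;
}
```

Any later close(job_fds[0]) made through the stale copy would then act on whatever object currently owns that number.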
Comment 7 H.J. Lu 2006-10-26 16:05:08 EDT
Created attachment 139522
An updated patch

clean_jobserver should also set job_fds to -1 after closing them.
Comment 8 H.J. Lu 2006-10-26 19:09:30 EDT
My patch breaks the jobserver. I am looking into it now.
Comment 9 Petr Machata 2006-10-27 07:20:19 EDT
Got it reproduced now, the trick was the PARALLELMFLAGS="-jN" part, I think.
Investigating.
Comment 10 H.J. Lu 2006-10-27 09:09:45 EDT
Created attachment 139578
A new patch

Here is a patch to fix it. When make re-execs itself, it calls
clean_jobserver which may close job_fds.  After it is re-execed,
it reads job_fds from jobserver_fds again and closes them when
jobserver mode is disabled.  It closes the same fd twice. The
second time it closes the wrong file. This patch sets
jobserver_fds_invalid_flag after closing job_fds and checks it
before closing job_fds.
Comment 11 Petr Machata 2006-10-27 11:41:51 EDT
Nice work, many thanks!
Comment 12 Paul Smith 2006-10-29 01:02:58 EST
I need to understand the situation better.  I'm not convinced that the patch
provided is actually correct.  The only time clean_jobserver() closes the FDs is
when it is invoked by the master (top-level) make instance.  The master make
instance never has the --jobserver-fds flag in its command line, since that is
an internal flag that is added by make itself when it invokes sub-makes.  When
the master make instance finishes building makefiles and re-execs itself, it
considers itself as a brand new instance of make (which it is) and re-opens the
jobserver pipes and hands those new values to its children.  I've annotated the
source and done some tests and verified all these things.

There must be something else unexpected going on here if the problem is as you
describe it.  I can't find a "normal" code path that results in the behavior
described in comment #10.

I rather suspect it has something to do with the PARALLELMFLAGS="-jN" variable.
 How is this variable actually used in the makefiles?
Comment 13 H.J. Lu 2006-10-29 16:03:51 EST
Make handles error conditions poorly, which hides the real problem and
results in misleading error messages. In this particular case, we have

#define ENULLLOOP(_v,_c)   do{ errno = 0; \
                               while (((_v)=_c)==0 && errno==EINTR); }while(0)


      ENULLLOOP (d, readdir (dir->dirstream));
      if (d == 0)
        break;

Because of the bug under discussion, dir->dirstream now refers to a pipe
instead of a directory. But ENULLLOOP doesn't check for ENOTDIR at all. Better
error handling would make it easier to identify where the real problem is. At
the least it should do

     if (d == 0)
       {
          assert (errno == 0);
          break;
       }
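
As an alternative to the assert, a caller can report the failure instead of aborting. The following is only a sketch under that assumption, not a proposed change to dir.c: count_entries is a hypothetical helper that keeps the EINTR retry from ENULLLOOP but treats any other nonzero errno after a NULL return as an error, since POSIX readdir leaves errno unchanged at a normal end of directory.

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: count the entries remaining in an open stream.
   Returns the count, or -1 with errno set if readdir() failed for any
   reason other than being interrupted. */
long count_entries(DIR *dir)
{
    long n = 0;

    for (;;) {
        struct dirent *d;

        /* Same idea as ENULLLOOP: clear errno, retry only on EINTR. */
        do {
            errno = 0;
            d = readdir(dir);
        } while (d == NULL && errno == EINTR);

        if (d == NULL) {
            if (errno != 0)
                return -1;   /* real error, e.g. the fd is not a directory */
            break;           /* errno still 0: clean end of directory */
        }
        n++;
    }
    return n;
}
```

A caller would check for -1 and print strerror(errno), which would have pointed at the descriptor problem rather than at a missing rule.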
Comment 14 Paul Smith 2006-10-30 07:37:44 EST
The purpose of these special loops is to try to work properly on systems where
SA_RESTART does not work universally (Solaris, for example, is such a system). 
On those systems virtually any system call can be interrupted, so this loop is
intended to mask that in the code.  The macros are _not_ supposed to handle
general error conditions beyond EINTR.  However, you're absolutely correct that
the code after the macro should look for error conditions.
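
A minimal sketch of that division of labour, assuming a plain read() call: the wrapper below (read_retry_eintr is a made-up name, not a make function) hides only EINTR, and every other error is passed back for the caller to handle.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: retry read() only when a signal interrupted it.
   Any other failure is returned to the caller with errno intact. */
ssize_t read_retry_eintr(int fd, void *buf, size_t len)
{
    ssize_t r;

    do {
        r = read(fd, buf, len);
    } while (r == -1 && errno == EINTR);

    return r;
}
```

The caller still has to test the result: a return of -1 here means a genuine error such as EBADF, not an interrupted call.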

I still fail to understand exactly what the bug is.  It seems that someone must
understand what's going on since a patch has been produced, but there is no
description of what's happening and it's not obvious (to me) from the patch.  I
_especially_ don't understand how this dir.c code relates to the original bug,
and how dir->dirstream could be referring to a pipe rather than a directory (!)
 If that's true there's something _really_ wrong internally, as far as I can
make out.  If someone could jot down even a high-level description of how the
bug is triggered that would help me a lot.  Cheers!
Comment 15 H.J. Lu 2006-10-30 09:50:19 EST
I can't reproduce it with a simple testcase. I can only see it when
building glibc with "make -jN PARALLELMFLAGS=-jN" on both single processor
and multiple processor machines. The glibc Makefiles call make with

$(MAKE) $(PARALLELMFLAGS) ...

and

$(MAKE) -r PARALLELMFLAGS="$(PARALLELMFLAGS)" ...

Basically, every single make invocation is called with -jN. Do you have
a Linux machine to build glibc?
Comment 16 Paul Smith 2006-10-31 08:50:38 EST
Hm.  With GNU make 3.81beta4, which comes on my Ubuntu Dapper box, the bug
doesn't happen.  But with the latest CVS version of GNU make, I can reproduce
it.  I'll look into it.
Comment 17 starlight 2006-11-01 14:19:16 EST
I'm chiming in for the first time here.  I think the problem is this

http://www.gnu.org/software/make/manual/make.html#Archive-Pitfalls

Seems to me the solution would be to enhance make to recognize
that archive members are part of the same target.  I'm assuming
that no more than one job will be scheduled for the same target
at the same time in the current logic, but if not that would be
needed too.
Comment 18 starlight 2006-11-01 14:46:15 EST
BTW an interim work-around for this problem is to remove the 
implicit archive rule and replace it with a single explicit 
archive step for each archive.  These steps should not use the 
archive member syntax.  Since I'm not expecting a fix tomorrow,
that's what I'm doing.
Comment 19 Paul Smith 2006-11-02 01:07:36 EST
I really don't think that this problem is related at all to the archive issues.
 It works fine in beta4 and fails in the release.  The issue you discuss here
has existed forever; it's even documented in the manual to work this way. 
Further, make has no problem per se with this: it's just your output archive
that is bolloxed up.  In this bug, make actually can't figure out how to build
something.  If this problem is fixed by the changes you're making I can only
assume that it's a coincidence.

If we're talking about workarounds the best one is quite simple: stop adding
PARALLELMFLAGS=-jN to your make invocations!  The entire point of the jobserver
feature in GNU make is that all recursive instances share the same pool of jobs,
so forcing every sub-make to have N jobs kind of defeats the purpose.  Please
understand I'm not at all saying that this behavior is not a bug or that it
shouldn't be fixed.  I have some debug logs and I'll look at them.  I'm just
saying that if you use the jobserver as designed you won't see this failure.

Cheers!
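
For readers unfamiliar with the shared pool mentioned above, here is a rough sketch of the idea, not make's actual code. It assumes the usual description of the jobserver as a pipe holding one byte per spare job slot: the top-level make preloads N-1 tokens (it owns one slot implicitly), and any make takes a byte before starting an extra job and writes it back afterwards. The type and function names are hypothetical, and the read end is made non-blocking purely so that an empty pool can be reported instead of blocking.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Illustrative token pool; names are hypothetical, not make's. */
typedef struct {
    int rd;   /* read end: take a token */
    int wr;   /* write end: give a token back */
} jobserver;

/* Create the pipe and preload slots - 1 tokens; the caller itself
   counts as the remaining slot.  Returns 0 on success, -1 on failure. */
int jobserver_init(jobserver *js, int slots)
{
    int fds[2];
    int flags;

    if (slots < 1 || pipe(fds) != 0)
        return -1;

    /* Simplification for this sketch: non-blocking read end. */
    flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1)
        goto fail;

    for (int i = 0; i < slots - 1; i++)
        if (write(fds[1], "+", 1) != 1)
            goto fail;

    js->rd = fds[0];
    js->wr = fds[1];
    return 0;

fail:
    close(fds[0]);
    close(fds[1]);
    return -1;
}

/* Returns 1 if a token was taken, 0 if the pool is empty, -1 on error. */
int jobserver_try_acquire(jobserver *js, char *token)
{
    ssize_t r = read(js->rd, token, 1);

    if (r == 1)
        return 1;
    if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}

/* Return a token to the pool.  Returns 0 on success, -1 on failure. */
int jobserver_release(jobserver *js, char token)
{
    return write(js->wr, &token, 1) == 1 ? 0 : -1;
}
```

With one pool inherited by every sub-make, the total number of running jobs stays bounded by N. Forcing -jN on each sub-make instead gives each one its own pool, which is what the "disabling jobserver mode" warning reports.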
Comment 20 starlight 2006-11-02 01:11:09 EST
Another problem is multiple targets.  See

http://docs.sun.com/source/806-3573/Dmake.html

GNU make lacks the "+" construct available in Sun 'dmake'.
I was able to work around it by placing all the multiple
target rules in a separate tree and running a separate
single-threaded pass for them.
Comment 21 Petr Machata 2007-09-24 09:20:21 EDT
(In reply to comment #12)
> When the master make instance finishes building makefiles and re-execs itself, 
> it considers itself as a brand new instance of make (which it is) and re-opens 
> the jobserver pipes and hands those new values to its children.  I've 
> annotated the source and done some tests and verified all these things.

The sequence of events is as follows:

start sub-make with -j2 on commandline and --jobserver-fds=3,4 in MAKEFLAGS
make opens lots of DIR streams (not important right now)
warning: -jN forced in submake: disabling jobserver mode => close(3); close(4)
open new job_fds through pipe (fds 3, 4 again, these fds are now free)
make decides it has to reexec
close(3); close(4) in clean_jobserver to clean up after ourselves

restart sub-make with -j2 on commandline and --jobserver-fds=3,4 in MAKEFLAGS
open lots of dirs, don't close one of them fd=4, because it's not exhausted yet
warning: -jN forced in submake: disabling jobserver mode => close(3); close(4)
 => oops! 3 is invalid, 4 refers to directory stream!
 => make[2]: *** No rule to make target etc. etc.

Does it make sense?
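
The second half of that sequence can be replayed in a single process as a sketch. Everything here is a stand-in: replay_stale_close is a made-up function, the stale array plays the role of the --jobserver-fds numbers inherited across the re-exec, and the descriptor numbers are whatever the system hands out rather than 3 and 4. It assumes a single-threaded process and an opendir that does not read entries until the first readdir, as on glibc.

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical replay of the stale-close sequence.
   Returns the errno left by the failing readdir(), 0 if readdir()
   did not fail, or -1 on setup failure. */
int replay_stale_close(void)
{
    int stale[2];   /* numbers remembered from before the "re-exec" */
    int fresh[2];   /* the new jobserver pipe */
    int err;

    /* The old pipe is gone, but its numbers are still recorded. */
    if (pipe(stale) != 0)
        return -1;
    close(stale[0]);
    close(stale[1]);

    /* "open lots of dirs": the stream lands on a recycled number. */
    DIR *dir = opendir(".");
    if (dir == NULL)
        return -1;

    /* "disabling jobserver mode => close(3); close(4)": one of these
       closes the directory stream's descriptor, the other fails. */
    close(stale[0]);
    close(stale[1]);

    /* "open new job_fds through pipe": the pipe takes the freed number,
       so the DIR stream now sits on top of a pipe end. */
    if (pipe(fresh) != 0) {
        closedir(dir);
        return -1;
    }

    errno = 0;
    err = (readdir(dir) == NULL) ? errno : 0;

    closedir(dir);      /* closes whatever now owns the stream's number */
    close(fresh[0]);    /* one of these was already closed by closedir */
    close(fresh[1]);
    return err;
}
```

On Linux the value returned is expected to be ENOTDIR, which matches the observation in comment 13; the sketch itself only reports whatever errno readdir leaves behind.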
Comment 22 Petr Machata 2007-09-24 09:43:13 EDT
Created attachment 204131
Fix for this problem.

The patch leaves the whole make logic intact, but only actually calls the
problematic `close' if it's the first iteration of make, i.e. when the number
of make restarts is zero. Opinions? Does it break anywhere?
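
The attached patch itself is not reproduced in this report, so the following is only a guess at the shape of such a guard, written as a standalone sketch. The function drop_inherited_jobserver and its parameters are hypothetical; restarts stands for a counter of how many times make has re-exec'ed itself.

```c
#include <unistd.h>

/* Hypothetical guard: forget the inherited jobserver descriptors, but
   only really close them on the first run (restarts == 0).  After a
   re-exec the recorded numbers may already belong to something else.
   Returns the number of descriptors actually closed. */
int drop_inherited_jobserver(int job_fds[2], unsigned int restarts)
{
    int closed = 0;

    if (restarts == 0) {
        for (int i = 0; i < 2; i++)
            if (job_fds[i] >= 0 && close(job_fds[i]) == 0)
                closed++;
    }

    /* Either way, stop remembering the numbers. */
    job_fds[0] = -1;
    job_fds[1] = -1;
    return closed;
}
```

With restarts nonzero the call leaves every open descriptor alone, so a directory stream that happens to sit on a recycled number survives.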
Comment 23 Bug Zapper 2008-04-04 00:05:37 EDT
Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers
Comment 24 Petr Machata 2008-04-04 09:28:14 EDT
This is fixed in rawhide and F8.
