Bug 654822

Summary: GNU make hanging at end of build
Product: Red Hat Enterprise Linux 6 Reporter: Anthony Green <green>
Component: make Assignee: Petr Machata <pmachata>
Status: CLOSED NOTABUG QA Contact: qe-baseos-tools-bugs
Severity: medium Docs Contact:
Priority: low    
Version: 6.0 CC: mnewsome
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-11-25 12:10:35 UTC Type: ---
Attachments:
Description Flags
srpm that won't finish building none

Description Anthony Green 2010-11-18 20:11:21 UTC
Description of problem:
I'm porting a package from Fedora to EPEL.  However, when I build the attached srpm on RHEL 6.0, it appears as though GNU make is hanging on a read (based on strace output).

Version-Release number of selected component (if applicable):
make-3.81-19.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. rpmbuild --rebuild autogen-5.11-1.fc15.src.rpm

Actual results:
Freezes at the end of make 

Expected results:
Continues on with packaging (ends up failing for other reasons).

Additional info:

Comment 2 Petr Machata 2010-12-09 13:45:16 UTC
Not reproducible with autogen 5.9.4-7, the latest in the Fedora repository.  Could you attach the problematic srpm to this bugzilla?

Comment 3 RHEL Program Management 2011-01-07 15:31:23 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 4 Anthony Green 2011-01-07 16:20:33 UTC
Created attachment 472256 [details]
srpm that won't finish building

Comment 5 Petr Machata 2011-01-10 15:39:53 UTC
It's reproducible with that srpm.  (Even on Fedora with the right make version.)

Comment 6 Petr Machata 2011-01-11 19:52:57 UTC
What seems to be a minimal reproducer:

--checkopt-xx.def--
AutoGen Definitions options;
prog-name = check;
prog-title = "Checkout Automated Options";
flag = { name = e; };

--Makefile--
all-am:
	agen5/autogen checkopt-xx.def # $(MAKE)
	echo DONE

$ make -C doc -r -j2

Notes: the comment with $(MAKE) has to be there; no other variable will do.  It has to be the new autogen that is launched.  In -jN, N must be >1 (presumably to enable the jobserver).

Comment 7 Petr Machata 2011-01-12 23:33:30 UTC
So the minimal stand-alone reproducer is this:

--hang.mf--
run: hang
	+./hang
	echo DONE

--hang.c--
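#include <stdio.h>
#include <unistd.h>   /* fork, execl */

int main(int argc, char ** argv) {
    /* The child becomes a cat that blocks reading stdin; the parent
       exits at once without waiting for it. */
    if (fork() == 0)
      execl("/bin/cat", "/bin/cat", (char *)NULL);
    return 0;
}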

$ make -r -j2 -f hang.mf run
./hang
echo DONE
DONE
# and it hangs, waiting for cat to finish.  Pressing C-d makes cat exit, which ends the hang.

When you remove the initial "+", make doesn't hang.  I don't know what the problem is yet, but that's the essence of the autogen build hang.  In autogen what hangs make is the process "sh".  When rpmbuild hangs, "pstree" shows a pack of sh's rooted right under "init", as the autogen process that launched them died without collecting them.  Killing those sh's un-hangs the build and rpmbuild finishes with error.

The easiest workaround is not to pass %{?_smp_mflags} to make.  I'll look into fixing the make problem next.

Comment 8 Petr Machata 2011-01-14 00:46:11 UTC
When make sees $(MAKE) or an initial + in a recipe line, it assumes that the command will recurse and therefore, in jobserver mode, leaves the jobserver pipe open in the sub-process.  (That pipe is used to coordinate parallel builds across several make instances.)  Before the toplevel make exits, it reads from that pipe and waits for all the synchronization tokens to turn up.  But your build is stuck in some innocent "sh" that has no idea that it's supposed to be part of a recursive build and that it should close those descriptors that it will never use anyway.  So it doesn't, and the toplevel make hangs there indefinitely.
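
The hang follows from ordinary pipe semantics: read() on a pipe reports end-of-file only once every copy of the write end has been closed, so one stray process holding an inherited copy is enough to keep a reader waiting.  Below is a minimal sketch of that effect.  It is not make's code; drain_tokens, its parameters and the DRAIN_* values are hypothetical names made up for illustration.  It uses a non-blocking read so that it can report the would-be hang instead of actually hanging.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DRAIN_WOULD_HANG (-1)  /* a blocking read() would never return */
#define DRAIN_ERROR      (-2)  /* pipe(), write() or fork() failed */

/* Hypothetical helper: put slots-1 tokens into a pipe (slots is assumed
   small, well below the pipe capacity), optionally let a child inherit
   the write end, then close our own write end and read until EOF.
   Returns the number of slots accounted for, or a DRAIN_* value. */
int drain_tokens(unsigned slots, int leak_to_child)
{
    int fds[2];
    int result;
    unsigned i, tokens = 1;    /* the owner's own slot is never in the pipe */
    pid_t child = -1;
    char token = '+';
    ssize_t n;

    if (pipe(fds) != 0)
        return DRAIN_ERROR;

    for (i = 1; i < slots; i++) {
        if (write(fds[1], &token, 1) != 1) {
            close(fds[0]);
            close(fds[1]);
            return DRAIN_ERROR;
        }
    }

    if (leak_to_child) {
        child = fork();
        if (child < 0) {
            close(fds[0]);
            close(fds[1]);
            return DRAIN_ERROR;
        }
        if (child == 0) {
            /* Stands in for the orphaned "sh": it never uses the pipe,
               but it keeps the inherited write end open. */
            for (;;)
                pause();
        }
    }

    close(fds[1]);                        /* our write end is gone ... */
    fcntl(fds[0], F_SETFL, O_NONBLOCK);   /* ... report instead of blocking */

    while ((n = read(fds[0], &token, 1)) == 1)
        tokens++;

    if (n == 0)
        result = (int)tokens;             /* EOF: every write end is closed */
    else if (errno == EAGAIN || errno == EWOULDBLOCK)
        result = DRAIN_WOULD_HANG;        /* somebody still holds a write end */
    else
        result = DRAIN_ERROR;

    close(fds[0]);
    if (child > 0) {
        kill(child, SIGKILL);             /* clean up the stand-in process */
        waitpid(child, NULL, 0);
    }
    return result;
}
```

With these assumptions, drain_tokens(2, 0) sees EOF and accounts for both slots, while drain_tokens(2, 1) reports DRAIN_WOULD_HANG: the token came back, but the read never reaches EOF because the child still holds the write end.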

On the make side, dropping the master_job_slots sanity check in main.c:clean_jobserver gets rid of the problem.

On the autogen side, in doc/Makefile.in, doing something like this gets rid of the recursion trigger:
_MAKE := $(MAKE)
agdoc.texi      : # self-depends upon all executables
	MAKE=$(_MAKE) ./mk-agen-texi.sh
But note that this is just working around the problem.  That variable is being passed down presumably to be used in recursive make invocation, so technically make is right to catch that.

The only upstreamable solution, I think, would be if autogen collected its children.  I don't know why it doesn't; I think I've seen some comments related to SIGCHLD etc. in the code.  FWIW, the shell that stays hanging is opened in agen5/agShell.c:chainOpen.
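
For illustration, "collecting its children" could look roughly like the sketch below: the parent launches a shell and does not return until waitpid() has reaped it.  This is a generic pattern, not autogen's code; run_shell_and_collect is a hypothetical name, and a long-lived shell like the one autogen keeps around would presumably also have to be told to exit before the parent waits for it.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `command` via /bin/sh and wait for it.
   Returns the shell's exit status, or -1 if it could not be started
   or did not exit normally. */
int run_shell_and_collect(const char *command)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                /* reached only if exec failed */
    }

    /* Reap the child; retry if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the function returns only after the child has been reaped, no sh is left behind under init still holding descriptors it inherited from make.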