This bug was initially created as a copy of Bug #1556839.

I am copying this bug because: this bug is present in RHEL 8 (make-4.2.1-9.el8).

Description of problem:
Parallel make sometimes hangs, leaving processes in the zombie state. This happens with large projects such as the kernel.

Version-Release number of selected component (if applicable):
make-4.2.1-4.fc27.x86_64

How reproducible:
Occasionally.

Steps to Reproduce:
1. Do a parallel build of a large project such as the Linux kernel (make -j8 ...).

Actual results:
The build sometimes hangs, and the process list shows <defunct> processes.

Expected results:
The build completes.

Additional info:
This appears to be a deadlock: the jobserver waits for its children to exit, but at least one child is trying to read from the jobserver pipe.

This bug appears to be known upstream:
https://savannah.gnu.org/bugs/?51159
https://savannah.gnu.org/bugs/?49014 (duplicate)

There appears to be a fix in upstream git:
https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714
As described in the original Fedora bug report, building the kernel with make-4.2.1-9.el8 leaves a number of defunct processes behind. They are eventually reaped by the time the %build step completes, but they can significantly prolong the build. The upstream patch referenced in the description does fix the issue.
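To put a number on the defunct processes while a build runs, one option is to sample the process table periodically. The following is a minimal Python sketch, assuming a Linux system where every process exposes /proc/<pid>/stat (see proc(5)); process_state and count_defunct are hypothetical helper names, not part of make or any tool mentioned in this bug. A zombie (<defunct>) process reports the state letter Z in the third field of that file.

```python
import os


def process_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or ')', so split after the LAST ')' instead of on spaces.
    """
    return stat_line.rpartition(")")[2].split()[0]


def count_defunct(proc_root="/proc", name=None):
    """Hypothetical helper: count zombie ('Z') processes under proc_root.

    If name is given, only processes whose command name matches are counted.
    """
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        comm = line[line.index("(") + 1:line.rindex(")")]
        if process_state(line) == "Z" and (name is None or comm == name):
            count += 1
    return count
```

Calling count_defunct() every few seconds while make -j8 runs gives a rough time series of zombies; interactively, `ps -eo pid,stat,comm` shows the same state letter.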
Some example timings for kernel builds with -j8 (eight-fold parallelism):

A fully up-to-date RHEL-8.1 system with make-4.2.1-9.el8.x86_64; many defunct processes are observed.
Elapsed time: 71m 4s
Elapsed time: 54m 38s
Elapsed time: 83m 49s

With the referenced patch [1] applied to make, no defunct processes are observed.
Elapsed time: 23m 10s

[1] https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714
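A rough comparison of the timings reported for the -j8 kernel builds, worked through in Python. Only one patched run was reported, so the ratios below are indicative rather than a benchmark; the variable names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert an 'NNm SSs' elapsed time into seconds."""
    return minutes * 60 + seconds


# Elapsed times reported for make-4.2.1-9.el8 (unpatched), -j8.
unpatched = [to_seconds(71, 4), to_seconds(54, 38), to_seconds(83, 49)]
# The single run reported with the upstream patch applied.
patched = to_seconds(23, 10)

mean_unpatched = sum(unpatched) / len(unpatched)
# How many times slower each unpatched run was than the patched run.
speedups = [t / patched for t in unpatched]
```

The three unpatched runs average about 4190 s (roughly 69m 50s) against 1390 s for the patched run, so the patched build was roughly 2.4x to 3.6x faster, depending on which unpatched run it is compared with.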
The make bug attracted the attention and interest of Linus Torvalds. Quoting one of his posts on lkml.org: https://lkml.org/lkml/2019/12/9/674

[quote]
[ Added DJ to the participants, since he seems to be the Fedora make maintainer - DJ, any chance that this absolutely horrid 'make' buf can be fixed in older versions too, not just rawhide? The bugfix is two and a half years old by now, and the bug looks real and very serious ]

On Mon, Dec 9, 2019 at 1:54 AM Vincent Guittot <vincent.guittot> wrote:
>
> Which version of make should I use to reproduce the problem ?

So the problematic one is "make-4.2.1-13.fc30.x86_64" in Fedora 30. I'm assuming it's fairly plain 4.2.1, but I didn't try to look into the source rpm or anything like that.

The working one for me was just the top of -git from https://git.savannah.gnu.org/git/make.git which is 4.2.92 right now.

The fix is presumably commit b552b05 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs") as per Akemi. That is indeed after 4.2.1, and it looks real.

(snip snip)

But sadly, there's no way I can push that fair pipe wakeup thing as long as this horribly buggy version of make is widespread.

Linus
[/quote]
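The upstream commit title, "Use a non-blocking read with pselect to avoid hangs", points at a general pattern: wait until the jobserver pipe is readable, then attempt a non-blocking read, and go back to waiting if another process took the token first, rather than sleeping forever inside a blocking read(). The sketch below illustrates only that pattern and is not make's actual code. Assumptions: it is Python rather than make's C; it uses select.select because the Python standard library has no pselect, so the atomic signal-mask handling pselect provides is not reproduced; and acquire_token, release_token, and the timeout parameter are hypothetical names added for illustration.

```python
import os
import select


def acquire_token(read_fd, timeout=None):
    """Hypothetical helper: take one token from a jobserver-style pipe.

    Assumes read_fd is the read end of the pipe, already switched to
    non-blocking mode. Returns the token byte, or None if the (illustrative)
    timeout expires before a token can be taken.
    """
    while True:
        # Sleep until the pipe looks readable. Several waiting processes can
        # be woken for the same single token.
        ready, _, _ = select.select([read_fd], [], [], timeout)
        if not ready:
            return None  # timed out; a caller could reap children here
        try:
            # Non-blocking read: if another process already took the token,
            # this raises BlockingIOError (EAGAIN) instead of hanging.
            data = os.read(read_fd, 1)
        except BlockingIOError:
            continue  # lost the race for the token; wait again
        if data:
            return data
        raise EOFError("jobserver pipe was closed")


def release_token(write_fd, token):
    """Hypothetical helper: give a token back to the jobserver pipe."""
    os.write(write_fd, token)
```

For example, after `r, w = os.pipe()`, `os.set_blocking(r, False)` and `os.write(w, b"+")`, a call to `acquire_token(r, timeout=1.0)` returns `b"+"`; a second call returns None once the timeout expires because the pipe is empty, and no caller is ever stuck inside read().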
Rawhide test build: https://koji.fedoraproject.org/koji/buildinfo?buildID=1420394
That test build is for Fedora. Please provide a patched make for RHEL 8 so that we can test.
Support Case #02541226.
Verified against make-4.2.1-10.el8. Marked SanityOnly because the reproducer is unreliable. The required patch was confirmed to be applied.
Thank you a lot, DJ.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1911