Bug 1774790

Summary: make: occasional deadlock when using parallel build
Product: Red Hat Enterprise Linux 8 Reporter: Akemi Yagi <toracat>
Component: make Assignee: DJ Delorie <dj>
make sub component: system-version QA Contact: Michal Kolar <mkolar>
Status: CLOSED ERRATA Docs Contact: Oss Tikhomirova <otikhomi>
Severity: medium    
Priority: unspecified CC: ajb, codonell, dj, fweimer, jwright, knweiss, mcermak, otikhomi, phil, vmukhame
Version: 8.1 Keywords: Patch, Triaged
Target Milestone: rc   
Target Release: 8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: make-4.2.1-10.el8 Doc Type: Bug Fix
Doc Text:
.`make` no longer slows down when using parallel builds

Previously, while running parallel builds, `make` sub-processes could become temporarily unresponsive when waiting for their turn to run. As a consequence, builds with high `-j` values slowed down or ran at lower effective `-j` values. With this update, the job control logic of `make` is now non-blocking. As a result, builds with high `-j` values run at full `-j` speed.
Story Points: ---
Clone Of:
: 1785447 Environment:
Last Closed: 2020-04-28 17:03:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Akemi Yagi 2019-11-20 23:58:44 UTC
This bug was initially created as a copy of Bug #1556839

I am copying this bug because: 
This bug is present in RHEL 8 (make-4.2.1-9.el8).


Description of problem:
Parallel make sometimes hangs with processes in zombie state. This happens with large projects, such as a Linux kernel build.

Version-Release number of selected component (if applicable):
make-4.2.1-4.fc27.x86_64


How reproducible:
Occasionally.

Steps to Reproduce:
1. Do a parallel build of a large project like the Linux kernel (make -j8 ...)

Actual results:
The build sometimes hangs, and the process list shows <defunct> processes.
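
One way to check for <defunct> (zombie) processes without reading the process list by eye is sketched below. It assumes a Linux /proc filesystem; the helper names parse_state and find_zombies are hypothetical and are not part of make or of this report.

```python
import os

def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the last ')' before reading the state field.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

def find_zombies(proc_root="/proc"):
    """Return the PIDs whose state is 'Z' (zombie, shown as <defunct> by ps)."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if parse_state(line) == "Z":
            zombies.append(int(entry))
    return zombies
```

Calling find_zombies() repeatedly during a hung build would show whether the same PIDs stay in state Z.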

Expected results:
Build completes

Additional info:
This appears to be a deadlock: the jobserver waits for its children to exit while at least one child is blocked reading from the jobserver pipe.
This bug seems to be known upstream:
https://savannah.gnu.org/bugs/?51159
https://savannah.gnu.org/bugs/?49014 (duplicate)

There seems to be a fix in upstream git:
https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714
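
To illustrate the jobserver pipe mentioned above, here is a minimal Python sketch of a token pipe. It is a simplified model, not make's C implementation: the function names are hypothetical, and the "+" token byte is an assumption chosen for illustration. The point is only that reading from an empty pipe would block unless the descriptor is non-blocking.

```python
import os

def make_jobserver(slots):
    """Create a pipe preloaded with one token byte per available job slot."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"+" * slots)
    return read_fd, write_fd

def try_acquire(read_fd):
    """Take one token if one is available; return None instead of blocking."""
    os.set_blocking(read_fd, False)
    try:
        token = os.read(read_fd, 1)
    except BlockingIOError:
        return None  # pipe is empty: a blocking read would hang right here
    return token or None  # b"" means the write end was closed

def release(write_fd, token):
    """Return a token to the pipe so another job can start."""
    os.write(write_fd, token)

if __name__ == "__main__":
    r, w = make_jobserver(2)
    first = try_acquire(r)
    second = try_acquire(r)
    third = try_acquire(r)  # no tokens left
    print(first, second, third)
    release(w, first)
    print(try_acquire(r))
```

With a blocking descriptor, the third read would wait until some other process wrote a token back, which is the waiting side of the deadlock described above.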

Comment 1 Akemi Yagi 2019-11-21 00:09:32 UTC
As described in the original Fedora bug report, building the kernel using make-4.2.1-9.el8 causes a number of defunct processes to be created. They are eventually reaped by the time the %build step completes. However, they can significantly prolong the build time.

The upstream patch referenced in the description section does fix the issue.

Comment 2 Alan Bartlett 2019-11-22 22:54:12 UTC
Some example timings for kernel builds, -j8 (eight fold parallel).

On a fully up-to-date RHEL-8.1 system with make-4.2.1-9.el8.x86_64, many defunct processes are observed.

Elapsed time: 71m 4s
Elapsed time: 54m 38s
Elapsed time: 83m 49s

With the referenced patch [1] applied to make, no defunct processes are observed.

Elapsed time: 23m 10s

[1] https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714
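
For a rough sense of scale, the timings above can be compared as follows. This is only arithmetic on the four figures listed in this comment; there is a single patched run, so the ratio is indicative rather than a measurement.

```python
def to_seconds(minutes, seconds):
    """Convert an elapsed time given as minutes and seconds to seconds."""
    return minutes * 60 + seconds

# Elapsed times with make-4.2.1-9.el8 (unpatched), from this comment.
unpatched = [to_seconds(71, 4), to_seconds(54, 38), to_seconds(83, 49)]
# Elapsed time with the referenced patch applied.
patched = to_seconds(23, 10)

mean_unpatched = sum(unpatched) / len(unpatched)
speedup = mean_unpatched / patched
print(f"mean unpatched: {mean_unpatched:.0f} s, patched: {patched} s, "
      f"ratio: {speedup:.2f}x")
```

On these numbers the unpatched builds take about three times as long on average as the patched one.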

Comment 3 Akemi Yagi 2019-12-09 21:02:41 UTC
The make bug attracted the attention and interest of Linus Torvalds. Quoting one of his posts on lkml.org:

https://lkml.org/lkml/2019/12/9/674

[quote]
[ Added DJ to the participants, since he seems to be the Fedora make
maintainer - DJ, any chance that this absolutely horrid 'make' bug can
be fixed in older versions too, not just rawhide? The bugfix is two
and a half years old by now, and the bug looks real and very serious ]

On Mon, Dec 9, 2019 at 1:54 AM Vincent Guittot
<vincent.guittot> wrote:
>
> Which version of make should I use to reproduce the problem ?

So the problematic one is "make-4.2.1-13.fc30.x86_64" in Fedora 30.
I'm assuming it's fairly plain 4.2.1, but I didn't try to look into
the source rpm or anything like that.

The working one for me was just the top of -git from

    https://git.savannah.gnu.org/git/make.git

which is 4.2.92 right now.

The fix is presumably commit b552b05 ("[SV 51159] Use a non-blocking
read with pselect to avoid hangs") as per Akemi. That is indeed after
4.2.1, and it looks real.
(snip snip)
But sadly, there's no way I can push that fair pipe wakeup thing as
long as this horribly buggy version of make is widespread.

                 Linus
[/quote]
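
The commit title quoted above mentions a non-blocking read combined with pselect. The sketch below shows the general shape of that idea in Python. It is not the upstream C code: Python's standard library has no pselect, so select.select with a timeout stands in for it, and the function names are hypothetical. The waiter never sits in a blocking read, so it gets regular chances to reap exited children.

```python
import os
import select

def reap_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped

def acquire_or_reap(read_fd, timeout=0.1):
    """Wait up to `timeout` seconds for a token; reap children if none arrives.

    Returns the token byte, or None when no token could be read.
    """
    os.set_blocking(read_fd, False)
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if ready:
        try:
            token = os.read(read_fd, 1)
            if token:
                return token
        except BlockingIOError:
            pass  # another process took the token first
    reap_children()  # use the idle moment to clear exited children
    return None
```

A caller would loop on acquire_or_reap until it returns a token. The upstream fix, going by its commit title, uses pselect rather than a timeout; this sketch approximates that behavior by polling.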

Comment 5 DJ Delorie 2019-12-10 16:39:59 UTC
rawhide test build:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1420394

Comment 6 Akemi Yagi 2019-12-10 17:14:45 UTC
That test build is for Fedora. Please provide a patched make for RHEL 8 so that we can test.

Comment 10 Akemi Yagi 2019-12-18 17:52:17 UTC
Support Case #02541226.

Comment 15 Michal Kolar 2020-02-05 13:57:35 UTC
Verified against make-4.2.1-10.el8.
SanityOnly because of unstable reproducer. Required patch was successfully applied.

Comment 20 Oss Tikhomirova 2020-03-27 00:38:40 UTC
Thank you a lot, DJ.

Comment 22 errata-xmlrpc 2020-04-28 17:03:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1911