Bug 1473242

Summary: kernel BUG at mm/page_alloc.c:1877
Product: [Fedora] Fedora
Reporter: Ian Chapman <packages>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED UPSTREAM
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 26
CC: frank, gansalmon, ggdavisiv, ichavero, itamar, jonathan, kernel-maint, labbott, madhu.chinakonda, mchehab, packages, sigurdur, trevor
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2018-05-05 07:51:57 UTC
Type: Bug
Attachments (description, flags):
3 Kernel Bug Dumps (flags: none)
Kernel Dump with XFS (flags: none)
Reproduced "kernel BUG at mm/page_alloc.c:1902!" using 4.15.4-300.fc27.x86_64 (flags: none)

Description Ian Chapman 2017-07-20 09:55:57 UTC
Created attachment 1301639
3 Kernel Bug Dumps

Description of problem:

When performing backups using rsync to an ext4 filesystem on top of LUKS and LVM, the system eventually locks up completely and dumps a kernel BUG trace. I've attached several of those traces. Note that even though some of the traces are marked as "tainted", there is also an untainted trace. This has been happening for some time, perhaps since the 4.9 or 4.10 kernel series under F25. I've only just tracked down what was going on.

I've tried all sorts of BIOS settings, memory tests, disk tests, disabling swap, changing the destination disk, etc., but none of that has made any difference.


Version-Release number of selected component (if applicable):

Linux rex.homenet.lan 4.11.10-300.fc26.x86_64 #1 SMP Wed Jul 12 17:05:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

rsync-3.1.2-5.fc26.x86_64
cryptsetup-1.7.5-1.fc26.x86_64
lvm2-2.02.168-6.fc26.x86_64

Steps to Reproduce:
1. See above

Additional info:

Please let me know what other output you require. Thanks.

Comment 1 Ian Chapman 2017-07-20 11:02:30 UTC
Created attachment 1301665
Kernel Dump with XFS

It seems ext4 is not to blame: the same occurs when using XFS.

Comment 2 Laura Abbott 2017-07-20 14:44:37 UTC
I suspect the memory layout of your particular system is tripping something. Since it's been so consistent, I'd recommend reporting this bug upstream. You can use linux-mm@kvack.org or file a bug at bugzilla.kernel.org, which will also get converted into an e-mail.

Comment 3 Ian Chapman 2017-07-22 10:34:55 UTC
Reported upstream as kernel.org bug 196443.

Comment 5 Sigurd Urdahl 2017-09-01 13:44:40 UTC
For what it is worth: I have been hit by the same bug running CentOS 7 with kernel 3.10.0-514.26.2.el7.x86_64.

I have hooked my findings onto Ian's upstream bug.

Comment 6 Ian Chapman 2017-12-26 03:57:37 UTC
I'm not so sure it is the same bug, Sigurd. For what it's worth, this issue persists with 4.14.6-200.fc26, yet 4.10.17 remains rock solid.

Is there anything that can be done, Laura?

Comment 7 Laura Abbott 2018-02-28 03:57:55 UTC
We apologize for the inconvenience. There is a large number of bugs to go through, and several of them have gone stale. The kernel moves very fast, so bugs may get fixed as part of a kernel update. Due to this, we are doing a mass bug update across all of the Fedora 26 kernel bugs.

Fedora 26 has now been rebased to 4.15.4-200.fc26. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 27 and are still experiencing this issue, please change the version to Fedora 27.

If you experience different issues, please open a new bug report for those.

Comment 8 George G. Davis 2018-02-28 21:00:11 UTC
Created attachment 1402124
Reproduced "kernel BUG at mm/page_alloc.c:1902!" using 4.15.4-300.fc27.x86_64

I'm still seeing this issue in Fedora 27 using the latest kernel-4.15.4-300.fc27.x86_64 on the same S1200BTL/S1200BTL-based hardware. This is intended to confirm that the issue still exists in Fedora 27 when using the latest kernel et al. (dnf up as of today).

Comment 9 George G. Davis 2018-04-03 01:02:54 UTC
It looks like the F27 updates-testing kernel-4.15.14-300.fc27 should finally resolve this BUG, since it includes the following upstream v4.16 commits, which I've already confirmed resolve this BUG for me when using the recent F27 kernel-vanilla-mainline kernel-4.16.0-0.rc7.git1.1.vanilla.knurd.1.fc27.x86_64:

920a9205d268 mm/memblock.c: hardcode the end_pfn being -1
f981611c4ae3 Revert "mm: page_alloc: skip over regions of invalid pfns where possible"

I've installed kernel-4.15.14-300.fc27 and will give it a test to see if this is fixed as of v4.15.14 with the above commits included.

Comment 10 George G. Davis 2018-04-03 12:23:57 UTC
After running `du`, `rsync`, and `bitbake` stress tests in parallel, the BUG no longer occurs when using the F27 updates-testing kernel-4.15.14-300.fc27.x86_64. In the past I wouldn't have dared to run these stress tests in parallel for fear of triggering the BUG. So I'm happy to report that this is now resolved by F27 updates-testing kernel-4.15.14-300.fc27.x86_64.

Comment 11 Fedora End Of Life 2018-05-03 08:09:07 UTC
This message is a reminder that Fedora 26 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 26. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '26'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 26 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 12 Trevor Cordes 2018-05-05 05:09:07 UTC
Ian, you can close this and the upstream bug (CLOSED UPSTREAM). I can't do it; only the owner (or a bz admin) can.

For posterity, most of the details are in the upstream bz https://bugzilla.kernel.org/show_bug.cgi?id=196443

We never did figure out, though, why it hits so few motherboard models.

Comment 13 Ian Chapman 2018-05-05 07:51:57 UTC
Fixed upstream.