Bug 1666948 - kernel boot hangs after printing one line of output.
Summary: kernel boot hangs after printing one line of output.
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1669846
Depends On:
Blocks:
Reported: 2019-01-17 03:50 UTC by robert.shteynfeld
Modified: 2019-05-28 23:38 UTC (History)
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-28 23:38:10 UTC
Type: Bug


Attachments
video of kernel panic (5.36 MB, video/mpeg)
2019-01-17 20:28 UTC, robert.shteynfeld
video of kernel panic (15.18 MB, video/quicktime)
2019-01-17 20:31 UTC, robert.shteynfeld

Description robert.shteynfeld 2019-01-17 03:50:24 UTC
1. Please describe the problem:

Boot hangs after the kernel is selected in GRUB2, printing only one line: "Probing EDD ... ok".

2. What is the Version-Release number of the kernel:

4.19.13-200.fc28.x86_64, 4.19.14-200.fc28.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Works in 4.19.10.  Appears in 4.19.13.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Happens every time I boot 4.19.13.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Don't know.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

Broadcom wireless and VirtualBox drivers, but those should load much later in boot.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Jeremy Cline 2019-01-17 15:43:10 UTC
Hi Robert,

Can you test kernels 4.19.11 and 4.19.12 as well? Knowing the exact version makes finding the regression much easier. It would also be best to reproduce it without out-of-tree modules.

There are some tips for debugging boot problems here: https://docs.fedoraproject.org/en-US/quick-docs/kernel/troubleshooting/index.html#_boot_failures. It sounds like you've already removed quiet and rhgb, but adding initcall_debug to the kernel command line might also provide some insight.

Finally, what kind of hardware is this?

Thanks!

Comment 2 robert.shteynfeld 2019-01-17 19:28:11 UTC
4.19.11 & 4.19.12 work ok.  With initcall_debug and earlyprintk=vga I can see output, but it scrolls by too quickly.  boot_delay seems to hang the output after the first few iterations.  The last thing I see on the screen is something like "kernel panic", attempting to kill idle thread.  Will try to get more of the output somehow.

This is a Dell Precision T5500 with dual Xeon X5690 CPUs.

Thanks.
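
For reference, the command-line changes discussed above (dropping quiet and rhgb, adding initcall_debug and earlyprintk=vga) can be sketched as a small shell helper. This is only an illustration: `debug_cmdline` is a hypothetical name, and the sample command line passed to it is made up rather than taken from the affected machine.

```shell
# Hypothetical helper: rewrite a kernel command line for boot debugging.
# It drops "quiet" and "rhgb" and appends the debug options from this thread.
debug_cmdline() {
    out=""
    for arg in $1; do
        case "$arg" in
            quiet|rhgb) ;;                      # these hide early boot messages
            *) out="${out:+$out }$arg" ;;       # keep every other argument
        esac
    done
    printf '%s\n' "${out:+$out }initcall_debug earlyprintk=vga"
}

# Sample input is an assumption, not the reporter's real command line.
debug_cmdline "ro root=/dev/mapper/fedora-root rhgb quiet"
# prints: ro root=/dev/mapper/fedora-root initcall_debug earlyprintk=vga
```

For a one-off test, the same edit can be made by hand on the `linux` line after pressing `e` in the GRUB2 menu.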

Comment 3 robert.shteynfeld 2019-01-17 20:28:06 UTC
Created attachment 1521369
video of kernel panic

Comment 4 robert.shteynfeld 2019-01-17 20:31:23 UTC
Created attachment 1521370
video of kernel panic

Comment 5 robert.shteynfeld 2019-01-19 21:32:47 UTC
Blacklisted kernel modules "wl" and "vboxdrv,vboxnetflt,vboxnetadp", but the kernel still crashes.  I see the modules are not loaded if I boot into 4.19.10, so the blacklisting works.

One non-typical piece in my setup is that my root filesystem is an lvm2 cache filesystem, so I have two extra modules added to /etc/dracut.conf.d/lvm2.conf:

add_drivers+=" dm-cache dm-cache-smq"

Didn't cause problems before.
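
One way to confirm that blacklisted modules really stayed unloaded is to look for them in /proc/modules after booting a working kernel. The helper below is a minimal sketch, not the check actually used here; `loaded_modules` is a hypothetical name, and it takes the listing file as its first argument so it can also be run against a saved copy.

```shell
# Hypothetical helper: print each named module that appears in a
# /proc/modules-style listing (module name is the first field of each line).
# Arguments: <listing file> <module>...
loaded_modules() {
    listing=$1
    shift
    for mod in "$@"; do
        if awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"; then
            echo "$mod"
        fi
    done
}

# Example invocation on a running system:
#   loaded_modules /proc/modules wl vboxdrv vboxnetflt vboxnetadp
```

No output means none of the listed modules are loaded.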

Comment 6 tad 2019-01-20 01:36:35 UTC
I have this issue on a similar dual socket system.
It started after 4.19.13 and continues on 4.20.3 currently in testing.

boot_delay did not work for me, but I recorded in slow motion the earlyprintk=vga output.

The relevant part was:
kernel BUG at mm/page_alloc.c:790
The rest was too blurry to make out clearly.

There were only two commits in 4.19.13 that change mm/page_alloc.c:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2830bf6f05fb3e05bc4743274b806c821807a684
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4

I think the latter one is the cause, but I haven't had the time to make a test build with it reverted.
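
A candidate list like this can be produced from a kernel git tree by restricting `git log` to one path between two release tags. The following is a sketch under assumptions: `commits_touching` is a hypothetical helper name, and the clone location in the example is made up.

```shell
# Hypothetical helper: list commits that changed a path between two
# revisions, oldest first.
# Arguments: <repo dir> <last good rev> <first bad rev> <path>
commits_touching() {
    git -C "$1" log --reverse --oneline "$2..$3" -- "$4"
}

# Example invocation (assumes a linux-stable clone in ./linux):
#   commits_touching ./linux v4.19.12 v4.19.13 mm/page_alloc.c
```

Each commit it prints can then be undone in a test build with `git revert <hash>`. Note that backported commits in the stable tree carry different hashes from the mainline IDs linked above.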

Comment 7 robert.shteynfeld 2019-01-23 22:33:25 UTC
It's the first commit above that is causing the problem, not the second.  I removed the first one, and the patched kernel comes up ok.

Comment 8 Hans de Goede 2019-01-24 07:10:47 UTC
Hi,

(In reply to robert.shteynfeld from comment #7)
> It's the first commit above that is causing the problem, not the second.  I
> removed the first one, and the patched kernel comes up ok.

That is good detective work, can you please report this upstream by sending an email to the addresses mentioned in the commit, with linux-kernel@vger.kernel.org in the Cc?

Regards,

Hans

Comment 9 robert.shteynfeld 2019-01-24 14:22:38 UTC
Sent e-mail.

Comment 10 tad 2019-02-01 23:08:25 UTC
I can successfully boot using 4.20.6 available from testing on F29.
Thanks Robert for compiling and going through the emails to get that reverted.

Comment 11 robert.shteynfeld 2019-02-02 01:55:41 UTC
Thanks for your help tad.  kernel.org helps those who help themselves.

Comment 12 Nate Pearlstein 2019-02-03 05:20:01 UTC
*** Bug 1669846 has been marked as a duplicate of this bug. ***

Comment 13 Ben Cotton 2019-05-02 19:56:36 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 28 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 14 Ben Cotton 2019-05-28 23:38:10 UTC
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

