Bug 1297188 - Random boot hangs in initial ramdisk, CPU lockups while loading modules
Status: CLOSED EOL
Product: Fedora
Classification: Fedora
Component: kernel
Version: 23
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2016-01-10 06:53 EST by Andrej Podzimek
Modified: 2016-12-20 12:45 EST
Doc Type: Bug Fix
Clone Of: 1296972
Last Closed: 2016-12-20 12:45:47 EST
Type: Bug

Attachments
Screen photos of kernel backtraces on boot hangs. (18.63 MB, application/x-xz)
2016-01-10 06:53 EST, Andrej Podzimek
/etc/default/grub from which /boot/efi/EFI/fedora/grub.cfg is generated (508 bytes, text/plain)
2016-01-10 07:07 EST, Andrej Podzimek
dmesg from a successful boot (75.88 KB, text/plain)
2016-01-10 08:10 EST, Andrej Podzimek
/etc/fstab (1.17 KB, text/plain)
2016-01-10 08:35 EST, Andrej Podzimek
Hardware listing information from various tools. (9.65 KB, application/x-xz)
2016-01-10 08:40 EST, Andrej Podzimek
Screenshots capturing the first bug/oops/panic messages (9.14 MB, application/x-xz)
2016-01-17 03:16 EST, Andrej Podzimek


External Trackers
Tracker ID Priority Status Summary Last Updated
Linux Kernel 105251 None None None 2016-01-18 04:35 EST

Description Andrej Podzimek 2016-01-10 06:53:00 EST
Created attachment 1113297 [details]
Screen photos of kernel backtraces on boot hangs.

Description of problem:
On a laptop (Dell Inspiron 7548/0AM6R0, BIOS A05 07/20/2015, CPU i7-5500U), I'm seeing random boot hangs during the initial ramdisk phase. None of these events get recorded in system logs, because they occur before the root filesystem can be mounted. The problem affects both the current kernel 4.2.8-300.fc23.x86_64 and the previous 4.2.7 one.

A few screen photos (with rhgb quiet removed from the kernel command line) are attached. All events appear to be related to memory management while kernel modules are being loaded. All the backtraces occur in the bootloader/ramdisk phase. On successful boot attempts, there are *no* such oopses/bugs/glitches whatsoever -- neither in the initial ramdisk phase, nor later during normal operation.

During a boot failure, the machine is stuck in a series of soft lockups for a few minutes, only to reach a hard lockup afterwards. These events occur at random (roughly in 1/3 of boots). Other boots are perfectly normal. Once the system is up and running, there are no problems -- I can easily run it for a week, compile stuff in parallel, play 4k videos, play games on both the Intel GPU and the Radeon GPU (DRI_PRIME=1 etc.), suspend and resume the machine a number of times -- all works fine. But it has to boot first... :-(

Additional info:
Because the failures occur at random and in different processes, I suspected a race condition of some kind. Therefore I added udev.children-max=1 and rd.udev.children-max=1 to GRUB_CMDLINE_LINUX and regenerated both grub.cfg and dracut's images. But this tweak did *not* help and I was still seeing the lockups in unmap_pte_range during roughly 1/3 of boot attempts. Quite likely there are many sources of parallelism in systemd components other than udevd...
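
A minimal sketch of that command-line change, assuming a stock Fedora /etc/default/grub in which GRUB_CMDLINE_LINUX is a single double-quoted line. The helper name add_kernel_opts is hypothetical (it is not part of any Fedora tool), and the regeneration commands in the trailing comments have to be run as root:

```shell
#!/bin/sh
# add_kernel_opts FILE OPT...
# Hypothetical helper: appends each OPT to the GRUB_CMDLINE_LINUX="..." line
# of FILE unless that exact option is already present.
# Assumes the value sits on one line in double quotes and that OPT contains
# no '|', '&' or backslash characters (they would confuse sed).
add_kernel_opts() {
    file=$1
    shift
    for opt in "$@"; do
        # Extract the current value between the quotes.
        current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
        # Skip the option if it is already there as a whole word.
        case " $current " in
            *" $opt "*) continue ;;
        esac
        # Append the option just before the closing quote (GNU sed -i).
        sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX=\"\1 $opt\"|" "$file"
    done
}

# Usage on the real system (as root), followed by regeneration:
#   add_kernel_opts /etc/default/grub udev.children-max=1 rd.udev.children-max=1
#   grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
#   dracut --regenerate-all --force
```

Running the helper twice is harmless, because options that are already present are skipped.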
Comment 1 Andrej Podzimek 2016-01-10 07:07 EST
Created attachment 1113298 [details]
/etc/default/grub from which /boot/efi/EFI/fedora/grub.cfg is generated

/etc/default/grub from the failing system. The LVM and LUKS options shouldn't need to be there, but they don't seem to have any impact on the occurrence of the problems.
Comment 2 Andrej Podzimek 2016-01-10 07:16:38 EST
Here are a few facts about the system I forgot to mention. I have no idea whether they matter, listing them just for the record:

1. It uses SecureBoot in the default/strict mode (so suspend-to-disk won't work, but whatever, suspend-to-ram works fine).
2. It uses LVM, LUKS and Btrfs. No other filesystems are used (besides the one inside the initial ramdisk).
3. SELinux is on and enforcing, autorelabel works fine etc.
4. This is my disk layout (quite a common one that I use on all my laptops and that has been working fine in Arch for some time already):

* SSD disk with a GPT partition table on it
  * GPT EFI system partition, 512 MB, vfat, mounts into /boot/efi
  * GPT partition across the rest of the disk with an LVM PV on it
    * LVM LV for /boot, 512 MB, Btrfs
    * LVM LV across the rest of the LVM PV
      * LUKS container across the entire LVM LV, with an LVM PV inside it
        * LVM logical volume for swap
        * LVM logical volume for everything else, Btrfs
          * subvolume for / (mounted explicitly from fstab)
          * subvolume for /var (mounted explicitly from fstab)
          * subvolume for /home (x-systemd.automount)
          * subvolume for /etc (nested under root subvolume, mounts implicitly)
Comment 3 Andrej Podzimek 2016-01-10 08:10 EST
Created attachment 1113309 [details]
dmesg from a successful boot

This shows a successful boot of the system, working as intended. There are no OOPS/BUG backtraces whatsoever once it gets past the initial ramdisk. It boots all the way to the sddm login manager, connects to an EAP-TLS WiFi network etc. A few things that caught my attention:

(1) This has already been seen in bug 1278976 (and enabling doesn't help):

  DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
  DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.

(2) I can't see any PCI-E power saving options in my UEFI setup; not sure what this means:

  ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
  [...]
  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
  \_SB_.PCI0:_OSC invalid UUID
  _OSC request data:1 1f 0
  acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
Comment 4 Andrej Podzimek 2016-01-10 08:35 EST
Created attachment 1113325 [details]
/etc/fstab

Partitions and mount options, if that happens to matter.
Comment 5 Andrej Podzimek 2016-01-10 08:40 EST
Created attachment 1113326 [details]
Hardware listing information from various tools.

The archive contains the output from the following commands:

* inxi -Fx
* lsblk -o +uuid
* lscpu
* lshw
* lspci -k
* lsusb     # with nothing in the external ports
* lsusb -v  # with nothing in the external ports
* mount
* pydf
Comment 6 Andrej Podzimek 2016-01-10 09:10:41 EST
Notes on some of the kernel command line options I'm using:

* i915.enable_ips=0 -- prevents extreme video flickering (especially in 4K resolutions), likely at the cost of a higher power consumption.

* intel_iommu=igfx_off -- eliminates all the "stale page" bugs in i915/intel and radeon drivers; bugs.freedesktop.org is already aware of this. At the moment I don't run VirtManager+KVM machines, so I don't need the latest iommu features.

* loglevel=3 -- prevents "radeon 0000:08:00.0: VCE init error (-110)." from interrupting the smooth boot splash; bugs.freedesktop.org already has this reported. (One also needs "kernel.printk = 3 3 3 3" in /etc/sysctl.d/99-blah.conf.)

* vga=current -- reduces the number of screen blips between the UEFI splash screen, the hidden Grub's black screen and the Plymouth boot droplet.

Removing any of these boot options doesn't stop the boot lockups from happening. :-( But some of them mitigate or eliminate other failures.
Comment 7 Andrej Podzimek 2016-01-10 16:15:33 EST
Tried various combinations of...
  * intel_iommu=on,igfx_off / intel_iommu=off / intel_iommu=igfx_off
  * intremap=no_x2apic_optout
...but no way, it's still locking up. :-(
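
When cycling through combinations like these, it helps to confirm which options the running kernel actually booted with. A minimal sketch, assuming /proc/cmdline holds the command line as one space-separated line; the helper name has_cmdline_opt is hypothetical:

```shell
#!/bin/sh
# has_cmdline_opt OPT [FILE]
# Hypothetical helper: succeeds if OPT appears as a whole word on the kernel
# command line. FILE defaults to /proc/cmdline and is only a parameter so the
# check can be tried against a saved copy.
has_cmdline_opt() {
    opt=$1
    file=${2:-/proc/cmdline}
    line=
    # Read the single line; tolerate a missing trailing newline.
    read -r line < "$file" || true
    case " $line " in
        *" $opt "*) return 0 ;;
    esac
    return 1
}

# Example: report which of the tested options are in effect right now.
#   for o in intel_iommu=off intel_iommu=igfx_off intremap=no_x2apic_optout; do
#       has_cmdline_opt "$o" && echo "active: $o"
#   done
```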
Comment 8 Andrej Podzimek 2016-01-10 17:39:18 EST
Another try:
  * intel_iommu=off
  * intremap=off
  * "Virtualization" completely disabled in Setup

Result:
  * Still seeing lockup failures, though roughly half as often. :-(
Comment 9 Josh Boyer 2016-01-11 07:17:37 EST
You might want to run memtest86+ on this machine overnight.
Comment 10 Andrej Podzimek 2016-01-11 11:51:38 EST
I've just completed 3 error-free loops of 'memtester 15G'. (That's a userspace memory-checking tool.) So at least 15/16 of my RAM appears to be OK at first glance. ;-)

memtest86+ doesn't work on EFI, so switching to BIOS mode, booting from a flashdisk etc. would be a bit of a hassle. I can do that if need be, but as already said, I have never seen spurious freezes or other issues once the system gets past the initial ramdisk, which leads me to believe that the remaining 1G of RAM would work OK as well.
Comment 11 Andrej Podzimek 2016-01-17 02:52:26 EST
I think I know a workaround. It may be too early for a conclusion, but it's most likely this: https://bugzilla.kernel.org/show_bug.cgi?id=105251#c29 Thus far I've had 50+ successful boots with no hangs.
Comment 12 Andrej Podzimek 2016-01-17 03:16 EST
Created attachment 1115558 [details]
Screenshots capturing the first bug/oops/panic messages

Just a few notes on what I tried while looking for the root cause. Hopefully this may help others if they encounter a similar issue.

To find the right bug report, one needs the right keywords. To get the right keywords, it's good to look at the very first bug/panic messages. That can be done by adding the following sysctl settings and regenerating the dracut images.

kernel.panic = 0
kernel.panic_on_io_nmi = 1
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_warn = 1
kernel.softlockup_panic = 1
kernel.unknown_nmi_panic = 1
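
A minimal sketch of one way to apply these settings, assuming they go into a drop-in under /etc/sysctl.d/ that dracut then picks up when the images are regenerated. The file name 90-early-panic.conf and the helper name write_panic_sysctls are hypothetical:

```shell
#!/bin/sh
# write_panic_sysctls DIR
# Hypothetical helper: writes the panic-related settings listed above into
# DIR/90-early-panic.conf (file name chosen for illustration only).
write_panic_sysctls() {
    dir=$1
    cat > "$dir/90-early-panic.conf" <<'EOF'
kernel.panic = 0
kernel.panic_on_io_nmi = 1
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_warn = 1
kernel.softlockup_panic = 1
kernel.unknown_nmi_panic = 1
EOF
}

# As root, followed by regenerating the initramfs images:
#   write_panic_sysctls /etc/sysctl.d
#   dracut --regenerate-all --force
```

With kernel.panic = 0 the machine stays halted after the first panic instead of rebooting, so the first message remains on screen long enough to photograph.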

While investigating further, I also blacklisted the following kernel modules, just to reduce the number of possible causes:

AMD's GPU computation framework: amdkfd amd_iommu_v2
Something I don't think this particular machine has: dw_dmac dw_dmac_core
Some i2c devices, based on HID+i2c failures in backtraces: i2c_i801 elan_i2c
KVM modules: kvm kvm_intel

With all of the above blacklisted, I also added the following kernel options, mostly based on other, very distantly related reports:

IOMMUs off (plain iommu=off makes USB 3.0 inoperable): intel_iommu=off amd_iommu=off
Interrupt remapping off: intremap=off
An ancient bug in an unused compiled-in driver: ata_piix.disable_driver
Limit module loading parallelism: udev.children-max=1 rd.udev.children-max=1

With all of that set, I was *still* getting backtraces and hangs. But more of them than before were related to the HID subsystem. Googling for that yielded the kernel bug linked above (https://bugzilla.kernel.org/show_bug.cgi?id=105251).

So this only occurs on recent Dells that have a touchscreen. The hid-multitouch module appears to be causing the issue, but only when it's loaded early in the boot process. Loading hid-multitouch later seems to work fine on my system thus far (although reports from other people vary).
Comment 13 Andrej Podzimek 2016-01-17 03:34:06 EST
Here's my current workaround. The goal is to avoid the crash on boot while keeping the touchpad and touchscreen fully operational.

0. Add a modprobe config file called e.g. /etc/modprobe.d/blacklist-hid-multitouch.conf:

    blacklist hid_multitouch

1. Add a systemd service to load the module "manually", e.g., into /etc/systemd/system/load-hid-multitouch.service:

    [Unit]
    Description=Load the blacklisted hid-multitouch module
    Before=display-manager.service

    [Service]
    Type=oneshot
    ExecStart=/usr/sbin/modprobe hid_multitouch

    [Install]
    WantedBy=multi-user.target

2. Regenerate dracut images so that the blacklist takes effect also in initrd:

    dracut --regenerate-all --force

3. Enable the new "late module loading hack" pseudo-service:

    systemctl enable load-hid-multitouch

This works fine for me. I've re-enabled all the modules I had blacklisted and also removed all the kernel options disabling iommu(s). Virtualization is now enabled again (including kvm_intel). I rebooted the machine 50+ times, playing 4K videos both with and without DRI_PRIME=1 and/or suspending and resuming the machine a few times between the reboots. Also did a bit of touchscreen touching each time. Thus far there have been *no* hangs whatsoever. It just works and boots as one would expect.
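
A minimal sketch for checking after a reboot that the late-loading service really brought the module in. It assumes the usual /proc/modules format, where the first field of each line is the module name with underscores; the helper name module_loaded is hypothetical:

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Hypothetical helper: succeeds if NAME is listed as a loaded module.
# FILE defaults to /proc/modules and is only a parameter so the check can be
# tried against a saved copy.
module_loaded() {
    name=$1
    file=${2:-/proc/modules}
    # The first whitespace-separated field on each line is the module name.
    while read -r mod rest; do
        [ "$mod" = "$name" ] && return 0
    done < "$file"
    return 1
}

# After booting with the workaround in place:
#   module_loaded hid_multitouch && echo "hid_multitouch is loaded"
#   systemctl status load-hid-multitouch
# To check that the blacklist file made it into the initramfs, listing the
# image contents should show it (assuming dracut copies /etc/modprobe.d/*.conf):
#   lsinitrd | grep blacklist-hid-multitouch
```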

The kernel version has changed since my initial report -- I have 4.3.3-300 now. But that doesn't seem to matter at all. According to the kernel bugzilla, many different kernels in a number of distros, including Arch, Fedora and Ubuntu, showed the same symptoms, always somehow related to the early loading of hid-multitouch.

Can this be marked as a duplicate of a bug from the kernel bugzilla? https://bugzilla.kernel.org/show_bug.cgi?id=105251

Phew, this was tough! :-)
Comment 14 Benjamin Tissoires 2016-01-18 04:35:20 EST
Hmm, I would think that hid-multitouch is just the trigger that makes a lower-level component (the i2c bus) hang the kernel. In the long term, we will want to fix whatever hangs, but I doubt that hid-multitouch is actually responsible.

(In reply to Andrej Podzimek from comment #13)
> Can this be marked as a duplicate of a bug from the kernel bugzilla?
> https://bugzilla.kernel.org/show_bug.cgi?id=105251

Marked as external bug tracker. We cannot, however, close this one as a duplicate, because the issue still exists in Fedora.
Comment 15 Laura Abbott 2016-09-23 15:50:22 EDT
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
 
Fedora 23 has now been rebased to 4.7.4-100.fc23.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
 
If you experience different issues, please open a new bug report for those.
Comment 16 Andrej Podzimek 2016-10-18 10:31:06 EDT
I have Fedora 24 with the latest kernel. (I always keep the system up-to-date.) The hid_multitouch workaround is still needed. I did try to remove it (almost) with each new kernel minor version (or at least whenever I had access to that particular machine), but I still got random hangs with hid_multitouch loaded automatically. Loading hid_multitouch explicitly later in the boot process (using systemd) helps.
Comment 17 Fedora End Of Life 2016-11-24 09:49:37 EST
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 18 Fedora End Of Life 2016-12-20 12:45:47 EST
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
