Created attachment 1113297
Screen photos of kernel backtraces on boot hangs.
Description of problem:
On a laptop (Dell Inspiron 7548/0AM6R0, BIOS A05 07/20/2015, CPU i7-5500U), I'm seeing random boot hangs during the initial ramdisk phase. None of these events get recorded in system logs, because they occur before the root filesystem can be mounted. The problem affects both the current kernel 4.2.8-300.fc23.x86_64 and the previous 4.2.7 one.
A few screen photos (with rhgb quiet removed from the kernel command line) are attached. All events appear to be related to memory management while kernel modules are being loaded. All the backtraces occur in the bootloader/ramdisk phase. On successful boot attempts, there are *no* such oopses/bugs/glitches whatsoever -- neither in the initial ramdisk phase, nor later during normal operation.
During a boot failure, the machine is stuck in a series of soft lockups for a few minutes, only to reach a hard lockup afterwards. These events occur at random (roughly in 1/3 of boots). Other boots are perfectly normal. Once the system is up and running, there are no problems -- I can easily run it for a week, compile stuff in parallel, play 4k videos, play games on both the Intel GPU and the Radeon GPU (DRI_PRIME=1 etc.), suspend and resume the machine a number of times -- all works fine. But it has to boot first... :-(
Because the failures occur at random and in different processes, I was suspecting a race condition of some kind. Therefore I added udev.children-max=1 and rd.udev.children-max=1 to GRUB_CMDLINE_LINUX and regenerated both grub.cfg and dracut's images. But this tweak did *not* help and I was still seeing the lockups in unmap_pte_range during roughly 1/3 of boot attempts. There are quite likely many other sources of parallelism in systemd components other than udevd...
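For anyone who wants to repeat this experiment, here is a minimal sketch of the steps. The helper name add_cmdline_opts is made up for illustration, and it assumes GNU sed plus a single-line, double-quoted GRUB_CMDLINE_LINUX entry (the usual Fedora layout). The grub.cfg path is the one used on this system.

```shell
# Hypothetical helper: append options to GRUB_CMDLINE_LINUX in a grub
# defaults file. Assumes the variable sits on one line in double quotes
# and that the options contain no '|' or '&' characters.
add_cmdline_opts() {
    file=$1
    shift
    sed -i "s|^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"|\1 $*\"|" "$file"
}

# Usage (as root), followed by regenerating grub.cfg and the initramfs:
#   add_cmdline_opts /etc/default/grub udev.children-max=1 rd.udev.children-max=1
#   grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
#   dracut --regenerate-all --force
```

After a reboot, /proc/cmdline should list both options if the change took effect.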
Created attachment 1113298
/etc/default/grub from which /boot/efi/EFI/fedora/grub.cfg is generated
/etc/default/grub from the failing system. The LVM and LUKS options shouldn't need to be there, but they don't seem to have any impact on the occurrence of the problems.
Here are a few facts about the system I forgot to mention. I have no idea whether they matter, listing them just for the record:
1. It uses SecureBoot in the default/strict mode (so suspend-to-disk won't work, but whatever, suspend-to-ram works fine).
2. It uses LVM, LUKS and Btrfs. No other filesystems are used (besides the one inside the initial ramdisk).
3. SELinux is on and enforcing, autorelabel works fine etc.
4. This is my disk layout (quite a common one that I use on all my laptops and that has been working fine in Arch for some time already):
* SSD disk with a GPT partition table on it
  * GPT EFI system partition, 512 MB, vfat, mounts into /boot/efi
  * GPT partition across the rest of the disk with a LVM PV on it
    * LVM LV for /boot, 512 MB, Btrfs
    * LVM LV across the rest of the LVM PV
      * LUKS container across the entire LVM LV, with a LVM PV inside it
        * LVM logical volume for swap
        * LVM logical volume for everything else, Btrfs
          * subvolume for / (mounted explicitly from fstab)
          * subvolume for /var (mounted explicitly from fstab)
          * subvolume for /home (x-systemd.automount)
          * subvolume for /etc (nested under root subvolume, mounts implicitly)
Created attachment 1113309
dmesg from a successful boot
This shows a successful boot of the system, working as intended. There are no OOPS/BUG backtraces whatsoever once it gets past the initial ramdisk. It boots all the way to the sddm login manager, connects to an EAP-TLS WiFi network etc. A few things that caught my attention:
(1) This has already been seen in bug 1278976 (and enabling doesn't help):
DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
(2) I can't see any PCI-E power saving options in my UEFI setup; not sure what this means:
ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
\_SB_.PCI0:_OSC invalid UUID
_OSC request data:1 1f 0
acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
Created attachment 1113325
Partitions and mount options, if that happens to matter.
Created attachment 1113326
Hardware listing information from various tools.
The archive contains the output from the following commands:
* inxi -Fx
* lsblk -o +uuid
* lspci -k
* lsusb # with nothing in the external ports
* lsusb -v # with nothing in the external ports
Notes on some of the kernel command line options I'm using:
* i915.enable_ips=0 -- prevents extreme video flickering (especially in 4K resolutions), likely at the cost of higher power consumption.
* intel_iommu=igfx_off -- eliminates all the "stale page" bugs in i915/intel and radeon drivers; bugs.freedesktop.org is already aware of this. At the moment I don't run VirtManager+KVM machines, so I don't need the latest iommu features.
* loglevel=3 -- prevents "radeon 0000:08:00.0: VCE init error (-110)." from interrupting the smooth boot splash; bugs.freedesktop.org already has this reported. (One also needs "kernel.printk = 3 3 3 3" in /etc/sysctl.d/99-blah.conf.)
* vga=current -- reduces the number of screen blips between the UEFI splash screen, the hidden GRUB black screen and the Plymouth boot droplet.
Removing any of these boot options doesn't stop the boot lockups from happening. :-( But some of them mitigate or eliminate other failures.
Tried various combinations of...
* intel_iommu=on,igfx_off / intel_iommu=off / intel_iommu=igfx_off
...but no way, it's still locking up. :-(
* "Virtualization" completely disabled in Setup
* Still seeing lockup failures, though roughly half as often. :-(
You might want to run memtest86+ on this machine overnight.
I've just completed 3 error-free loops of 'memtester 15G'. (That's a userspace memory-checking tool.) So at least 15/16 of my RAM appears to be OK at first glance. ;-)
memtest86+ doesn't work on EFI, so switching to BIOS mode, booting from a flashdisk etc. would be a bit of a hassle. I can do that if need be, but as already said, I have never seen spurious freezes or other issues once the system gets past the initial ramdisk, which leads me to believe that also the remaining 1G of RAM would work OK.
I think I know a workaround. It may be too early for a conclusion, but it's most likely this: https://bugzilla.kernel.org/show_bug.cgi?id=105251#c29 Thus far I've had 50+ successful boots with no hangs.
Created attachment 1115558
Screenshots capturing the first bug/oops/panic messages
Just a few notes on what I tried while looking for the root cause. Hopefully this may help others if they encounter a similar issue.
To find the right bug report, one needs the right keywords. To get the right keywords, it's good to look at the very first bug/panic messages. That can be done by adding the following sysctl settings and regenerating the dracut images.
kernel.panic = 0
kernel.panic_on_io_nmi = 1
kernel.panic_on_oops = 1
kernel.panic_on_stackoverflow = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_warn = 1
kernel.softlockup_panic = 1
kernel.unknown_nmi_panic = 1
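A quick way to check after boot which of these settings are actually in effect is to read them back from /proc/sys/kernel. The sketch below does just that. The function name show_panic_sysctls is made up, and some keys (e.g. panic_on_stackoverflow) only exist when the kernel was built with the matching debug option, so they may show up as not available.

```shell
# Hypothetical helper: print each panic-related sysctl with its current
# value. The optional argument overrides the directory to read from.
show_panic_sysctls() {
    root=${1:-/proc/sys/kernel}
    for key in panic panic_on_io_nmi panic_on_oops panic_on_stackoverflow \
               panic_on_unrecovered_nmi panic_on_warn softlockup_panic \
               unknown_nmi_panic; do
        if [ -r "$root/$key" ]; then
            printf 'kernel.%s = %s\n' "$key" "$(cat "$root/$key")"
        else
            printf 'kernel.%s = (not available)\n' "$key"
        fi
    done
}

# Usage: show_panic_sysctls
```

The settings themselves would typically live in a file under /etc/sysctl.d/ (any name ending in .conf).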
While investigating further, I also blacklisted the following kernel modules, just to reduce the number of possible causes:
AMD's GPU computation framework: amdkfd amd_iommu_v2
Something I don't think this particular machine has: dw_dmac dw_dmac_core
Some i2c devices, based on HID+i2c failures in backtraces: i2c_i801 elan_i2c
KVM modules: kvm kvm_intel
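A blacklist like this can be generated rather than typed by hand. This is only a sketch: the function name make_blacklist and the target file name are made up. Note that a "blacklist" line only stops a module from being auto-loaded via its aliases; it can still be loaded explicitly or as a dependency of another module.

```shell
# Hypothetical helper: emit one modprobe.d "blacklist" line per module name.
make_blacklist() {
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Usage (as root; the file name is arbitrary but must end in .conf):
#   make_blacklist amdkfd amd_iommu_v2 dw_dmac dw_dmac_core \
#       i2c_i801 elan_i2c kvm kvm_intel > /etc/modprobe.d/debug-blacklist.conf
#   dracut --regenerate-all --force
```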
With all of the above blacklisted, I also added the following kernel options, mostly based on other, very distantly related reports:
IOMMUs off (plain iommu=off makes USB 3.0 inoperable): intel_iommu=off amd_iommu=off
Interrupt remapping off: intremap=off
An ancient bug in an unused compiled-in driver: ata_piix.disable_driver
Limit module loading parallelism: udev.children-max=1 rd.udev.children-max=1
With all of that set, I was *still* getting backtraces and hangs. But more of them than before were related to the HID subsystem. Googling for that yielded the kernel bug linked above (https://bugzilla.kernel.org/show_bug.cgi?id=105251).
So this only occurs on recent Dells that have a touchscreen. The hid-multitouch module appears to be causing the issue, but only when it's loaded early in the boot process. Loading hid-multitouch later seems to work fine on my system thus far (although reports from other people vary).
Here's my current workaround. The goal is to avoid crashing on boot while keeping the touchpad and touchscreen fully operational.
0. Add a modprobe config file called e.g. /etc/modprobe.d/blacklist-hid-multitouch.conf:
1. Add a systemd service to load the module "manually", e.g., into /etc/systemd/system/load-hid-multitouch.service:
Description=Load the blacklisted hid-multitouch module
2. Regenerate dracut images so that the blacklist takes effect also in initrd:
dracut --regenerate-all --force
3. Enable the new "late module loading hack" pseudo-service:
systemctl enable load-hid-multitouch
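The file contents for steps 0 and 1 are not reproduced in full above, so here is a minimal sketch of what they could look like. Everything except the two file names and the Description= line is an assumption: the oneshot service type, the modprobe path and the multi-user.target hook are just one plausible way to load the module late. The script stages the files under a scratch directory ($STAGE) instead of writing to /etc, so they can be reviewed before being copied into place.

```shell
# Stage the workaround files under $STAGE (a temp dir unless preset).
STAGE="${STAGE:-$(mktemp -d)}"
mkdir -p "$STAGE/etc/modprobe.d" "$STAGE/etc/systemd/system"

# Step 0: stop udev from auto-loading the module via its aliases.
cat > "$STAGE/etc/modprobe.d/blacklist-hid-multitouch.conf" <<'EOF'
blacklist hid_multitouch
EOF

# Step 1: load the module explicitly later in the boot process.
# An explicit "modprobe <name>" is not affected by the blacklist line.
# (Assumed unit layout; only the Description= line is from the report.)
cat > "$STAGE/etc/systemd/system/load-hid-multitouch.service" <<'EOF'
[Unit]
Description=Load the blacklisted hid-multitouch module

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/modprobe hid_multitouch

[Install]
WantedBy=multi-user.target
EOF

echo "Staged under $STAGE"
```

Once both files are copied to their real locations, steps 2 and 3 apply unchanged.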
This workaround works fine for me. I've re-enabled all the modules I had blacklisted and also removed all the kernel options disabling iommu(s). Virtualization is now enabled again (including kvm_intel). I rebooted the machine 50+ times, playing 4K videos both with and without DRI_PRIME=1 and/or suspending and resuming the machine a few times between the reboots. Also did a bit of touchscreen touching each time. Thus far there have been *no* hangs whatsoever. It just works and boots as one would expect.
The kernel version has changed since my initial report -- I have 4.3.3-300 now. But that doesn't seem to matter at all. According to the kernel bugzilla, many different kernels in a number of distros, including Arch, Fedora and Ubuntu, showed the same symptoms, always somehow related to the early loading of hid-multitouch.
Can this be marked as a duplicate of a bug from the kernel bugzilla? https://bugzilla.kernel.org/show_bug.cgi?id=105251
Phew, this was tough! :-)
Hmm, I would think that hid-multitouch is just the trigger that causes a lower-level component (the i2c bus) to hang the kernel. In the long term, we will want to fix whatever hangs, but I doubt that hid-multitouch is actually responsible.
(In reply to Andrej Podzimek from comment #13)
> Can this be marked as a duplicate of a bug from the kernel bugzilla?
Marked as external bug tracker. We cannot, however, close this one as a duplicate, because the issue still exists in Fedora.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
Fedora 23 has now been rebased to 4.7.4-100.fc23. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
If you experience different issues, please open a new bug report for those.
I have Fedora 24 with the latest kernel. (I always keep the system up-to-date.) The hid_multitouch workaround is still needed. I did try to remove it (almost) with each new kernel minor version (or at least whenever I had access to that particular machine), but I still got random hangs with hid_multitouch loaded automatically. Loading hid_multitouch explicitly later in the boot process (using systemd) helps.
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 23 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
Thank you for reporting this bug and we are sorry it could not be fixed.