The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). Sometimes existing processes can limp along for a little while, but they eventually hang too. In one instance, 'top' was open, and I saw "(crashdump)" and "kworker" consuming 100% of the CPU until other threads got ensnared and the machine had to be rebooted. How can I help debug this? I've tried disabling core dumps via /etc/security/limits.conf, but that made no difference. If it matters, I'm seeing this bug on a dual Xeon 8168 system, which probably exacerbates race conditions.
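For reference, the limits.conf change I mean was along these lines (standard pam_limits syntax; the wildcard applies to all users):

```
# /etc/security/limits.conf -- disable core dumps for all users (sketch)
*    hard    core    0
*    soft    core    0
```

Note that on Fedora, crashes are typically piped to systemd-coredump via kernel.core_pattern, which pam_limits does not affect, so that may be why this change made no difference.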
Ding! I tried reproducing this on the console after a clean reboot, with X11 never launched, and a good deal of useful debug info was printed. Please see the photos enclosed here: http://znu.io/dual8168hang.tar
Created attachment 1383447 [details] Second example with different backtrace
I've narrowed the bug down: when building in /tmp (tmpfs) instead of /home (xfs on NVMe), it no longer reproduces.
The bug doesn't reproduce on / (ext4 on SATA SSD) either, so at this point it's either in the NVMe subsystem or in xfs.
In case it matters, here is the lspci -v output:

45:00.0 Non-Volatile memory controller: Toshiba America Info Systems NVMe Controller (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Toshiba America Info Systems Device 0001
	Flags: bus master, fast devsel, latency 0, IRQ 38, NUMA node 0
	Memory at 94200000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [178] #19
	Capabilities: [198] Latency Tolerance Reporting
	Capabilities: [1a0] L1 PM Substates
	Kernel driver in use: nvme
	Kernel modules: nvme
This reproduces on Rawhide too, with both the debug and non-debug versions of 4.15.0-0.rc8.git0.1.fc28.x86_64. I'll attach backtraces from the debug build.
Created attachment 1383795 [details] Backtrace part 1 with Rawhide debug kernel
Created attachment 1383796 [details] Backtrace part 2 with Rawhide debug kernel
1) I've now tried both xfs and ext4 on NVMe; both crash.
2) I've now eliminated NVMe from the equation by testing a loopback device backed by a file in /tmp (tmpfs). The bug seems to be in the mq-deadline scheduler, which /sys/block/loop0/queue/scheduler reports as active ("[mq-deadline] none").
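For anyone trying to reproduce this, the loopback setup I used was roughly the following (a sketch; the paths, the 8G size, and /dev/loop0 are examples, and the losetup/mkfs/mount steps need root, so they are shown as comments):

```shell
# Sparse backing file on tmpfs (no root needed for this part)
truncate -s 8G /tmp/blk.img

# Root-only steps, shown for reference:
# sudo losetup --find --show /tmp/blk.img   # prints the loop device, e.g. /dev/loop0
# sudo mkfs.xfs /dev/loop0
# sudo mount /dev/loop0 /mnt
# cat /sys/block/loop0/queue/scheduler      # the active scheduler is bracketed, e.g. "[mq-deadline] none"

# Confirm the sparse file has the requested apparent size
stat -c %s /tmp/blk.img
```

Running the test suite build on the mounted loop device then takes NVMe out of the picture while keeping the same block-layer scheduler in play.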
Can you share a text-based version of the log? I can barely read the screenshots. If this is a bug in mq-deadline, it will need to be reported to the upstream developers of that component.
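If the machine hangs before anything reaches disk, netconsole can stream the kernel log to another box as plain text. A sketch of the configuration (the port numbers, IPs, interface name, and MAC address are placeholders for your network):

```
# /etc/modprobe.d/netconsole.conf -- example values only
# Format: src-port@src-ip/src-dev,dst-port@dst-ip/dst-mac
options netconsole netconsole=6665@192.168.0.2/enp3s0,6666@192.168.0.10/aa:bb:cc:dd:ee:ff
```

On the receiving host, something like `nc -u -l 6666 | tee kernel.log` captures the stream. A serial console, if available, works just as well.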
The images are full size; you can zoom in on them. In any case, I'm going to close the bug. I thought Red Hat might be interested in this because this workstation is high-end enough that RHEL customers would eventually trip over the same bug. But if Red Hat isn't interested, then this bug serves no purpose.