The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). Sometimes existing processes can limp along for a little while, but they eventually hang too. In one instance, 'top' was open, and I saw "(crashdump)" and "kworker" consuming 100% of the CPU until other threads got ensnared and the machine had to be rebooted. How can I help debug this? I've tried disabling core dumps via /etc/security/limits.conf, but that made no difference. If it matters, I'm seeing this bug on a dual Xeon 8168 system, which probably exacerbates race conditions.
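For reference, the limits.conf change I mean was along these lines (standard pam_limits syntax; the wildcard applies to all users):

```
# /etc/security/limits.conf -- disable core dumps for all users (sketch)
*    hard    core    0
*    soft    core    0
```

Note that on Fedora, crashes are typically piped to systemd-coredump via kernel.core_pattern, which pam_limits does not affect, so that may be why this change made no difference.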
Ding! I tried reproducing this on the console after a clean reboot, with X11 never launched, and a good deal of useful debug info was printed. Please see the photos enclosed here: http://znu.io/dual8168hang.tar
Created attachment 1383447 [details] Second example with different backtrace
I've narrowed the bug down: when building in /tmp (tmpfs) instead of /home (xfs on NVMe), it no longer reproduces.
The bug doesn't reproduce on / (ext4 on SATA SSD) either, so at this point it's either in the NVMe subsystem or in xfs.
In case it matters, here is the lspci -v output:

45:00.0 Non-Volatile memory controller: Toshiba America Info Systems NVMe Controller (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Toshiba America Info Systems Device 0001
	Flags: bus master, fast devsel, latency 0, IRQ 38, NUMA node 0
	Memory at 94200000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [178] #19
	Capabilities: [198] Latency Tolerance Reporting
	Capabilities: [1a0] L1 PM Substates
	Kernel driver in use: nvme
	Kernel modules: nvme
This reproduces on Rawhide too, with both the debug and non-debug versions of 4.15.0-0.rc8.git0.1.fc28.x86_64. I'll attach backtraces from the debug build.
Created attachment 1383795 [details] Backtrace part 1 with Rawhide debug kernel
Created attachment 1383796 [details] Backtrace part 2 with Rawhide debug kernel
1) I've now tried both xfs and ext4 on NVMe; both crash.
2) I've now eliminated NVMe from the equation by testing a loopback device backed by a file in /tmp (tmpfs). The bug seems to be in the mq-deadline scheduler, which /sys/block/loop0/queue/scheduler reports as active ("[mq-deadline] none").
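For anyone trying to reproduce this, the loopback setup I used was roughly the following (a sketch; the paths, the 8G size, and /dev/loop0 are examples, and the losetup/mkfs/mount steps need root, so they are shown as comments):

```shell
# Sparse backing file on tmpfs (no root needed for this part)
truncate -s 8G /tmp/blk.img

# Root-only steps, shown for reference:
# sudo losetup --find --show /tmp/blk.img   # prints the loop device, e.g. /dev/loop0
# sudo mkfs.xfs /dev/loop0
# sudo mount /dev/loop0 /mnt
# cat /sys/block/loop0/queue/scheduler      # the active scheduler is bracketed, e.g. "[mq-deadline] none"

# Confirm the sparse file has the requested apparent size
stat -c %s /tmp/blk.img
```

Running the test suite build on the mounted loop device then takes NVMe out of the picture while keeping the same block-layer scheduler in play.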
Can you share a text-based version of the log? I can barely read the screenshots. If this is a bug in mq-deadline, it will need to be reported to the upstream developers of that component.
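If the machine hangs before anything reaches disk, netconsole can stream the kernel log to another box as plain text. A sketch of the configuration (the port numbers, IPs, interface name, and MAC address are placeholders for your network):

```
# /etc/modprobe.d/netconsole.conf -- example values only
# Format: src-port@src-ip/src-dev,dst-port@dst-ip/dst-mac
options netconsole netconsole=6665@192.168.0.2/enp3s0,6666@192.168.0.10/aa:bb:cc:dd:ee:ff
```

On the receiving host, something like `nc -u -l 6666 | tee kernel.log` captures the stream. A serial console, if available, works just as well.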
The images are full size; you can zoom in on them. In any case, I'm going to close the bug. I thought Red Hat might be interested in this because this workstation is high-end enough that RHEL customers would eventually trip over the same bug. But if Red Hat isn't interested, then this bug serves no purpose.