Bug 1965809 - Kernels 5.12 using bcache make systems freeze
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 34
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-30 10:03 UTC by Rolf Fokkens
Modified: 2021-06-21 09:38 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-21 09:38:04 UTC
Type: Bug
Embargoed:


Attachments
crash1 oops (9.42 KB, text/plain)
2021-05-31 09:15 UTC, Rolf Fokkens
crash2 oops (13.64 KB, text/plain)
2021-05-31 09:16 UTC, Rolf Fokkens

Description Rolf Fokkens 2021-05-30 10:03:35 UTC
1. Please describe the problem:
After upgrading to kernel 5.12.6 / 5.12.7, my system hangs after a random amount of time, but within minutes.

2. What is the Version-Release number of the kernel:
5.12.7

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
All is fine on 5.11.20; this has happened since 5.12.6.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
- Boot system in kernel 5.12.7
- Log in to X
- Within minutes the system is frozen

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Not tested yet; will do.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
nvidia

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
The system freezes hard; no logs are available.

However, on one occasion I noticed nvme-module-related errors on the console. No details are available because they scrolled out of sight.

I tried kdump and sysrq, but so far nothing could be extracted.

I know this report lacks information, so I consider it a placeholder for:
a) more info to be added later
b) other people who have similar issues.

Hardware:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
00:1c.7 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #8 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation 200 Series PCH LPC Controller (B250)
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
01:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

Comment 1 Rolf Fokkens 2021-05-30 12:15:09 UTC
Checked rawhide kernel 5.13.0-0.rc3.20210527gitad9f25d33860.28.fc35: the system froze as well, possibly faster.

Removed the proprietary nvidia drivers: the system still froze.

Comment 2 Chris Murphy 2021-05-30 21:06:29 UTC
This might be a case for using netconsole and a 2nd computer.
https://wiki.archlinux.org/title/General_troubleshooting#netconsole
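
A minimal sketch of such a netconsole setup follows. The parameter layout (`src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac`) is the one documented for the kernel's netconsole module; every address, the interface name, and the MAC below are hypothetical placeholders, not values from this report. The script only prints the commands so they can be reviewed before being run by hand.

```shell
#!/bin/sh
# Hypothetical values: replace all of these with your own network details.
SRC_PORT=6665                 # UDP source port on the freezing machine (netconsole default)
SRC_IP=192.168.1.10           # IP address of the freezing machine (placeholder)
SRC_DEV=enp4s0                # its network interface, see `ip link` (placeholder)
TGT_PORT=6666                 # UDP port the second computer listens on (netconsole default)
TGT_IP=192.168.1.20           # IP address of the second computer (placeholder)
TGT_MAC=00:11:22:33:44:55     # MAC address of the second computer (placeholder)

# netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
NETCONSOLE_PARAM="${SRC_PORT}@${SRC_IP}/${SRC_DEV},${TGT_PORT}@${TGT_IP}/${TGT_MAC}"

# Dry run: print the commands instead of executing them.
echo "On the second computer:  nc -u -l ${TGT_PORT} | tee netconsole.log"
echo "On the freezing machine: sudo dmesg -n 8"
echo "On the freezing machine: sudo modprobe netconsole netconsole=${NETCONSOLE_PARAM}"
```

Start the listener first, then load the module. `dmesg -n 8` raises the console log level so that all kernel messages are forwarded. Some netcat variants expect `nc -u -l -p 6666` instead.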

Comment 3 Chris Murphy 2021-05-30 21:12:53 UTC
An alternative is to independently confirm that 5.12.5 works and 5.12.6 fails by compiling the upstream source and, if confirmed, to do a git bisect, which will help find the commit that broke 5.12.6.
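
The bisect suggested above could look roughly like the sketch below. It assumes a clone of the upstream stable tree and assumes that v5.12.5 is good and v5.12.6 is bad, which this report has not confirmed (only 5.11.20 is known good). The helper names `bisect_begin` and `bisect_report` are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Run inside a clone of the upstream stable tree, e.g.:
#   git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
GOOD_TAG=v5.12.5   # ASSUMPTION: confirm by building and booting this tag first
BAD_TAG=v5.12.6    # ASSUMPTION: confirm that the freeze reproduces on this tag

# Start the bisect between the known-bad and known-good tags.
bisect_begin() {
    git bisect start &&
    git bisect bad "$BAD_TAG" &&
    git bisect good "$GOOD_TAG"
}

# After building and booting the commit that git checked out,
# tell git whether that kernel froze or ran fine.
bisect_report() {
    case "$1" in
        froze) git bisect bad ;;
        ok)    git bisect good ;;
        *)     echo "usage: bisect_report froze|ok" >&2; return 2 ;;
    esac
}
```

Call `bisect_begin` once, then build, boot, and call `bisect_report froze` or `bisect_report ok` for each step until git names the first bad commit. `git bisect reset` returns the tree to its original state afterwards.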

Comment 4 Rolf Fokkens 2021-05-31 09:15:57 UTC
Created attachment 1788187
crash1 oops

Comment 5 Rolf Fokkens 2021-05-31 09:16:38 UTC
Created attachment 1788188
crash2 oops

Comment 6 Rolf Fokkens 2021-05-31 09:17:53 UTC
netconsole proved to be really useful. Attached two oopses, both related to bcache. I'll reach out to the bcache devs.
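
For others wondering whether they might be affected, a quick check for bcache devices could look like this. It is a sketch that relies on bcache devices appearing as `bcacheN` under `/sys/block`; the function name `list_bcache_devices` is made up for illustration.

```shell
#!/bin/sh
# List bcache block devices found under a sysfs root (default: /sys).
list_bcache_devices() {
    sysfs_root="${1:-/sys}"
    for dev in "$sysfs_root"/block/bcache*; do
        # The glob stays literal when nothing matches, so test for existence.
        [ -e "$dev" ] && basename "$dev"
    done
    return 0
}

# Prints one name per line (e.g. bcache0); prints nothing if bcache is not in use.
list_bcache_devices
```

Empty output suggests the system has no active bcache devices, in which case a freeze on 5.12 probably has a different cause.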

Comment 8 Rolf Fokkens 2021-05-31 09:39:15 UTC
https://www.spinics.net/lists/linux-bcache/msg10127.html:

"This is caused by a hidden issue which is triggered by the bio code change in v5.12.

The attached patch can help to avoid the panic, and the finally fixes are under testing and will be posted very soon."

Comment 9 Rolf Fokkens 2021-06-21 09:37:24 UTC
Fixed in kernel 5.12.11-200

