Bug 1965809

Summary: Kernels 5.12 using bcache make systems freeze
Product: [Fedora] Fedora
Reporter: Rolf Fokkens <rolf>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: unspecified
Version: 34
CC: acaringi, adscvr, airlied, alciregi, bskeggs, bugzilla, hdegoede, jarodwilson, jeremy, jglisse, john.mellor, jonathan, josef, jsiero, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved, tomek
Hardware: x86_64
OS: Linux
Last Closed: 2021-06-21 09:38:04 UTC
Type: Bug
Attachments:
Description    Flags
crash1 oops    none
crash2 oops    none

Description Rolf Fokkens 2021-05-30 10:03:35 UTC
1. Please describe the problem:
After upgrading to kernel 5.12.6 / 5.12.7, my system hangs at a random moment, but within minutes of booting.

2. What is the Version-Release number of the kernel:
5.12.7

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Everything works on 5.11.20; the freezes started with 5.12.6.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
- Boot system in kernel 5.12.7
- Log in to X
- Within minutes the system is frozen

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Not tested yet; will do.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
nvidia

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
The system freezes hard; no logs are available.

However, at one point I noticed nvme module related errors on the console. No details are available because they scrolled out of sight.

I tried kdump and SysRq, but so far nothing could be extracted.

I know this report lacks information, so I consider it a placeholder for:
a) more info to be added later
b) other people who have similar issues.

Hardware:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
00:1c.7 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #8 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation 200 Series PCH LPC Controller (B250)
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
01:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

Comment 1 Rolf Fokkens 2021-05-30 12:15:09 UTC
Checked Rawhide kernel 5.13.0-0.rc3.20210527gitad9f25d33860.28.fc35: the system froze as well, possibly faster.

Removed the proprietary nvidia drivers; the system still froze.

Comment 2 Chris Murphy 2021-05-30 21:06:29 UTC
This might be a case for using netconsole and a 2nd computer.
https://wiki.archlinux.org/title/General_troubleshooting#netconsole
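
For anyone trying the same approach, here is a minimal sketch of the two halves of a netconsole setup. The helper builds a netconsole= parameter in the layout described in the kernel documentation ([src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]), and the receiver is meant to run on the second computer. The IP address, interface name and log file name are placeholders, not values from this report, and 6665/6666 are assumed to be the usual netconsole default ports.

```python
import socket


def netconsole_param(target_ip, device="eth0", target_port=6666,
                     source_port=6665, source_ip="", target_mac=""):
    """Build a netconsole= parameter string.

    Layout: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
    Empty source_ip / target_mac are left blank so the kernel picks defaults.
    """
    return (f"{source_port}@{source_ip}/{device},"
            f"{target_port}@{target_ip}/{target_mac}")


def receive_log(port=6666, out_path="netconsole.log"):
    """Run on the second computer: append every UDP datagram to a file."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        with open(out_path, "ab") as log:
            while True:
                data, _sender = sock.recvfrom(65535)
                log.write(data)
                log.flush()  # flush each message so nothing is lost if the sender freezes


if __name__ == "__main__":
    # 192.168.1.20 and enp4s0 are hypothetical; substitute your own values.
    print("sudo modprobe netconsole netconsole="
          + netconsole_param("192.168.1.20", device="enp4s0"))
```

On the freezing machine, the printed modprobe line would be run before reproducing the hang; on the second machine, receive_log() is started first so that the oops text is captured as it arrives.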

Comment 3 Chris Murphy 2021-05-30 21:12:53 UTC
An alternative is to independently confirm that 5.12.5 works and 5.12.6 fails by compiling the upstream source; if confirmed, a git bisect will help find which commit broke 5.12.6.
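
To illustrate why bisecting is practical even with a slow "boot and wait for a freeze" test, here is a rough sketch of the search that git bisect performs, simplified to a linear list of commits. The commit names and the is_bad callback are hypothetical; in practice the steps are driven by git bisect start, git bisect bad and git bisect good, with each step being a kernel build and a test boot.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1   # hi always points at a known-bad commit
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid               # culprit is mid or earlier
        else:
            lo = mid + 1           # culprit is after mid
    return commits[lo]


# Hypothetical history of eight commits where the sixth introduces the freeze.
commits = [f"c{i}" for i in range(1, 9)]
tested = []


def freezes(commit):
    tested.append(commit)          # record each simulated build-and-boot
    return int(commit[1:]) >= 6


culprit = first_bad(commits, freezes)
```

Each test halves the remaining range, so the number of builds grows only with the logarithm of the number of commits between the good and bad versions.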

Comment 4 Rolf Fokkens 2021-05-31 09:15:57 UTC
Created attachment 1788187
crash1 oops

Comment 5 Rolf Fokkens 2021-05-31 09:16:38 UTC
Created attachment 1788188
crash2 oops

Comment 6 Rolf Fokkens 2021-05-31 09:17:53 UTC
netconsole proved to be really useful. Attached two oopses, both related to bcache. I'll reach out to the bcache devs.

Comment 8 Rolf Fokkens 2021-05-31 09:39:15 UTC
https://www.spinics.net/lists/linux-bcache/msg10127.html:

"This is caused by a hidden issue which is triggered by the bio code change in v5.12.

The attached patch can help to avoid the panic, and the finally fixes are under testing and will be posted very soon."

Comment 9 Rolf Fokkens 2021-06-21 09:37:24 UTC
Fixed in kernel 5.12.11-200
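
For others who landed here with the same symptoms, a small sketch for checking whether a 5.12.x kernel release string is at or beyond the fixed version named above. The function names are made up for illustration, the example release strings in the usage note are placeholders, and the check deliberately returns None for other kernel series because this report does not say which of those carry the fix.

```python
import re

FIXED = (5, 12, 11)  # taken from "Fixed in kernel 5.12.11-200" in this report


def upstream_version(release):
    """Extract (major, minor, patch) from a string like '5.12.7-300.fc34.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release):
    """True if a 5.12.x release is at or beyond the fixed version.

    Returns None for any other series, which this sketch does not judge.
    """
    version = upstream_version(release)
    if version[:2] != (5, 12):
        return None
    return version >= FIXED
```

Passing the output of uname -r (or platform.release() in Python) to has_fix() gives the answer for the running kernel; for example, has_fix("5.12.7-300.fc34.x86_64") is False.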