Bug 1844905

Summary: Kernel crashes due to NVMe disk: WD Blue SN550 (WDC WDS100T2B0C)
Product: [Fedora] Fedora
Reporter: rugk <7d28c752>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: NEW ---
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: acaringi, airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, steved
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   

Description rugk 2020-06-07 22:42:31 UTC
1. Please describe the problem:
In Fedora 32 Silverblue, the Linux kernel 5.6.15-300.fc32.x86_64 crashes, apparently because of my NVMe disk (a WDC WDS100T2B0C-00PXH0, aka WD Blue SN550 1TB).

2. What is the Version-Release number of the kernel:
5.6.15-300.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I did not try older kernels; I have only used 5.6.15-300.fc32.x86_64.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Very often.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
(How to do that on Silverblue…? Hmm…)

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
no

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
log included

----

## What happens

Randomly (I assume when it accesses the file system/the NVMe SSD heavily), the system just freezes and shows me a fullscreen error. It's always some kind of **ext4 error**, but it's a new installation, so the file system itself should be intact.

Here are some errors:

> [ 4948.250597] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode 83829000: comm gdm-session-wor: reading directory lblock 0

IMG_20200604_230820.jpg

-----

> [  213.350921] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode 83029000: comm gdm-session-wor: reading directory lblock 0

IMG_20200605_000220.jpg

-----

> [  206.681358] EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: comm dconf worker: Cannot read inode bitmap - block_group = 1056, inode_bitmap = 34603024
> [  206.681465] EXT4-fs error (device dm-4) in ext4_free_inode:355: IO failure
> [  206.775200] EXT4-fs error (device dm-4): ext4_wait_block_bitmap:520: comm cheese:cs0: Cannot read block bitmap - block_group = 38, block_bitmap = 1048582
> [  206.775410] EXT4-fs error (device dm-4): ext4_discard_preallocations:4090: comm cheese:cs0: Error -5 reading block bitmap for 38
> [  213.584473] EXT4-fs error (device dm-4): ext4_journal_check_start:84: Detected aborted journal
> [  213.584557] EXT4-fs (dm-4): Remounting filesystem read-only

IMG_20200605_232825.jpg
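
For anyone who wants to sift through a full kernel log instead of photos, here is a minimal sketch that pulls `EXT4-fs error` lines out of log text and counts them per device. It assumes the lines have the usual `EXT4-fs error (device X): function:line: message` shape (or the `... in function:line: message` variant); the names `parse_ext4_errors` and `errors_per_device` are made up for this illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Matches both shapes that show up in the kernel log:
#   EXT4-fs error (device dm-4): func:123: message
#   EXT4-fs error (device dm-4) in func:123: message
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\)"
    r"(?::| in) (?P<function>\w+):(?P<line>\d+): (?P<message>.*)"
)


def parse_ext4_errors(log_text):
    """Return one dict per EXT4-fs error line found in log_text."""
    matches = (EXT4_ERROR.search(raw) for raw in log_text.splitlines())
    return [m.groupdict() for m in matches if m]


def errors_per_device(log_text):
    """Count EXT4-fs error lines per block device name."""
    return Counter(e["device"] for e in parse_ext4_errors(log_text))
```

It could be run over the `dmesg.txt` produced by the `journalctl` command from the template, for example `errors_per_device(open("dmesg.txt").read())`. It only groups the messages; it says nothing about whether the disk, the controller, or the driver is at fault.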

### What also happened

I assume the same underlying problem also caused another error: the TPM seems to have been corrupted, and I had to regenerate it.

What I actually saw: at one boot, the BIOS/UEFI showed me a message claiming that I had switched the CPU (of course I did not, it's the built-in AMD Ryzen CPU) and that it needed to regenerate the fTPM values or so.
As I do not have anything that relies on the TPM, I could just choose `Y` (yes) to regenerate it.
(Note: IIRC, this happened after all of the photos above were taken.)

## System

Here are all logs with system information (nvme-cli, smartctl, lshw etc.):
https://gist.github.com/rugk/d17c88a7f78c986029c08426235217ed

**Side-note:** I had to learn that not all WDC drives actually [support the custom WDC commands](https://github.com/linux-nvme/nvme-cli/issues/731) that `nvme-cli` provides.

### A log catching the problem

I've also managed to catch the `dmesg` output when this occurred. This time, it **was not noticeable graphically**, and I could actually still use the system. However, in the background, it seems to have remounted the whole file system read-only (and did not tell me, lol). Do have a look at the end of that kernel log:
https://gist.github.com/rugk/88cad699c2ccf2cf0d309aa3a81221a1
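
Since the read-only remount was silent, a quick check of the mount table can make it visible. The following is only a sketch, assuming input in the `/proc/mounts` layout (`device mountpoint fstype options dump pass`); the function name `read_only_mounts` is hypothetical.

```python
def read_only_mounts(mounts_text, fstypes=("ext4",)):
    """Return (device, mountpoint) pairs that are mounted read-only.

    mounts_text is expected to use the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # "ro" appears as its own comma-separated option when read-only
        if fstype in fstypes and "ro" in options.split(","):
            found.append((device, mountpoint))
    return found
```

On a running system it could be fed with `open("/proc/mounts").read()`. Some mounts may be read-only by design (on Silverblue, `/usr` for example), so a hit is only suspicious for file systems that are normally writable.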

Funny how the system is still able to run while it throws all these kinds of errors…

## Links

I've also posted this in the Fedora Ask forum, where it may be easier to read: https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620?u=rugk

Should I also report this at https://bugzilla.kernel.org/ or so?

Comment 1 rugk 2020-06-10 10:54:41 UTC
Also reported upstream now:
https://bugzilla.kernel.org/show_bug.cgi?id=208123