Bug 1844905 - Kernel crashes due to NVMe disk: WD Blue SN550 (WDC WDS100T2B0C)
Summary: Kernel crashes due to NVMe disk: WD Blue SN550 (WDC WDS100T2B0C)
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-07 22:42 UTC by rugk
Modified: 2020-06-10 10:55 UTC (History)
19 users

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments


Links
Linux Kernel bug 208123 (last updated 2020-06-10 10:55:42 UTC)

Description rugk 2020-06-07 22:42:31 UTC
1. Please describe the problem:
On Fedora 32 Silverblue, the Linux kernel 5.6.15-300.fc32.x86_64 appears to crash because of my NVMe disk (a WDC WDS100T2B0C-00PXH0, aka WD Blue SN550 1TB).

2. What is the Version-Release number of the kernel:
5.6.15-300.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I did not try old kernels; I have only used 5.6.15-300.fc32.x86_64.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Very often.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
(How to do that on Silverblue…? Hmm…)
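On Silverblue, the `dnf` commands above do not apply; one way to test a Rawhide kernel would be to rebase the whole deployment to the Rawhide ref with `rpm-ostree` (a sketch, assuming the standard Fedora Silverblue OSTree ref name; verify the ref on your system before using it):

```shell
# Candidate Rawhide ref for Silverblue (assumption: standard Fedora ref layout).
RAWHIDE_REF="fedora:fedora/rawhide/x86_64/silverblue"
echo "$RAWHIDE_REF" > rawhide_ref.txt

# On a real Silverblue system you would then run (reboot required afterwards):
#   sudo rpm-ostree rebase "$RAWHIDE_REF"
# and to return to the previous deployment:
#   sudo rpm-ostree rollback
```

Because rpm-ostree keeps the previous deployment, rolling back after the test is a single command plus a reboot.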

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
no

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
log included
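For large captured logs, it can help to pre-filter for the relevant EXT4 lines before attaching or reading them. A minimal sketch (the log lines are inlined here for illustration; against a real log you would run the same `grep` on the `dmesg.txt` produced by the `journalctl` command above):

```shell
# Sample kernel log lines, inlined for illustration only.
printf '%s\n' \
  '[  206.681358] EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: Cannot read inode bitmap' \
  '[  206.700000] usb 1-1: new high-speed USB device number 2' \
  '[  213.584557] EXT4-fs (dm-4): Remounting filesystem read-only' \
  > dmesg_sample.txt

# Keep only the EXT4-related lines.
grep -E 'EXT4-fs' dmesg_sample.txt > ext4_lines.txt
cat ext4_lines.txt
```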

----

## What happens

Randomly (I assume when the file system on the NVMe SSD is being accessed heavily), the system just freezes and shows me a fullscreen error. It's always some kind of **ext4 error**, but this is a fresh installation, so the file system should be intact.

Here are some errors:

> [ 4948.250597] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode 83829000: comm gsd-session-wor: reading directory lblock 0

IMG_20200604_230820.jpg

-----

> [  213.350921] EXT4-fs error (device dm-2): __ext4_find_entry:1536: inode 83029000: comm gsd-session-wor: reading directory lblock 0

IMG_20200605_000220.jpg

-----

> [  206.681358] EXT4-fs error (device dm-4): ext4_read_inode_bitmap:200: comm dconf worker: Cannot read inode bitmap - block_group = 1056, inode_bitmap = 34603024
> [  206.681465] EXT4-fs error (device dm-4) in ext4_free_inode:355: IO failure
> [  206.775200] EXT4-fs error (device dm-4): ext4_wait_block_bitmap:520: comm cheese:cs0: Cannot read block bitmap - block_group = 38, block_bitmap = 1048582
> [  206.775410] EXT4-fs error (device dm-4): ext4_discard_preallocations:4090: comm cheese:cs0: Error -5 reading block bitmap for 38
> [  213.584473] EXT4-fs error (device dm-4): ext4_journal_check_start:84: Detected aborted journal
> [  213.584557] EXT4-fs (dm-4): Remounting filesystem read-only

IMG_20200605_232825.jpg

### What also happened

I assume something related also caused another error: the fTPM seems to have been corrupted and I had to regenerate it.

What I actually saw: at some boot, the BIOS/UEFI showed me a message claiming I had switched the CPU (of course I did not; it's the built-in AMD Ryzen CPU) and that it needed to regenerate the fTPM values.
As I do not have anything that relies on the TPM, I could just choose `Y` (yes) to regenerate it.
(Note: This happened after all photos IIRC.)

## System

Here are all logs with system information (nvme-cli, smartctl, lshw etc.):
https://gist.github.com/rugk/d17c88a7f78c986029c08426235217ed

**Side-note:** I had to learn that not all WDC drives actually [support the custom WDC commands](https://github.com/linux-nvme/nvme-cli/issues/731) that `nvme-cli` provides.

### A log catching the problem

I've also managed to capture `dmesg` output when this occurred. This time it **was not noticeable graphically**, and I could actually still use the system. However, in the background it had remounted the whole file system read-only (and did not tell me, lol); have a look at the end of that kernel log:
https://gist.github.com/rugk/88cad699c2ccf2cf0d309aa3a81221a1

Funny how the system is still able to run while it throws all these kinds of errors…
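One way to notice such a silent read-only remount would be to list mounts whose options start with `ro`. A minimal sketch (the `/proc/mounts` content is inlined here as an example; on a live system you would read `/proc/mounts` directly, or use `findmnt -o TARGET,OPTIONS`):

```shell
# Example /proc/mounts content, inlined for illustration only.
printf '%s\n' \
  '/dev/dm-0 /sysroot ext4 rw,relatime 0 0' \
  '/dev/dm-4 /var ext4 ro,relatime 0 0' \
  > mounts_sample.txt

# Field 4 of /proc/mounts holds the mount options; match those starting "ro".
awk '$4 ~ /^ro(,|$)/ {print $2, "is mounted read-only"}' mounts_sample.txt > ro_mounts.txt
cat ro_mounts.txt
```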

## Links

Perhaps easier to read: I've also posted this in the Fedora Ask forum: https://ask.fedoraproject.org/t/investigating-kernel-crashes-due-to-nvme-disk/7620?u=rugk

Should I still report that on https://bugzilla.kernel.org/ or so?

Comment 1 rugk 2020-06-10 10:54:41 UTC
Also reported upstream now:
https://bugzilla.kernel.org/show_bug.cgi?id=208123

