Bug 2219024

Summary: second nvme ssd disappears after suspend
Product: Fedora
Reporter: Eric M <majzoube>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WORKSFORME
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 38
CC: acaringi, adscvr, airlied, alciregi, bskeggs, hdegoede, hpa, jarodwilson, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved
Hardware: x86_64   
OS: Linux   
Last Closed: 2023-07-08 16:45:04 UTC

Description Eric M 2023-07-01 03:43:00 UTC
I have a Lemur Pro (lemp12) from System76, running the Fedora 38 KDE spin. It has two nvme drives:

# nvme list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme1n1          /dev/ng1n1            S6S1NS0W314806L      Samsung SSD 970 EVO Plus 1TB             0x1          3.11  GB /   1.00  TB    512   B +  0 B   4B2QEXM7
/dev/nvme0n1          /dev/ng0n1            23085Q800011         WD_BLACK SN850X 4000GB                   0x1          4.00  TB /   4.00  TB    512   B +  0 B   624311WD

The default install put both drives into a single btrfs filesystem (metadata RAID1, data single, as shown below):
# btrfs filesystem show
Label: 'fedora_localhost-live'  uuid: ffb47540-848e-40c3-b0a8-32cb0886d093
        Total devices 2 FS bytes used 141.87GiB
        devid    1 size 3.64TiB used 175.02GiB path /dev/nvme0n1p3
        devid    2 size 931.51GiB used 3.01GiB path /dev/nvme1n1p1

# btrfs filesystem df /
Data, single: total=172.01GiB, used=139.82GiB
System, RAID1: total=8.00MiB, used=48.00KiB
Metadata, RAID1: total=3.00GiB, used=2.05GiB
GlobalReserve, single: total=264.06MiB, used=0.00B

Sleep is set for deep:
# cat /sys/power/mem_sleep 
s2idle [deep]
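
(For reference, the sleep mode can be switched at runtime through the same sysfs file; this is the generic kernel interface, nothing specific to this machine:)

# echo s2idle > /sys/power/mem_sleep
# cat /sys/power/mem_sleep
[s2idle] deep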

After closing the lid, the system goes to sleep as expected. However, on waking up, the second ssd is missing:
# nvme list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            23085Q800011         WD_BLACK SN850X 4000GB                   0x1          4.00  TB /   4.00  TB    512   B +  0 B   624311WD


Relevant part of kernel log that shows the system trying to wake up:

kernel: ACPI: PM: Waking up from system sleep state S3
kernel: ACPI: EC: interrupt unblocked
kernel: pcieport 0000:00:1d.0: Unable to change power state from D3hot to D0, device inaccessible
kernel: nvme 0000:2e:00.0: Unable to change power state from D3hot to D0, device inaccessible
kernel: ACPI: EC: event unblocked
kernel: xhci_hcd 0000:00:0d.0: xHC error in resume, USBSTS 0x401, Reinit
kernel: usb usb1: root hub lost power or was reset
kernel: usb usb2: root hub lost power or was reset
kernel: nvme 0000:2e:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme nvme1: Disabling device after reset failure: -19
kernel: pcieport 0000:00:06.0: can't derive routing for PCI INT A
kernel: nvme 0000:01:00.0: PCI INT A: no GSI
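
(The excerpt above is from the kernel log after resume; something like the following pulls it from the journal of the current boot, with the grep pattern purely illustrative:)

# journalctl -k -b | grep -E 'nvme|pcieport|Waking up'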

This is a well-known problem with nvme drives under Linux. The fix for most people is the kernel parameter nvme_core.default_ps_max_latency_us=0, but that does not help here. Setting the latency to any of the exit-latency (Ex_latency) values of either ssd's power states, which is what the 'solved' threads for many otherwise identical nvme problems recommend, also fails. Another suggestion was iommu=soft or iommu=pt; neither of those works either.
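
For anyone trying to reproduce my attempts: the exit latencies come from the controller's power-state table, and on Fedora the parameter is normally added with grubby. These are generic commands (the grep is only illustrative), not specific to this laptop:

# nvme id-ctrl /dev/nvme1 | grep exlat
# grubby --update-kernel=ALL --args="nvme_core.default_ps_max_latency_us=0"
# reboot
# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
0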

I've also tried re-seating the ssd as suggested in several other reports. This does not work either.

Given the numerous other reports out there (just google "linux nvme disappear suspend"), this looks like a problem in the nvme driver, but I can't say for certain.
The most relevant link I can offer is a patch discussed here:
https://lore.kernel.org/lkml/20230309093657.GA24373@lst.de/T/
However, the kernel parameter suggested there does not exist in my nvme_core module, so I can't test it.
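
(The parameters that the loaded nvme_core module actually exposes can be listed with the generic commands below; the parameter from that patch is not among them on this kernel:)

# ls /sys/module/nvme_core/parameters/
# modinfo nvme_core | grep '^parm'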



Reproducible: Always

Steps to Reproduce:
1. close lid to suspend
2. open lid to wake
3. run 'nvme list' and see that the drive is missing.
Actual Results:  
The second drive is not visible after waking.

Expected Results:  
The second drive should wake up from sleep.
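
The cycle can also be driven from a terminal instead of the lid switch; this should go through the same suspend path, although I have only reproduced the failure via the lid:

# systemctl suspend
  (open the lid / press a key to wake)
# nvme list
# journalctl -k -b | grep -i nvme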

Kernel: 6.3.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 15 02:15:40 UTC 2023 x86_64 GNU/Linux
Running the latest updates on Fedora 38.

# inxi --admin --verbosity=7 --filter --no-host --width
System:
  Kernel: 6.3.8-200.fc38.x86_64 arch: x86_64 bits: 64 compiler: gcc
    v: 2.39-9.fc38 parameters: BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.3.8-200.fc38.x86_64
  Console: pty pts/1 wm: kwin_wayland DM: SDDM Distro: Fedora release 38
    (Thirty Eight)
Machine:
  Type: Laptop System: System76 product: Lemur Pro v: lemp12 serial: <filter>
    Chassis: type: 9 serial: N/A
  Mobo: System76 model: Lemur Pro v: lemp12 serial: <filter> UEFI: coreboot
    v: 2023-05-16_e9b9ea8 date: 05/16/2023

Comment 1 Eric M 2023-07-08 16:45:04 UTC
I ended up pulling out the Samsung ssd and replacing it with a WD. The problem went away.
I'm marking this as closed.