Bug 1592013 - Samsung NVMe SSD Controller SM951/PM951 goes down under load. PCIE AERs.
Summary: Samsung NVMe SSD Controller SM951/PM951 goes down under load. PCIE AERs.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Laura Abbott
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-16 07:28 UTC by bartoszmarc
Modified: 2019-02-21 21:10 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-02-21 21:10:31 UTC
Type: Bug
Embargoed:


Attachments
dmesg, dmidecode, lspci output (138.10 KB, application/x-gzip)
2018-06-16 07:28 UTC, bartoszmarc

Description bartoszmarc 2018-06-16 07:28:43 UTC
Created attachment 1452111
dmesg, dmidecode, lspci output

Description of problem: The NVMe controller goes down on vmlinuz-4.16.15-300.fc28.x86_64.


Version-Release number of selected component (if applicable):


How reproducible: Occurs under higher workload on the NVMe drive (Samsung SM951/PM951); might be related to resuming from suspend.


Additional info:

I've noticed that on fc28 with kernel 4.16.13 I get multiple PCIe AER messages, but the system is stable (I have just recompiled the kernel and finished backing up the whole system). On later kernels (4.16.14 and 4.16.15) the NVMe controller goes down and my rootfs is remounted read-only, reporting a size of 0.

I've been using this ThinkPad P50 for 1.25 years (starting from fc25, AFAIR) and the problem did not occur until recently. I had to upgrade from fc26 when it started to occur (I do not remember which kernel, though). It turned out that it still occurs on fc28, but hopefully it will work with 4.16.13.

02:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 [144d:a802] (rev 01)

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVKV512HAJH-000L1
Serial Number:                      S2CYNX0H804793
Firmware Version:                   6L0QBXX7

Comment 1 bartoszmarc 2018-07-06 20:02:30 UTC
I've been using 4.16.13 and have not noticed any of the kernel error messages mentioned previously.

I use 'systemctl suspend' and it does no harm. The system is quite stable on kernel:

Linux localhost.localdomain 4.16.13-300.fc28.x86_64 #1 SMP Wed May 30 14:31:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 2 bartoszmarc 2018-08-11 07:45:11 UTC
Any ideas what could be wrong? I've just observed it on 4.17.9-200.fc28.x86_64.

Comment 3 Laura Abbott 2018-08-13 17:36:52 UTC
Your best bet is to do a bisect between 4.16 and 4.17 to see what got broken, or to report the bug to the upstream maintainers.

Comment 4 Laura Abbott 2018-10-01 21:26:45 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.
 
Fedora 28 has now been rebased to 4.18.10-300.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.
 
If you experience different issues, please open a new bug report for those.

Comment 5 bartoszmarc 2018-10-05 11:48:58 UTC
The issue is still present on 4.18.10-200 on Fedora 28. 

I worked for half a day without any problems on the new kernel. Then I ran a build process inside Docker and went for lunch. When I came back I could not log in (I could see "controller down" errors on the console).

Do you think upgrading the Samsung firmware could help?

Can anyone point me to Samsung docs that describe the interface between the host and the NVMe controller?

Comment 6 bartoszmarc 2018-11-03 09:04:17 UTC
After adding nvme_core.default_ps_max_latency_us=0 pci=nomsi,noaer to the kernel command line, the problem still occurs.

After a firmware upgrade to 7L0QBXX7, the problem still occurs.
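
To rule out a bootloader typo, one can confirm that the options above actually reached the running kernel by looking for them in /proc/cmdline. The following is a minimal sketch in C, not part of any kernel or distribution tool; the function names cmdline_has_token and running_kernel_has_token are made up for illustration, and the 4096-byte buffer is an assumption that comfortably covers the x86_64 command line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if the whitespace-separated kernel command line contains
 * the token `want` exactly, e.g. "nvme_core.default_ps_max_latency_us=0".
 * (Hypothetical helper, written for this example.)
 */
bool cmdline_has_token(const char *cmdline, const char *want)
{
    assert(cmdline != NULL && want != NULL);

    size_t want_len = strlen(want);
    const char *p = cmdline;

    while (*p) {
        /* Skip separators between tokens. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        /* Measure the current token and compare it as a whole. */
        size_t len = strcspn(p, " \t\n");
        if (len == want_len && strncmp(p, want, len) == 0)
            return true;
        p += len;
    }
    return false;
}

/*
 * Check the command line of the booted kernel.
 * Returns 1 if `want` is present, 0 if absent, -1 on I/O error.
 */
int running_kernel_has_token(const char *want)
{
    char buf[4096]; /* assumed large enough for the command line */
    FILE *f = fopen("/proc/cmdline", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cmdline_has_token(buf, want) ? 1 : 0;
}
```

Matching is done on whole tokens, so a search for "pci=nomsi" does not match "pci=nomsi,noaer"; search for the option exactly as it was typed on the command line.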

Comment 7 bartoszmarc 2018-11-10 05:25:34 UTC
The issue has been reproduced on 4.14.29 (custom build). As far as I remember, kernels <= 4.15 were stable. Maybe this is a hardware/firmware issue.

Comment 8 bartoszmarc 2018-11-14 06:07:28 UTC
After applying the diff below, the system seems stable. I was able to build a new kernel twice (previously, a single kernel compilation usually could not finish because the NVMe controller went down). I haven't reproduced the issue (yet); I will let you know if it happens again.

Will you incorporate/upstream it? Without this workaround in the kernel I will not be able to upgrade to a new Fedora release (too much I/O goes on during an upgrade). It might also be useful to others with a Lenovo P50 laptop.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c33bb201b884..0fb06e6d5361 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2449,7 +2449,9 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
                 */
                if (dmi_match(DMI_SYS_VENDOR, "Dell Inc.") &&
                    (dmi_match(DMI_PRODUCT_NAME, "XPS 15 9550") ||
-                    dmi_match(DMI_PRODUCT_NAME, "Precision 5510")))
+                    dmi_match(DMI_PRODUCT_NAME, "Precision 5510")) ||
+                    dmi_match(DMI_SYS_VENDOR, "LENOVO") &&
+                    (dmi_match(DMI_PRODUCT_NAME, "20EQS0VV2S")))
                        return NVME_QUIRK_NO_DEEPEST_PS;
        } else if (pdev->vendor == 0x144d && pdev->device == 0xa804) {
                /*
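
The vendor and product strings that the patch above compares against can be read from userspace under /sys/class/dmi/id/ (sys_vendor and product_name). The sketch below assumes that dmi_match() does an exact, case-sensitive string comparison, which is how it is implemented as far as I can tell; the helper names chomp, read_dmi_id and ident_matches are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Strip one trailing newline; sysfs attribute files end with '\n'. */
void chomp(char *s)
{
    size_t n = strlen(s);

    if (n > 0 && s[n - 1] == '\n')
        s[n - 1] = '\0';
}

/*
 * Read one DMI identification string (e.g. "sys_vendor" or
 * "product_name") from sysfs into `out`. Returns false on error.
 */
bool read_dmi_id(const char *field, char *out, size_t out_len)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/class/dmi/id/%s", field);
    f = fopen(path, "r");
    if (!f)
        return false;
    if (!fgets(out, (int)out_len, f)) {
        fclose(f);
        return false;
    }
    fclose(f);
    chomp(out);
    return true;
}

/*
 * Exact comparison, mirroring the assumed strcmp() behaviour of
 * dmi_match(): no prefix, substring, or case-insensitive matching.
 */
bool ident_matches(const char *ident, const char *want)
{
    assert(want != NULL);
    return ident != NULL && strcmp(ident, want) == 0;
}
```

Reading "sys_vendor" and "product_name" with read_dmi_id() and passing them to ident_matches() with "LENOVO" and "20EQS0VV2S" shows whether a given machine would be covered by the patch. Because the comparison is exact, a machine reporting any other product name string would not get the quirk.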

Comment 9 bartoszmarc 2018-11-14 06:19:43 UTC
The version below compiles without gcc warnings:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c33bb201b884..f8b056d33ee2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2447,9 +2447,11 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
                 * 950 PRO 256GB", but it seems to be restricted to two Dell
                 * laptops.
                 */
-               if (dmi_match(DMI_SYS_VENDOR, "Dell Inc.") &&
+               if ((dmi_match(DMI_SYS_VENDOR, "Dell Inc.") &&
                    (dmi_match(DMI_PRODUCT_NAME, "XPS 15 9550") ||
-                    dmi_match(DMI_PRODUCT_NAME, "Precision 5510")))
+                    dmi_match(DMI_PRODUCT_NAME, "Precision 5510"))) ||
+                    (dmi_match(DMI_SYS_VENDOR, "LENOVO") &&
+                    dmi_match(DMI_PRODUCT_NAME, "20EQS0VV2S")))
                        return NVME_QUIRK_NO_DEEPEST_PS;
        } else if (pdev->vendor == 0x144d && pdev->device == 0xa804) {
                /*
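
The two patches should behave identically: in C, && binds more tightly than ||, so the extra parentheses in the second version only make the grouping explicit and silence gcc's -Wparentheses warning. The following sketch models each dmi_match() call as a plain boolean and compares both forms over every input combination; the function names are invented for this example, and the pragma is there only so the unparenthesized form compiles quietly.

```c
#include <assert.h>
#include <stdbool.h>

/* Form used in Comment 8: grouping relies on operator precedence. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wparentheses"
bool quirk_comment8(bool dell, bool xps, bool precision,
                    bool lenovo, bool p50)
{
    return dell && (xps || precision) || lenovo && (p50);
}
#pragma GCC diagnostic pop

/* Form used in Comment 9: same logic with explicit grouping. */
bool quirk_comment9(bool dell, bool xps, bool precision,
                    bool lenovo, bool p50)
{
    return (dell && (xps || precision)) || (lenovo && p50);
}

/* Count input combinations (out of 32) where the two forms disagree. */
int count_mismatches(void)
{
    int mismatches = 0;

    for (unsigned bits = 0; bits < 32; bits++) {
        bool dell      = (bits & 1u) != 0;
        bool xps       = (bits & 2u) != 0;
        bool precision = (bits & 4u) != 0;
        bool lenovo    = (bits & 8u) != 0;
        bool p50       = (bits & 16u) != 0;

        if (quirk_comment8(dell, xps, precision, lenovo, p50) !=
            quirk_comment9(dell, xps, precision, lenovo, p50))
            mismatches++;
    }
    return mismatches;
}

/* Abort if the two forms ever differ. */
void check_equivalence(void)
{
    assert(count_mismatches() == 0);
}
```

Calling check_equivalence() from a small main() aborts if any combination differs. This only models the boolean structure of the condition, not the kernel's DMI lookup itself.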

Comment 10 Justin M. Forbes 2019-01-29 16:25:47 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.20.5-100.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.

If you experience different issues, please open a new bug report for those.

Comment 11 Justin M. Forbes 2019-02-21 21:10:31 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

