Bug 1487421
| Summary: | PM961 NVME Controller Reset | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Dominic Robinson <development-K9RvgheM1OmXW9pm> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 26 | CC: | airlied, auroux, bskeggs, development-K9RvgheM1OmXW9pm, eparis, esandeen, hdegoede, ichavero, itamar, jarodwilson, jforbes, jglisse, jonathan, josef, josh.harness, jwboyer, kernel-maint, labbott, linville, luto, mchehab, mjg59, nhorman, quintela, vbraun.name |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-29 12:10:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Dominic Robinson
2017-08-31 21:20:01 UTC
Also, not sure how useful this is given the limited usefulness of SMART data on NVMe drives, but here's the output of sudo smartctl /dev/nvme0n1 -a:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-300.fc26.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW256HEHP-00000
Serial Number:                      XXXXXXXXXXXXXX
Firmware Version:                   CXB7001Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            255,877,681,152 [255 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Aug 31 22:27:43 2017 BST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    7,431,775 [3.80 TB]
Data Units Written:                 7,659,188 [3.92 TB]
Host Read Commands:                 74,514,720
Host Write Commands:                120,638,330
Controller Busy Time:               395
Power Cycles:                       1,246
Power On Hours:                     782
Unsafe Shutdowns:                   71
Media and Data Integrity Errors:    0
Error Information Log Entries:      30
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc    LBA  NSID  VS
  0         30     0  0x0018  0x4004  0x02c      0     0   -
  1         29     0  0x0017  0x4004  0x02c      0     0   -
  2         28     0  0x0018  0x4004  0x02c      0     0   -
  3         27     0  0x0017  0x4004  0x02c      0     0   -
  4         26     0  0x0018  0x4004  0x02c      0     0   -
  5         25     0  0x0017  0x4004  0x02c      0     0   -
  6         24     0  0x0018  0x4004  0x02c      0     0   -
  7         23     0  0x0017  0x4004  0x02c      0     0   -
  8         22     0  0x0018  0x4004  0x02c      0     0   -
  9         21     0  0x0017  0x4004  0x02c      0     0   -
 10         20     0  0x009f  0x4004  -          0     0   -
 11         19     0  0x0094  0x4004  -          0     0   -
 12         18     0  0x005f  0x4004  -          0     0   -
 13         17     0  0x0016  0x4004  0x02c      0     0   -
 14         16     0  0x0015  0x4004  0x02c      0     0   -
 15         15     0  0x00c2  0x4004  0x02c      0     0   -
... (14 entries not shown)

The error count has not increased since building the machine a year ago.

I've been digging some more; it definitely looks like the APST issue, going by the debugging steps taken here: https://www.mail-archive.com/kernel-packages@lists.launchpad.net/msg236507.html

I've been able to get the following output from sudo nvme get-feature -f 0x0c -H /dev/nvme0n1:

get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
Autonomous Power State Transition Enable (APSTE): Enabled
Auto PST Entries (ITPT = Idle Time Prior to Transition, ITPS = Idle Transition Power State):
Entry[ 0]: ITPT 86 ms, ITPS 3
Entry[ 1]: ITPT 86 ms, ITPS 3
Entry[ 2]: ITPT 86 ms, ITPS 3
Entry[ 3]: ITPT 410 ms, ITPS 4
Entry[ 4] .. Entry[31]: ITPT 0 ms, ITPS 0 (all remaining entries are zero)

As you can see, many of the entries are power state transitions with no idle time.

Created attachment 1320767 [details]
Turn off deepest power saving mode for pm961 drives
Compiling this now, will report back.
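For anyone trying to reproduce this, the APST table from the get-feature output above can be checked mechanically. A rough sketch, assuming the nvme-cli text format shown earlier; the parsing pattern and helper name are mine, not from the report:

```shell
#!/bin/sh
# Count APST table entries that are actually populated (non-zero idle
# time) in the text output of `nvme get-feature -f 0x0c -H`.
count_active_entries() {
    grep -c 'ITPT): *[1-9][0-9]* ms'
}

# On a live system you would pipe the real output:
#   sudo nvme get-feature -f 0x0c -H /dev/nvme0 | count_active_entries
# Here, the populated entries from the report above plus one zero entry:
printf '%s\n' \
    'Idle Time Prior to Transition (ITPT): 86 ms' \
    'Idle Time Prior to Transition (ITPT): 86 ms' \
    'Idle Time Prior to Transition (ITPT): 86 ms' \
    'Idle Time Prior to Transition (ITPT): 410 ms' \
    'Idle Time Prior to Transition (ITPT): 0 ms' |
    count_active_entries    # prints 4
```

A drive with APSTE enabled but only zero entries would report 0 here, which is itself a useful data point when comparing systems.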
Created attachment 1321084 [details]
Disable pm961 deeper sleep #2
Ok late night - mistakes were made.
New patch attached; it compiled successfully, and I'm going to test it for a day or so to see if the problem is fixed.
I'm not a kernel developer, so I have a limited understanding of what's going on; I'd greatly appreciate any input.
Also worth noting: I just built a production system using a couple of these drives with RHEL, and I'd prefer this bug not to filter down into any backports.

Can confirm that the attached patch has fixed this issue for me on Fedora 26. Can we look at including this please? Upstream here :)

Can you give the relevant line of lspci -nn output? More importantly, can you tell us what kind of computer this is and give dmidecode output? The kernel logs from when the device fails would be nice, too.

Hi Andy. Yes, the lspci output (the device ID and vendor ID are included in the above patch) is:

01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]

dmidecode output here: https://www.dcrdev.com/dmidecode.txt

The mainboard is an MSI B150I GAMING Pro, coupled with an i3-6100 CPU. It's difficult for me to provide logs; I don't think the relevant entries are getting written to disk before the drive goes offline and the file system subsequently becomes read-only. The only indication from the logs that something is wrong is that fsck has been run frequently within a short space of time, i.e. after the hard resets I performed. This is really the only evidence of it happening: https://www.dcrdev.com/17_08_29_13_40_24_0863.jpg (but that's only symptomatic).

What I can say is that it definitely appears to be this deeper sleep mode getting triggered; after applying my patch I'm no longer having this issue. It seems you're already implementing this workaround upstream, https://github.com/torvalds/linux/blob/v4.12/drivers/nvme/host/pci.c#L2074, except only when coupled with select Dell mainboards.

Lovely. I wonder how widespread this issue is.

I don't know, but at least one other person has reported this issue against this drive on non-Dell hardware here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

I've got a couple of these drives in RAID 1 on my RHEL server; they have different firmware.
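To tie an lspci line like the one above to the IDs used in the kernel's quirk table, the [vendor:device] pair can be pulled out with standard tools. A sketch; the helper name is mine, and the sample line is the one quoted in this report:

```shell
#!/bin/sh
# Extract the PCI [vendor:device] ID pair of the NVMe controller from
# `lspci -nn` output. 144d:a804 is the SM961/PM961 pair referenced in
# the attached patch and the upstream quirk discussion.
nvme_ids() {
    grep 'Non-Volatile memory controller' |
        sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\]$/\1/p'
}

sample='01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]'
echo "$sample" | nvme_ids    # prints 144d:a804
# Live usage (assumes the controller is visible to lspci):
#   lspci -nn | nvme_ids
```

The same IDs also show up in /sys/class/nvme/nvme0/device/vendor and /sys/class/nvme/nvme0/device/device on a running system.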
I built the machine fairly recently, but whilst I was setting it up I had to use a Fedora live image to chroot into the system and encountered some strange issues around the filesystem; in hindsight it was probably this issue.

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-693.1.1.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW256HEHP-00000
Serial Number:                      XXXXXXXX
Firmware Version:                   CXB7401Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            184,719,413,248 [184 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Tue Sep 5 23:44:31 2017 BST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning  Comp. Temp. Threshold:     68 Celsius
Critical Comp. Temp. Threshold:     71 Celsius

Fortunately NVMe APST isn't implemented on RHEL yet, but as you can see there are similar entries in the output of nvme get-feature -f 0x0c -H /dev/nvme0n1:

get-feature:0xc (Autonomous Power State Transition), Current value:00000000
Autonomous Power State Transition Enable (APSTE): Disabled
Auto PST Entries (ITPT = Idle Time Prior to Transition, ITPS = Idle Transition Power State):
Entry[ 0]: ITPT 60 ms, ITPS 3
Entry[ 1]: ITPT 60 ms, ITPS 3
Entry[ 2]: ITPT 60 ms, ITPS 3
Entry[ 3]: ITPT 9940 ms, ITPS 4
Entry[ 4] .. Entry[31]: ITPT 0 ms, ITPS 0 (all remaining entries are zero)

Do you think there's any chance the workaround I posted could be merged? I appreciate it's not necessarily the best thing to have lots of tiny hardcoded workarounds for specific hardware, and I'm happy to offer my time to get to the root cause if that's what's needed. Unfortunately I've got quite a lot invested in these drives, having set up several systems.
Also, the RHEL server I was referring to is on a completely different platform: an ASRock Rack E3C236D2I (C236) mainboard with a Xeon E3-1245 v6 CPU. I've reached out to Samsung.

The problem with applying your patch is that it would cause a fairly large power consumption regression on laptops. In the meantime, you should be able to work around the issue by booting with nvme_core.default_ps_max_latency_us=5500 or so.

Thanks, and FYI: I installed the stock Fedora kernel, added that parameter, and within minutes had the same issue. I've now disabled APST altogether by setting it to 0, which works as expected.

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. The kernel moves very fast, so bugs may get fixed as part of a kernel update. Because of this, we are doing a mass bug update across all of the Fedora 26 kernel bugs.

Fedora 26 has now been rebased to 4.15.4-200.fc26. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 27 and are still experiencing this issue, please change the version to Fedora 27. If you experience different issues, please open a new bug report for those.

This message is a reminder that Fedora 26 is nearing its end of life. Approximately four weeks from now, Fedora will stop maintaining and issuing updates for Fedora 26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '26'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 26 reached end of life.
If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Fedora 26 changed to end-of-life (EOL) status on 2018-05-29. Fedora 26 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug, and we are sorry it could not be fixed.

The problem still occurs with Fedora 28, and is especially bad with 4.17 kernels. See also https://bbs.archlinux.org/viewtopic.php?id=238547; this is not Fedora-specific. My problem is on a ThinkPad X1 Yoga with a Samsung PM960-series 512 GB NVMe SSD. The kernel option nvme_core.default_ps_max_latency_us=6000, found somewhere else, helped avoid the issue with the later 4.16 kernels, but the 4.17 kernels now break it again. I will try changing it to 5500 as suggested here, but I'm not expecting any miracles.

After a month of usage, nvme_core.default_ps_max_latency_us=200 is stable: no more ext4 errors and crashes, but battery life has shrunk quite a lot. Would it be too much to ask for this to be reopened against Fedora 28 and looked into seriously? It's a major kernel bug given how common these Samsung NVMe SSDs are these days. We shouldn't have to choose between random crashes that may or may not corrupt the filesystem and decent battery life.
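For anyone else landing here, a sketch of applying the workaround persistently on Fedora. grubby and the module parameter are real; the value 5500 is just one of several tried in this thread, and 0 disables APST entirely:

```shell
#!/bin/sh
# Build the kernel command-line argument for the APST latency cap.
PARAM="nvme_core.default_ps_max_latency_us=5500"

# Apply to every installed kernel (run as root):
#   grubby --update-kernel=ALL --args="$PARAM"
# After a reboot, confirm the module picked it up:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
echo "$PARAM"
```

Passing the same string once at the boot loader prompt is enough for a one-off test before making it permanent.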
I'm now having ext4 errors again with kernel 4.17.19, even with nvme_core.default_ps_max_latency_us=200. It's getting worse and worse!

It's not clear to me that there's anything the kernel can really do about this. From some past experience with these issues, the root cause *seems* to be that there are a handful of laptops out there with bona fide electrical problems. For whatever reason, they're exacerbated by NVMe APST, but that's not the root cause. Further confounding anyone's ability to test anything, at least some of the affected laptops only seem to be affected depending on whether they're plugged in, which makes it extremely hard to tell what's going on. It's also the case that it's basically impossible for a genuine kernel bug to exist here: the kernel is merely asking the hardware politely to save power. If the hardware screws it up, it is a hardware problem. At best the kernel could try to work around it, but it's not clear when or how to do this.

It's a serious bug that only occurs with recent versions of the kernel, on pretty common hardware (since Lenovo and Dell both seem to use Samsung SSDs commonly). Perhaps it's not strictly speaking a kernel bug, but if the kernel doesn't work properly on fairly widespread hardware, then it's a problem for the kernel. For a production system, it's just not acceptable to have random filesystem crashes. Pre-APST kernels were very power efficient on laptops with these Samsung SSDs; my experience in terms of battery life was that APST support didn't improve battery life but caused crashes, and then working around it with the latency parameter avoided crashes but degraded battery life significantly.
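The latency values being tried in this thread (6000, 5500, 200) can be related back to the drive's power state table in the first comment. Below is a simplified sketch of the idea behind nvme_core.default_ps_max_latency_us; the assumption that a state is eligible when its entry plus exit latency fits within the limit is a rough model of the driver's heuristic, not the exact kernel code:

```shell
#!/bin/sh
# Rough model: a non-operational power state is only used by APST if its
# entry latency plus exit latency (in microseconds) fits the limit.
allowed() {  # allowed <ent_lat_us> <ex_lat_us> <max_latency_us>
    if [ $(( $1 + $2 )) -le "$3" ]; then echo yes; else echo no; fi
}

# PM961 values from the smartctl output above:
#   state 3: Ent_Lat 210,  Ex_Lat 1500  (total 1710 us)
#   state 4: Ent_Lat 2200, Ex_Lat 6000  (total 8200 us)
allowed 210 1500 5500    # prints yes (state 3 kept at the 5500 limit)
allowed 2200 6000 5500   # prints no  (deepest state 4 excluded)
allowed 210 1500 200     # prints no  (200 excludes both sleep states)
```

This matches the behaviour reported here: 5500 cuts only the deepest state, while 200 effectively keeps the drive out of both low-power states, at a cost in battery life.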
I'm sure that APST is meant to be the right way to manage power on most NVMe SSDs, but for the sake of everyone with a Samsung controller, it would be nice to have an option to bypass APST entirely and return to older kernels' behavior [not sure what that means exactly], rather than having to tinker with latency parameters and hope that they're just right. Or are we supposed to throw away a nearly new laptop that works well in all other respects? Return to kernel 4.8 or thereabouts (I can't remember exactly when this nonsense started)? Switch to a different distribution?

Anyway, I appreciate that this may not be easy to fix, but I want to make sure that developers are aware this is an ongoing issue that is further exacerbated by recent kernel changes. I don't want this bug to be swept under the rug due to "Fedora 26 is EOL" and "the bug report is old" when in fact it is getting worse with newer kernels (at least on my system).

Update: I discovered that Samsung has a firmware update for these SSDs. It's hard to know what it covers (I found no clear indication that it addresses the APST issue), but who knows. I've just upgraded my Samsung 512 GB PM960 M.2 disk (model MZVLW512HMJP-000L7) from firmware 6L7QCXY7 to 7L7QCXY7. We'll see if this helps. (Of course, a firmware fix on Samsung's end would really be the right way to deal with this. That doesn't mean it's happening, but I'm keeping my fingers crossed.) Sorry, I should have checked for this firmware upgrade before resuming my periodic screaming at the kernel over this issue. For now I'll keep the very aggressive and power-hungry latency setting (200) because I *really* need my system to be stable in the coming weeks, but I'll take a chance and continue to boot 4.17.19 (or subsequent kernels once available); I will report again if crashes continue to occur with the new firmware and this setting.

I'd appreciate a report on whether the firmware helps. But "screaming at the kernel" won't get too far.
As the kernel person who implemented APST in the first place, I'm quite confident about this. And, for what it's worth, despite your experience of not saving much power, there are a lot of systems where APST makes a shockingly large difference. It's not just the power saved in the SSD itself: various systems seem to require that the SSD go to sleep before the PCIe link goes into a deep ASPM state, require that all the PCIe links be in deep ASPM states before the CPU package goes into a deep PC state, and need that deep PC state to get good battery life. Apparently Intel also suggests that failing to use deep PC states may adversely affect the lifespan of the system as a whole.

Hey Denis, I'm having the same issue and would also like to try the firmware update, but I'm having trouble finding it on Samsung's site. Where did you find it for your model? I'm using a PM951 Samsung NVMe SSD.

I'm still using the power-hungry latency setting (200) as I can't afford extra crashes at the moment, so I'm not sure exactly how much the upgrade helped. Qualitatively it seems to have helped some: kernel 4.17.19, even with this very low latency setting, crashed roughly twice a week before the firmware upgrade, and ran for 11 days after the firmware upgrade before producing an APST-related ext4 filesystem crash. The fact that it still crashed, though, indicates that the firmware update didn't sort things out completely. I'm now running 4.18.5, which has been well-behaved for 5 days so far.

I didn't get the firmware upgrade directly from Samsung's site; I got it from Lenovo (after rebooting into Windows). If you have a ThinkPad, look up the Lenovo NVMe SSD firmware update utility. I am under the impression that Samsung's firmware upgrade tool only works for the SSDs they sell directly to consumers; if your Samsung SSD was an OEM product (shipped with your machine), then you're expected to get the upgrade from your machine's manufacturer.
(But keep looking on both sides in case I'm wrong about this.) Denis

Very helpful - thanks Denis!

I'm also hitting this bug. I just updated the firmware, which can be done under Linux with nvme-cli. My old firmware version was 3L7QCXB7:

[root@zen ~]# nvme list
Node          SN              Model                       Namespace  Usage                 Format       FW Rev
/dev/nvme0n1  S35ENX0HC04495  SAMSUNG MZVLW256HEHP-000L7  1          62.89 GB / 256.06 GB  512 B + 0 B  5L7QCXB7

[root@zen ~]# nvme id-ctrl /dev/nvme0 | grep fr
fr      : 3L7QCXB7
frmw    : 0x16

Download and unzip the firmware from https://pcsupport.lenovo.com/gb/en/products/laptops-and-netbooks/thinkpad-t-series-laptops/thinkpad-t470s/downloads/ds119265

Figure out the firmware for your model:

[root@zen ~]# grep MZVLW256HEHP-000L7 FWNV30/fwwinsd.pro
"SAMSUNG MZVLW256HEHP-000L7","4L7QCXB7","5L7QCXB7","5L7QCXB7_NF_ENC.bin","RaidFWUpdate_V1_1_6.exe","","S","SAMSUNG"

Upload and commit the firmware:

[root@zen ~]# nvme fw-download /dev/nvme0 --fw=FWNV30/SAMSUNG/5L7QCXB7_NF_ENC.bin
[root@zen ~]# nvme fw-commit /dev/nvme0 --slot=0 --action=1

Now reboot your computer. Note: "echo 1 > /sys/class/nvme/nvme0/reset_controller" as suggested in the nvme-fw-commit manpage was not sufficient. After a reboot you have the new version:

[root@zen ~]# nvme id-ctrl /dev/nvme0 | grep fr
fr      : 5L7QCXB7
frmw    : 0x16

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days