Bug 1624611

Summary: Samsung PM961 NVME controller resets getting worse in 4.17.x
Product: Fedora
Reporter: Denis Auroux <auroux>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 28
CC: airlied, auroux, bskeggs, ewk, hdegoede, ichavero, itamar, jarodwilson, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved, velosol
Hardware: x86_64
OS: Linux
Type: Bug
Last Closed: 2019-02-21 21:13:43 UTC

Description Denis Auroux 2018-09-02 11:51:30 UTC
Bug 1487421 should not have been closed EOL; it still affects recent kernels, with increasing severity.

On systems with Samsung PM961 SSDs (at least the 512 GB models), such as the ThinkPad X1 Yoga 2nd gen (and presumably many others), there are sporadic NVMe controller resets. These cause a cascade of ext4 filesystem errors: the root filesystem gets remounted read-only and the system is paralyzed (input/output errors everywhere), leaving nothing to do but a hard reboot.

I can't copy-paste logs because the system has already crashed by the time this happens, and nothing reaches the system logs once the whole filesystem is read-only. Sometimes I can't even get to the text console where the messages occasionally appear. Typical messages are ext4 error complaints such as (from an earlier 4.17 kernel):

EXT4-fs error (device nvme0n1p9): __ext4_get_inode_loc:4619: inode #1444291: block 5267532: comm systemd-journal: unable to read itable block

EXT4-fs error (device nvme0n1p9): ext4_find_entry:1437: inode #1442404: comm gdm: reading directory lblock 0

Like everyone else affected, lspci shows:

05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961

Setting the kernel parameter nvme_core.default_ps_max_latency_us=5500, as recommended in comment #12 on bug 1487421, helped for a while, at the cost of some power drain. Then it stopped helping, and the problem came back with more recent kernels.

I then went all the way down to
nvme_core.default_ps_max_latency_us=200
as suggested elsewhere, which caused a VERY significant power drain but made my system 100% stable through most of the 4.16 and 4.17 kernel lines.

And now, after 4.17.19, I am getting ext4 errors again even with this parameter set to 200.
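For anyone else following along, this is roughly how such a parameter gets applied persistently on Fedora. A sketch only, assuming the stock GRUB2 setup; the device paths and the exact latency value are illustrative, not a recommendation:

```shell
# Cap NVMe APST transition latency on the kernel command line for all
# installed kernels (requires root; value is in microseconds).
sudo grubby --update-kernel=ALL \
    --args="nvme_core.default_ps_max_latency_us=200"

# After the next reboot, verify that the parameter took effect:
cat /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

Setting the value to 0 disables APST entirely, which trades battery life for stability in the same way the workaround above does.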

Given that this used to work fine with older kernels until a few months ago, without ext4 crashes AND without draining the battery, and given that probably half of the NVMe controllers out there are Samsung, would it be too much to ask that this be looked into seriously, and perhaps that all the NVMe power-efficiency code that started this debacle be rolled back? It used to just work, and the new code is neither as stable nor as power-efficient as the old code!!!

Denis


Comment 1 Laura Abbott 2018-10-01 21:37:44 UTC
We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.
 
Fedora 28 has now been rebased to 4.18.10-300.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.
 
If you experience different issues, please open a new bug report for those.

Comment 2 Denis Auroux 2018-10-15 12:30:21 UTC
The problem seems to be mitigated by a Samsung NVMe firmware update plus setting nvme_core.default_ps_max_latency_us=200.

I have not had a chance to test without the max latency parameter, as this is my work machine and this is a busy period of the year -- not a good time to randomly lose data in ext4 crashes if the Samsung firmware update didn't actually do the trick.

The underlying bad interaction between the kernel and non-firmware-updated Samsung NVMe SSDs is very likely still present, though, and there may be a number of people who can't update easily, since OEM Samsung SSDs get their firmware upgrades through the PC manufacturer, not from Samsung.
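For readers who want to check whether their drive is running updated firmware before deciding on a workaround, nvme-cli can report the controller's firmware revision. A sketch, assuming the nvme-cli package is installed; /dev/nvme0 is illustrative and device names vary:

```shell
# List NVMe devices with their model and firmware revision
# (requires root).
sudo nvme list

# Or query the controller identify data directly and pull out the
# firmware revision ("fr") field:
sudo nvme id-ctrl /dev/nvme0 | grep -i '^fr '
```

The revision string reported there can then be compared against whatever the laptop vendor ships in its firmware update packages.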

Denis

Comment 3 Justin M. Forbes 2019-01-29 16:29:08 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.20.5-100.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.

If you experience different issues, please open a new bug report for those.

Comment 4 Justin M. Forbes 2019-02-21 21:13:43 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running, along with any data that might have been requested previously.