Bug 1729678 - Samsung 860 EVO SSD errors out and corrupts data with TRIM + NCQ on AMD chipsets
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 30
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-13 12:45 UTC by Solomon Peachy
Modified: 2019-12-05 15:37 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-17 20:09:25 UTC
Type: Bug
Embargoed:
pizza: needinfo-



Links
System        ID      Private  Priority  Status  Summary  Last Updated
Linux Kernel  201693  0        None      None    None     2019-08-05 13:06:08 UTC
Linux Kernel  203475  0        None      None    None     2019-08-05 13:06:07 UTC

Description Solomon Peachy 2019-07-13 12:45:46 UTC
1. Please describe the problem:

The Samsung 860 EVO SSD has issues with non-Intel SATA chipsets.  When NCQ is enabled, issuing a TRIM will sometimes cause I/O to stall, and errors like the following are logged:

[  332.792044] ata14.00: exception Emask 0x0 SAct 0x3fffe SErr 0x0 action 0x6 frozen
[  332.798271] ata14.00: failed command: SEND FPDMA QUEUED
[  332.804499] ata14.00: cmd 64/01:08:00:00:00/00:00:00:00:00/a0 tag 1 ncq dma 512 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  332.817145] ata14.00: status: { DRDY }
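
For anyone reading the log above: the SAct value on the first line is the port's SActive bitmask, with one bit per outstanding NCQ tag. The following is a minimal sketch for listing the tags set in such a mask; `decode_sact` is a hypothetical helper name used for illustration, not part of any existing tool.

```shell
# decode_sact MASK: print the NCQ tag numbers whose bits are set in MASK.
# Hypothetical helper for reading libata error lines; MASK may be hex (0x...).
decode_sact() {
    mask=$(( $1 ))
    tag=0
    tags=""
    while [ "$mask" -ne 0 ]; do
        # Bit N set means the command with tag N was still outstanding.
        if [ $(( mask & 1 )) -eq 1 ]; then
            tags="${tags:+$tags }$tag"
        fi
        mask=$(( mask >> 1 ))
        tag=$(( tag + 1 ))
    done
    printf '%s\n' "$tags"
}
```

Running ``decode_sact 0x3fffe`` on the value from the log lists tags 1 through 17, so 17 queued commands were outstanding when the port froze, including the failed SEND FPDMA QUEUED on tag 1.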

Data corruption sometimes occurs as well.

Disabling NCQ by setting the queue depth to 1 works around this problem, but it comes with a significant performance penalty:

  echo 1 > /sys/block/sda/device/queue_depth
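
The same workaround can be wrapped so that it reports the previous value and fails cleanly when the device is missing. This is only a sketch: `set_queue_depth` is a hypothetical helper name, and its optional third argument (a sysfs root, defaulting to /sys) exists purely so the function can be tried against a scratch directory. The sysfs path is the one used in the command above.

```shell
# set_queue_depth DEV DEPTH [SYSFS_ROOT]
# Hypothetical wrapper around the queue_depth workaround shown above.
# SYSFS_ROOT is an assumption added for dry runs; it defaults to /sys.
set_queue_depth() {
    dev=$1
    depth=$2
    root=${3:-/sys}
    attr="$root/block/$dev/device/queue_depth"
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr (missing device or not root?)" >&2
        return 1
    fi
    old=$(cat "$attr")
    # A depth of 1 leaves only one command in flight, i.e. no NCQ queueing.
    echo "$depth" > "$attr"
    echo "$dev: queue_depth $old -> $(cat "$attr")"
}
```

As root, ``set_queue_depth sda 1`` would apply the workaround to sda. Values written to sysfs are not kept across reboots, so the setting has to be reapplied on every boot.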

2. What is the Version-Release number of the kernel:

I first encountered it with 4.19.1, but it's still present in the current Fedora kernel-5.1.16-300.fc30.x86_64. 

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

No, this never worked properly. There are reports of this going back to at least 4.14.

(Note that this was not a regression for me; I installed the SSD on a Fedora system with a 4.19.1 kernel.)

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Trivially.  With the affected hardware, boot a Fedora (or indeed, any other Linux) system without disabling NCQ. Within a minute, the system will hiccup for tens of seconds at a time and start generating errors in the kernel log and in the SMART statistics.
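
To check whether a given boot hit the problem, one rough approach is to count the failed-command lines in the kernel log. The sketch below assumes the failures always appear with the same wording as in the log excerpt in this report; `count_ncq_trim_failures` is a hypothetical helper name.

```shell
# count_ncq_trim_failures: read kernel log text on stdin and print how many
# lines report a failed SEND FPDMA QUEUED command (hypothetical helper).
count_ncq_trim_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that.
    grep -c 'failed command: SEND FPDMA QUEUED' || true
}
```

For example, ``journalctl --no-hostname -k | count_ncq_trim_failures`` prints the number of such failures logged during the current boot.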

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

This was reported some time ago on the kernel.org bugzilla, but it has not been noticed or acted upon.  The referenced tickets include many logs and a proposed patch for the problem.  (The patch applies cleanly to the latest kernel.org git code.)

Supposedly this problem affects Windows as well, but Samsung has so far shown no interest in issuing a firmware update.

Comment 1 Hans de Goede 2019-07-14 11:16:12 UTC
Thank you for the detailed bug report. I believe that the problem with the upstream bugzilla.kernel.org (bko) bug reports is that they are still being assigned (by bugzilla) to Tejun Heo, who no longer maintains the upstream ata code; it is now maintained by Jens Axboe.

I've sent Jens Axboe an email about this and I've asked him to merge the patch from the bko203475 bugzilla.

Comment 2 Justin M. Forbes 2019-08-20 17:36:24 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 30 kernel bugs.

Fedora 30 has now been rebased to 5.2.9-200.fc30.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 31, and are still experiencing this issue, please change the version to Fedora 31.

If you experience different issues, please open a new bug report for those.

Comment 3 Justin M. Forbes 2019-09-17 20:09:25 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

