Bug 1013229

Summary: SATA Errors with Asus M2N-E - requires libata.force=noncq
Product: [Fedora] Fedora
Reporter: Rich Rauenzahn <rrauenza>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 18
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, marcelo.barbosa, rrauenza
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-02-05 22:25:48 UTC
Type: Bug

Description Rich Rauenzahn 2013-09-28 14:37:58 UTC
Description of problem:

[I'm filing this to document a problem and its solution.  I should probably also pull more system info out of /proc or /sys to properly document which SATA hardware is causing this.  Please let me know what other info you need.  There are lots of articles on the net about similar problems, each with its own "oh, try this" solution.  The fix below is a verified one taken from one of them: http://plone.lucidsolutions.co.nz/linux/io/ssd-on-nvidia-sata-port-generates-error-eh-in-swncq-mode-and-failed-command-read-fpdma-queued ]

I've had this problem for a while and thought I had bad cables or a bad drive.  The problem seemed to get worse with SSDs.  I tried various advice I'd seen around the Ubuntu forums about disabling things in the kernel (libata.noacpi... also tried libata.force=1.5Gbps).

For a day I convinced myself that swapping the drive's SATA port changed the behavior, which seemed to show that the fault was in the drive and not the port or cable.

Here are some example errors.

Sep 27 23:39:28 tendo kernel: [    5.137305] ata4.00: status: { DRDY ERR }
Sep 27 23:39:28 tendo kernel: [    5.137381] ata4.00: error: { ICRC ABRT }
Sep 27 23:39:28 tendo kernel: [    5.137470] ata4: hard resetting link
Sep 27 23:39:28 tendo kernel: [    5.137546] ata4: nv: skipping hardreset on occupied port
Sep 27 23:39:28 tendo kernel: [    5.591305] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 27 23:39:28 tendo kernel: [    5.598774] ata4.00: configured for UDMA/133
Sep 27 23:39:28 tendo kernel: [    5.598877] ata4: EH complete
Sep 27 23:39:28 tendo kernel: [    5.776100] EXT4-fs (dm-4): re-mounted. Opts: discard
Sep 27 23:39:28 tendo kernel: [    5.903544] kvm: disabled by bios
Sep 27 23:39:28 tendo kernel: [    6.006949] ata4: EH in SWNCQ mode,QC:qc_active 0x3FFFD sactive 0x3FFFD
Sep 27 23:39:28 tendo kernel: [    6.007070] ata4: SWNCQ:qc_active 0xFC defer_bits 0x3FF01 last_issue_tag 0x7
Sep 27 23:39:28 tendo kernel: [    6.007070]   dhfis 0x7C dmafis 0x0 sdbfis 0x2
Sep 27 23:39:28 tendo kernel: [    6.007205] ata4: ATA_REG 0x41 ERR_REG 0x84
Sep 27 23:39:28 tendo kernel: [    6.007285] ata4: tag : dhfis dmafis sdbfis sactive
Sep 27 23:39:28 tendo kernel: [    6.007364] ata4: tag 0x2: 1 0 0 1  
Sep 27 23:39:28 tendo kernel: [    6.007439] ata4: tag 0x3: 1 0 0 1  
Sep 27 23:39:28 tendo kernel: [    6.007520] ata4: tag 0x4: 1 0 0 1  
Sep 27 23:39:28 tendo kernel: [    6.007596] ata4: tag 0x5: 1 0 0 1  
Sep 27 23:39:28 tendo kernel: [    6.007675] ata4: tag 0x6: 1 0 0 1  
Sep 27 23:39:28 tendo kernel: [    6.007752] ata4: tag 0x7: 0 0 0 1  
Sep 27 23:39:28 tendo kernel: [    6.007838] ata4.00: exception Emask 0x1 SAct 0x3fffd SErr 0x0 action 0x6 frozen
Sep 27 23:39:28 tendo kernel: [    6.007962] ata4.00: Ata error. fis:0x21
Sep 27 23:39:28 tendo kernel: [    6.008062] ata4.00: failed command: READ FPDMA QUEUED
Sep 27 23:39:28 tendo kernel: [    6.008152] ata4.00: cmd 60/08:00:90:cc:c2/00:00:00:00:00/40 tag 0 ncq 4096 in
Sep 27 23:39:28 tendo kernel: [    6.008152]          res 41/84:38:38:cc:c2/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 23:39:28 tendo kernel: [    6.008394] ata4.00: status: { DRDY ERR }
Sep 27 23:39:28 tendo kernel: [    6.008480] ata4.00: error: { ICRC ABRT }
Sep 27 23:39:28 tendo kernel: [    6.008564] ata4.00: failed command: READ FPDMA QUEUED
Sep 27 23:39:28 tendo kernel: [    6.008652] ata4.00: cmd 60/08:10:10:cc:c2/00:00:00:00:00/40 tag 2 ncq 4096 in
Sep 27 23:39:28 tendo kernel: [    6.008652]          res 41/84:38:38:cc:c2/84:00:00:00:00/40 Emask 0x10 (ATA bus error)
Sep 27 23:39:28 tendo kernel: [    6.008873] ata4.00: status: { DRDY ERR }
Sep 27 23:39:28 tendo kernel: [    6.008949] ata4.00: error: { ICRC ABRT }

Here are the unique error messages from the past few months of logs:

grep -E 'ata[0-9]' kernel.log | perl -pe 's/^.*ata\S+://' | sort -u | grep err

 Ata error. fis:0x20
 Ata error. fis:0x21
 error: { ICRC ABRT }
 error: { IDNF }
 error: { UNC }
 SRST failed (errno=-16)
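
The same summary can be sketched in Python, which also makes it easy to see how often each message occurs. This is only an illustration: the function name and the regular expression are my own, and it assumes every libata message carries an "ataN:" or "ataN.NN:" prefix as in the excerpt above.

```python
import re
from collections import Counter

# Matches the "ataN:" or "ataN.NN:" prefix that precedes each libata message.
ATA_PREFIX = re.compile(r"\bata\d+(?:\.\d+)?:\s*(.*)$")


def unique_ata_messages(lines, keyword="err"):
    """Count distinct libata messages containing `keyword`.

    Timestamps, host name and port ID are stripped, so identical
    messages from different ports or times collapse into one entry.
    The match is case-sensitive, like the grep in the pipeline.
    """
    counts = Counter()
    for line in lines:
        match = ATA_PREFIX.search(line)
        if match is None:
            continue  # not a libata line
        message = match.group(1).strip()
        if keyword in message:
            counts[message] += 1
    return counts
```

Calling unique_ata_messages(open("kernel.log")) and printing the sorted items gives roughly the list above, with a count next to each message.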

If I'd paid closer attention to lines like this one, which are flagged ncq, I might have solved it sooner:

 cmd 60/08:20:10:6b:67/00:00:04:00:00/40 tag 4 ncq 4096 in

grep -E 'ata[0-9]' kernel.log | perl -pe 's/^.*ata\S+://' | grep ncq | wc -l
5148
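
For anyone reading these "cmd" lines, here is a rough decoding sketch. It assumes the field order that libata's error report appears to use (command/feature:count:LBA low:LBA mid:LBA high/the same five high-order bytes/device); the function name and dictionary keys are made up for illustration. Opcodes 0x60 and 0x61 are the ATA NCQ read and write commands, so a leading "60" already says that NCQ is in play.

```python
import re

# ATA opcodes for the two NCQ data commands.
NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}

CMD_LINE = re.compile(
    r"cmd\s+([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})",
    re.IGNORECASE,
)


def decode_cmd(line):
    """Decode a libata 'cmd ...' line; return None if it does not match."""
    match = CMD_LINE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    # Assumed order: feature, count, LBA low/mid/high, then the "hob"
    # (high-order byte) versions of the same five registers.
    feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in match.group(2).split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in match.group(3).split(":"))
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    info = {
        "opcode": opcode,
        "name": NCQ_OPCODES.get(opcode, "non-NCQ or unknown"),
        "is_ncq": opcode in NCQ_OPCODES,
        "lba": lba,
    }
    if info["is_ncq"]:
        # NCQ commands carry the sector count in the feature registers
        # and the queue tag in bits 7:3 of the count register.
        info["sectors"] = (hob_feature << 8) | feature
        info["tag"] = nsect >> 3
    return info
```

Fed the line above, this reports READ FPDMA QUEUED, tag 4 and 8 sectors, which agrees with the "tag 4 ncq 4096" that the kernel prints (8 sectors of 512 bytes).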


Version-Release number of selected component (if applicable):

Linux tendo 3.10.12-100.fc18.i686.PAE #1 SMP Mon Sep 16 13:16:09 UTC 2013 i686 i686 i386 GNU/Linux

How reproducible:

Seems to require bursts of I/O, such as at boot.

Steps to Reproduce:
1. Have a fairly old ASUS motherboard (M2N-E).
2. Upgrade to SSDs (mdadm-mirrored in my case).
3. Wonder why you're now getting lots of SATA errors.

Actual results:

kernel error messages

Expected results:

no kernel error messages

Additional info:

The fix is to boot with libata.force=noncq on the kernel command line.

With that option the problem is gone.  It would be beneficial if the SATA subsystem recognized buggy firmware or SATA controllers and either turned off NCQ itself or properly worked around whatever is going on.
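
To confirm that the option actually reached the running kernel, something like the following could be run against /proc/cmdline. It is a minimal sketch: the function names are hypothetical, and it assumes the comma-separated "[ID:]VAL" format that the kernel parameter documentation describes for libata.force, where an optional ID such as 4.00 limits the setting to a single port or device.

```python
def libata_force_entries(cmdline):
    """Return (id, value) pairs from every libata.force= option.

    `id` is None when the entry has no "PORT[.DEVICE]:" part.
    """
    entries = []
    for option in cmdline.split():
        if not option.startswith("libata.force="):
            continue
        for item in option.split("=", 1)[1].split(","):
            if not item:
                continue  # tolerate stray commas
            ident, sep, value = item.rpartition(":")
            entries.append((ident if sep else None, value))
    return entries


def ncq_disabled(cmdline):
    """True if any libata.force entry requests noncq."""
    return any(value == "noncq"
               for _, value in libata_force_entries(cmdline))
```

For example, ncq_disabled(open("/proc/cmdline").read()) should return True after rebooting with the option. How the option gets onto the command line depends on the bootloader; on Fedora, grubby --update-kernel=ALL --args="libata.force=noncq" is one way to make it persistent.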

Comment 1 Rich Rauenzahn 2013-09-28 14:41:33 UTC
Reading back over what I wrote, I implied that I had this error before I got the SSDs.  That might not be true.  I think I had other issues (Western Digital Green drives that were dying, hence the upgrade to SSDs).

I also saw these status lines:

grep -E 'ata[0-9]' kernel.log | perl -pe 's/^.*ata\S+://' | sort -u | grep status
 status: { DRDY }
 status: { DRDY ERR }

The SSDs are Samsung 840 120GB 6Gb/s SATA (MZ-7TD120BW).

Comment 2 Justin M. Forbes 2013-10-18 21:16:08 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.

Comment 3 Rich Rauenzahn 2013-10-18 21:46:53 UTC
I'm sorry, but I'm still on FC18 with no plan to upgrade anytime soon, and I won't be able to retest this any time soon.  I'm pretty sure this has not been fixed.

Comment 4 Fedora End Of Life 2013-12-21 14:37:45 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 5 Fedora End Of Life 2014-02-05 22:25:48 UTC
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.