Bug 630342

Summary: SATA errors: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Product: [Fedora] Fedora
Reporter: Andreas M. Kirchwitz <amk>
Component: libatasmart
Assignee: Lennart Poettering <lpoetter>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 14
CC: ade.rixon, anton, davidz, dougsland, dwysocha, gansalmon, itamar, jonathan, kernel-maint, lpoetter, madhu.chinakonda, mail, wnefal+redhatbugzilla
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-08-16 18:57:38 UTC
Attachments:
Description             Flags
smartctl -a /dev/sda    none
smartctl -a /dev/sdb    none

Description Andreas M. Kirchwitz 2010-09-04 21:29:01 UTC
Description of problem:

Periodically, both SATA drives (Samsung HD103SJ) connected to the on-board SATA controller (ASUS M2NPV-VM with Nvidia GeForce 6150 / nForce 430 / MCP51, driver sata_nv) give kernel errors like this:

=======================================================================
Sep  3 16:44:43 lakai kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep  3 16:44:43 lakai kernel: ata1.00: failed command: SMART
Sep  3 16:44:43 lakai kernel: ata1.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Sep  3 16:44:43 lakai kernel:         res 51/84:00:00:4f:c2/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Sep  3 16:44:43 lakai kernel: ata1.00: status: { DRDY ERR }
Sep  3 16:44:43 lakai kernel: ata1.00: error: { ICRC ABRT }
Sep  3 16:44:43 lakai kernel: ata1: hard resetting link
Sep  3 16:44:43 lakai kernel: ata1: nv: skipping hardreset on occupied port
Sep  3 16:44:43 lakai kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) 
Sep  3 16:44:43 lakai kernel: ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Sep  3 16:44:43 lakai kernel: ata1.00: revalidation failed (errno=-5)
Sep  3 16:44:48 lakai kernel: ata1: hard resetting link 
Sep  3 16:44:48 lakai kernel: ata1: nv: skipping hardreset on occupied port
Sep  3 16:44:49 lakai kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  3 16:44:49 lakai kernel: ata1.00: configured for UDMA/133
Sep  3 16:44:49 lakai kernel: ata1: EH complete
=======================================================================
Sep  4 03:44:43 lakai kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Sep  4 03:44:43 lakai kernel: ata1.00: failed command: CHECK POWER MODE
Sep  4 03:44:43 lakai kernel: ata1.00: cmd e5/00:00:00:00:00/00:00:00:00:00/00 tag 0
Sep  4 03:44:43 lakai kernel:         res 51/84:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
Sep  4 03:44:43 lakai kernel: ata1.00: status: { DRDY ERR }
Sep  4 03:44:43 lakai kernel: ata1.00: error: { ICRC ABRT }
Sep  4 03:44:43 lakai kernel: ata1: hard resetting link
Sep  4 03:44:43 lakai kernel: ata1: nv: skipping hardreset on occupied port
Sep  4 03:44:43 lakai kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 03:44:43 lakai kernel: ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Sep  4 03:44:43 lakai kernel: ata1.00: revalidation failed (errno=-5)
Sep  4 03:44:48 lakai kernel: ata1: hard resetting link
Sep  4 03:44:48 lakai kernel: ata1: nv: skipping hardreset on occupied port
Sep  4 03:44:48 lakai kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 03:44:48 lakai kernel: ata1.00: n_sectors mismatch 1953525168 != 268435455
Sep  4 03:44:48 lakai kernel: ata1.00: revalidation failed (errno=-19)
Sep  4 03:44:48 lakai kernel: ata1: limiting SATA link speed to 1.5 Gbps
Sep  4 03:44:48 lakai kernel: ata1.00: limiting speed to UDMA/133:PIO3
Sep  4 03:44:53 lakai kernel: ata1: hard resetting link
Sep  4 03:44:53 lakai kernel: ata1: nv: skipping hardreset on occupied port
Sep  4 03:44:54 lakai kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 03:44:54 lakai kernel: ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Sep  4 03:44:54 lakai kernel: ata1.00: revalidation failed (errno=-5)
Sep  4 03:44:54 lakai kernel: ata1.00: disabled
Sep  4 03:44:59 lakai kernel: ata1: hard resetting link
Sep  4 03:45:00 lakai kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  4 03:45:00 lakai kernel: ata1.00: FORCE: horkage modified (noncq)
Sep  4 03:45:00 lakai kernel: ata1.00: ATA-8: SAMSUNG HD103SJ, 1AJ10001, max UDMA/133
Sep  4 03:45:00 lakai kernel: ata1.00: 1953525168 sectors, multi 1: LBA48 NCQ (not used)
Sep  4 03:45:00 lakai kernel: ata1.00: configured for UDMA/133
Sep  4 03:45:00 lakai kernel: sd 0:0:0:0: rejecting I/O to offline device
Sep  4 03:45:00 lakai kernel: ata1: EH complete
=======================================================================
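
For readers who want to check the register values in the excerpt above, here is a minimal Python sketch that decodes the first two bytes of a "res" line (status register / error register) into the flag names the kernel prints. The function and table names are hypothetical; the bit values are the standard ATA status and error register bits. It also records that the mismatched sector count 268435455 equals 2**28 - 1, the largest sector count expressible in 28-bit LBA, which may hint that the drive briefly reported only its 28-bit capacity (an interpretation, not something the log states).

```python
import re

# Standard ATA status/error register bits, using the names libata prints.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Largest sector count that fits in a 28-bit LBA field.
LBA28_MAX = (1 << 28) - 1

# Matches the leading "status/error" byte pair of a libata "res" line.
RES_RE = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):")


def flag_names(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(line):
    """Decode a 'res xx/yy:...' log line into (status flags, error flags)."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return flag_names(status, STATUS_BITS), flag_names(error, ERROR_BITS)
```

Applied to "res 51/84:00:00:4f:c2/..." this yields the same DRDY/ERR and ICRC/ABRT flags that appear in the "status:" and "error:" lines of the log.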

The SATA errors also appear in the harddisk's internal SMART error log. Normal harddisk operation doesn't seem to trigger those errors, but I see them quite often with commands such as "IDENTIFY DEVICE", "CHECK POWER MODE" and "SMART READ ATTRIBUTE THRESHOLDS". Both drives are affected. The SATA cables are fine, and the power supply is not overloaded.

It can be triggered easily by running "smartctl" several times on the harddisks. Eventually, I will notice a delay of about 5 seconds, and then there's a SATA error in /var/log/messages (see above) and in the harddisk's internal SMART error log.

Usually, after the reset of the SATA link all is fine. But sometimes that bug happens during regular read/write operations, and then the RAID-1 is dissolved because the kernel thinks one drive has failed. This makes the machine very unreliable.

This bug looks to me exactly the same as bug #549981, except that this is for Fedora 13 and bug #549981 is for Fedora 12 (different kernel, I guess).

Those SATA errors have occurred with every kernel since I swapped my old PATA drives for brand-new SATA drives and installed Fedora 13. Even with the recent kernel 2.6.34.6-47.fc13.i686.PAE, the error happened again a few hours ago (the RAID-1 is currently rebuilding).

On the net, there are a lot of reports about SATA problems. I've tried kernel options like "libata.noacpi=1", "libata.force=noncq" and "sata_nv.adma=0", but none of them helped.
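
When testing options like these, it can help to confirm that they actually reached the running kernel (the "FORCE: horkage modified (noncq)" line in the log above suggests that libata.force=noncq did). The following is a minimal Python sketch, with hypothetical function names, that compares a kernel command line against the three options mentioned. It splits on whitespace only, so quoted parameters containing spaces are not handled.

```python
# Options named in the report; treated here as the expected settings.
TRIED_OPTIONS = {
    "libata.noacpi": "1",
    "libata.force": "noncq",
    "sata_nv.adma": "0",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_options(cmdline, wanted=TRIED_OPTIONS):
    """Return 'name=value' strings from `wanted` that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{name}={value}" for name, value in wanted.items()
            if params.get(name) != value]


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

For example, missing_options(read_running_cmdline()) returns an empty list only if all three options are active as written above.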

The SATA-related options in the motherboard's BIOS don't allow for any real configuration because for nForce chipsets, most features (like AHCI and NCQ) are done within the driver.

The problem doesn't seem to be specific to this type of SATA controller or SATA drive because bug #549981 describes exactly the same error.

I don't know what to do. Buy a new motherboard? New drives? Who guarantees that this will work any better? I've never had such horrible problems with PATA drives.

Version-Release number of selected component (if applicable):

Kernel 2.6.34.6-47.fc13.i686.PAE (currently running, but happened with all kernel releases since installation of Fedora 13)

How reproducible:

Luckily not always, but every couple of days.

Steps to Reproduce:
1. ASUS M2NPV-VM (motherboard) + SAMSUNG HD103SJ (harddisk)
2. Run "smartctl -a /dev/sda" (or sdb) several times.
3. If you notice a delay of about 5 seconds, you know it just happened again. (check /var/log/messages and SMART error log of harddisk)
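
The steps above can be sketched as a small Python script that runs smartctl repeatedly and flags unusually slow runs. This is only an illustration: the function names, the default of 20 runs, and the 4-second threshold (chosen to sit just below the roughly 5-second delay described) are assumptions, and smartctl normally needs root privileges.

```python
import subprocess
import time


def time_command(argv):
    """Run `argv` once, discard its output, and return the elapsed seconds."""
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start


def slow_runs(durations, threshold=4.0):
    """Return the indices of runs that took at least `threshold` seconds."""
    return [index for index, seconds in enumerate(durations)
            if seconds >= threshold]


def probe(device, runs=20, threshold=4.0):
    """Run 'smartctl -a <device>' `runs` times and report the slow runs."""
    durations = [time_command(["smartctl", "-a", device]) for _ in range(runs)]
    return slow_runs(durations, threshold)
```

A non-empty result from probe("/dev/sda") would be the cue to check /var/log/messages and the drive's SMART error log, as described in step 3.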
  
Actual results:

Kernel SATA errors of type "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6"
(see description above for excerpt from /var/log/messages) and SMART errors on harddisk.

Expected results:

No kernel SATA errors, no SMART errors on harddisk.

Additional info:

See bug #549981 for further details (different kernel, different hardware, same problem).

I have attached the output of "smartctl -a" for both drives. The drives are obviously in good condition. See the SMART Error Log in each attachment for further details.

Comment 1 Andreas M. Kirchwitz 2010-09-04 21:31:09 UTC
Created attachment 443102
smartctl -a /dev/sda

Comment 2 Andreas M. Kirchwitz 2010-09-04 21:31:46 UTC
Created attachment 443103
smartctl -a /dev/sdb

Comment 3 Bug Zapper 2011-05-31 14:27:08 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 4 Andreas M. Kirchwitz 2011-06-08 00:17:01 UTC
The problem still exists in Fedora 14 (presumably in Fedora 15 as well). This makes it difficult to install Fedora on systems with certain harddisk types.

The cause of the problem is a bug in the firmware of the popular Samsung HD103SJ drive (and of many other Samsung drives produced in the last two years, and basically of every harddisk with a buggy SMART implementation). The Samsung firmware has problems with certain SMART commands, which can result in data loss. That was a big story in the press a couple of months ago. Samsung released firmware updates for some F4 series drives but not for the older (still popular) F3 series.

Fedora triggers this bug when running "udisks-daemon". If a harddisk reports that it supports SMART, udev enables SMART. This is a bad idea because there are many incompatibilities between harddisks and motherboards. SMART shouldn't be enabled automatically.

I haven't found a kernel option to disable SMART permanently (like there is for features like NCQ). The only way to avoid SATA errors and harddisk crashes is to modify udev rules:

================================================================================

--- /lib/udev/rules.d/.80-udisks.rules_DISABLED 2011-02-01 15:38:24.000000000 +0100
+++ /lib/udev/rules.d/80-udisks.rules   2011-03-13 03:11:57.395714653 +0100
@@ -128,9 +128,9 @@
 # USB ATA enclosures with a SAT layer
-KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="usb", ENV{DEVTYPE}=="disk", IMPORT{program}="udisks-probe-ata-smart $tempnode"
+KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="usb", ENV{DEVTYPE}=="disk", IMPORT{program}="echo UDISKS_ATA_SMART_IS_AVAILABLE=0"
 
 # ATA disks driven by libata
-KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="udisks-probe-ata-smart $tempnode"
+KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="echo UDISKS_ATA_SMART_IS_AVAILABLE=0"
 
 # ATA disks connected via SAS (not driven by libata)
-KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="scsi", ENV{DEVTYPE}=="disk", ENV{ID_VENDOR}=="ATA", IMPORT{program}="udisks-probe-ata-smart $tempnode"
+KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="scsi", ENV{DEVTYPE}=="disk", ENV{ID_VENDOR}=="ATA", IMPORT{program}="echo UDISKS_ATA_SMART_IS_AVAILABLE=0"

================================================================================

With these modifications, SMART just stays as it is. If it was enabled before (smartctl -s on), it's kept enabled, and if there were good reasons to disable it (smartctl -s off / or BIOS settings), it stays disabled.
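
The same substitution as in the patch above can be expressed as a short Python sketch that operates on the text of a rules file. The function name is hypothetical, and the sketch only transforms a string; where to write the result is left open. A copy placed under /etc/udev/rules.d/ with the same file name normally takes precedence over the one in /lib/udev/rules.d/ and is less likely to be overwritten by a package update.

```python
# The probe call used by the stock udisks rules, and the stub from the patch
# that reports SMART as unavailable without touching the drive.
PROBE = 'IMPORT{program}="udisks-probe-ata-smart $tempnode"'
STUB = 'IMPORT{program}="echo UDISKS_ATA_SMART_IS_AVAILABLE=0"'


def disable_smart_probe(rules_text):
    """Replace every udisks SMART probe with the stub.

    Returns (new_text, count), where count is the number of rules changed.
    """
    count = rules_text.count(PROBE)
    return rules_text.replace(PROBE, STUB), count
```

On the rules file shown in the patch, the returned count should be 3, one for each of the USB, libata and SAS rules.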

Would be nice if that could be changed in Fedora so that it's possible again to install Fedora on systems where SMART isn't working properly. In a perfect world, SMART shouldn't be an issue nowadays, but even Samsung doesn't get it right for their drives.

Comment 5 Chuck Ebbert 2011-06-24 01:01:50 UTC
(In reply to comment #4)
> The problem still exists in Fedora 14 (presumably in Fedora 15 as well). This
> makes it difficult to install Fedora on systems with certain harddisk types.
> 
> The cause of the problem is a bug in the firmware of the popular Samsung
> HD103SJ drive (and a lot of other Samsung drives produced since the last 2
> years; and basically every harddisks with a buggy SMART implementation). The
> Samsung firmware has problems with certain SMART commands which can result in
> data loss. That was a big story in the press a couple of months ago. Samsung
> released firmware updates for some F4 series drives but not the old (still
> popular) F3 series.
> 
> Fedora triggers this bug when running "udisks-daemon". If a harddisk reports to
> support SMART, udev enables SMART. This is a bad idea because there are a lot
> of incompatibilities for harddisks and motherboards. SMART shouldn't be enabled
> automatically.

Reassigning to udisks ...

Comment 6 David Zeuthen 2011-06-24 15:00:57 UTC
I agree that the OS should never second-guess the user and enable SMART on a disk if it is already disabled. In this case, it is actually libatasmart that enables SMART automatically:

 http://git.0pointer.de/?p=libatasmart.git;a=blob;f=atasmart.c;h=a4b60c0eedf8e4f1ebafd932b7070c030459ef16;hb=HEAD#l2561

so reassigning to libatasmart.

Comment 7 Fedora End Of Life 2012-08-16 18:57:41 UTC
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advance warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping