Bug 870083 - smartctl -x causes a hard resetting link error on the Intel S2600GZ4 motherboard
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64 Unspecified
Priority: unspecified
Severity: low
Assigned To: David Milburn
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-10-25 10:27 EDT by Andrew J. Schorr
Modified: 2013-02-13 09:12 EST
Doc Type: Bug Fix
Last Closed: 2013-02-13 09:12:34 EST
Type: Bug


Attachments
dmesg output from a system exhibiting the smartctl -x problem (93.50 KB, text/plain)
2012-10-31 15:41 EDT, Andrew J. Schorr
Description Andrew J. Schorr 2012-10-25 10:27:01 EDT
Description of problem: On an Intel S2600GZ4 system, running "smartctl -x /dev/sd[a-z]" on a SATA disk triggers a "hard resetting link" error.


Version-Release number of selected component (if applicable):
smartmontools-5.43-1.fc16.x86_64

How reproducible: Run "smartctl -x /dev/sda", for example


Steps to Reproduce:
1. Run "smartctl -x /dev/sda" on a SATA disk
Actual results:  You will see error messages in /var/log/messages such as this:
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2024.947186] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2024.954477] ata8.00: failed command: SMART
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2024.959130] ata8.00: cmd b0/d6:01:e0:4f:c2/00:00:00:00:00/00 tag 0 pio 512 out
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2024.959130]          res d0/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2024.976330] ata8.00: status: { Busy }
Oct 25 10:25:07 ti19 kernel: [ID kern.info] [ 2024.980525] ata8: hard resetting link
Oct 25 10:25:07 ti19 kernel: [ID kern.info] [ 2025.143312] ata8.00: configured for UDMA/133
Oct 25 10:25:07 ti19 kernel: [ID kern.info] [ 2025.148175] ata8: EH complete
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2025.152192] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2025.159468] ata8.00: failed command: SMART
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2025.164140] ata8.00: cmd b0/d6:01:e0:4f:c2/00:00:00:00:00/00 tag 0 pio 512 out
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2025.164140]          res d0/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Oct 25 10:25:07 ti19 kernel: [ID kern.err] [ 2025.181354] ata8.00: status: { Busy }
Oct 25 10:25:07 ti19 kernel: [ID kern.info] [ 2025.185518] ata8: hard resetting link
Oct 25 10:25:08 ti19 kernel: [ID kern.info] [ 2025.640134] ata8.00: configured for UDMA/133
Oct 25 10:25:08 ti19 kernel: [ID kern.info] [ 2025.645017] ata8: EH complete
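
The pattern in the log excerpt above can be checked for mechanically. The following is a minimal Python sketch, not part of smartmontools or the kernel, that counts hard link resets following a failed SMART command, per ATA port. The function name `count_smart_resets` and the regular expressions are assumptions made for illustration; they are matched only against the message format shown above.

```python
import re
from collections import defaultdict

# Patterns assume the libata message format shown in the excerpt above.
FAILED_RE = re.compile(r"\b(ata\d+)\.\d+: failed command: (\S+)")
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")


def count_smart_resets(lines):
    """Return {port: count} of hard resets that follow a failed SMART command."""
    pending = set()            # ports with a failed SMART command not yet reset
    resets = defaultdict(int)
    for line in lines:
        m = FAILED_RE.search(line)
        if m and m.group(2) == "SMART":
            pending.add(m.group(1))
            continue
        m = RESET_RE.search(line)
        if m and m.group(1) in pending:
            resets[m.group(1)] += 1
            pending.discard(m.group(1))
    return dict(resets)
```

To use it, pass any iterable of log lines, for example an open file object for /var/log/messages. A reset that is not preceded by a failed SMART command on the same port is deliberately ignored.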

If a SMART self-test is running at the time, it fails with status
"Interrupted (host reset)".


Expected results: No error.


Additional info:  This system is running the latest kernel 3.6.2-1.fc16.x86_64.  I have seen this on 4 different types of SATA disks.
Comment 1 Andrew J. Schorr 2012-10-25 10:28:41 EDT
Note: this problem does not occur when running "smartctl -a /dev/sda".

Regards,
Andy
Comment 2 Christian Franke 2012-10-26 17:22:13 EDT
"-x" is the same as "-H -i -g all -c -A -f brief -l xerror,error -l xselftest,selftest -l selective -l directory -l scttemp -l scterc -l devstat -l sataphy", see man page.

According to the kernel logs above, the hard resets occur after a failing SMART WRITE LOG command to the SCT Command log at address 0xe0. The smartctl "-l sct..." options use this command to READ(!) the info from the drive: the SCT request is sent by writing to this log.
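
For readers who want to check this reading of the log, here is a small Python sketch that decodes the "cmd" dump printed by libata. It assumes the usual libata field order (command/feature:count:LBA low:LBA mid:LBA high/five HOB bytes/device); the function name `decode_taskfile` and the two lookup tables are illustrative and cover only the values relevant to this report.

```python
# Field order of the libata "cmd" dump (assumed):
# command/feature:nsect:lbal:lbam:lbah/<five HOB bytes>/device
SMART_FEATURES = {0xD0: "SMART READ DATA", 0xD5: "SMART READ LOG", 0xD6: "SMART WRITE LOG"}
LOG_ADDRESSES = {0xE0: "SCT Command/Status", 0xE1: "SCT Data Transfer"}


def decode_taskfile(dump):
    """Decode a string such as 'b0/d6:01:e0:4f:c2/00:00:00:00:00/00'."""
    cmd, low, _hob, dev = dump.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    command = int(cmd, 16)
    info = {"command": command, "feature": feature, "sector_count": nsect,
            "lba_low": lbal, "device": int(dev, 16)}
    # 0xB0 is the ATA SMART command; LBA mid/high carry the 0x4F/0xC2 signature.
    info["is_smart"] = command == 0xB0 and (lbam, lbah) == (0x4F, 0xC2)
    if info["is_smart"]:
        info["operation"] = SMART_FEATURES.get(feature, "unknown")
        # For READ LOG / WRITE LOG, LBA low holds the log address.
        info["log_name"] = LOG_ADDRESSES.get(lbal, "unknown")
    return info
```

Applied to the dump from the report, `decode_taskfile("b0/d6:01:e0:4f:c2/00:00:00:00:00/00")` identifies a one-sector SMART WRITE LOG to the SCT Command/Status log.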

Please test whether
- the problem occurs if only "-l scterc" or "-l scttemp" is used, and
- the problem does not occur if any or all of the other options (included in "-x" but not in "-a") are used: "-l xerror -l xselftest -l directory -l devstat -l sataphy".
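
The set of options to test can be derived mechanically. In this sketch, `X_OPTIONS` is the expansion of "-x" quoted above; `A_OPTIONS` is the ATA expansion of "-a" as I recall it from the smartctl man page, which is an assumption since it is not quoted in this report. The helper names are hypothetical.

```python
X_OPTIONS = ("-H -i -g all -c -A -f brief -l xerror,error -l xselftest,selftest "
             "-l selective -l directory -l scttemp -l scterc -l devstat -l sataphy")
# Assumption: ATA expansion of "-a" per the smartctl man page.
A_OPTIONS = "-H -i -c -A -l error -l selftest -l selective"


def log_types(options):
    """Return the '-l' arguments in order; 'a,b' (try a, else b) is reduced to a."""
    words = options.split()
    return [words[i + 1].split(",")[0]
            for i, w in enumerate(words[:-1]) if w == "-l"]


def extra_log_types(x_opts=X_OPTIONS, a_opts=A_OPTIONS):
    """Log types requested by -x but not by -a."""
    in_a = set(log_types(a_opts))
    return [t for t in log_types(x_opts) if t not in in_a]


def isolation_commands(device):
    """One smartctl command line per extra log type, for isolating the failure."""
    return [["smartctl", "-l", t, device] for t in extra_log_types()]
```

Running each command from `isolation_commands("/dev/sdb")` separately (as root) and checking the kernel log after each one would show which log type triggers the reset.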

If this is the case, I presume there is a problem in the (PIO) DATA OUT pass-through implementation of this specific SATA driver. Smartmontools uses common code for generic Linux SATA: it translates ATA commands into SAT ATA PASS-THROUGH SCSI commands and submits them via the SG_IO ioctl.
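
To make the translation concrete, the following Python sketch builds a 16-byte SAT ATA PASS-THROUGH CDB for the failing command (SMART WRITE LOG, PIO Data-Out). It follows the CDB layout of the SAT specification as I understand it and is not copied from smartmontools, which may set additional bits such as CK_COND. The function name `smart_write_log_cdb` is hypothetical, and actually submitting the CDB through the SG_IO ioctl is not shown.

```python
ATA_PASS_THROUGH_16 = 0x85   # SAT operation code
PROTO_PIO_DATA_OUT = 5       # SAT protocol field value for PIO Data-Out


def smart_write_log_cdb(log_address, sectors=1):
    """Build a 16-byte ATA PASS-THROUGH CDB for SMART WRITE LOG (28-bit command)."""
    cdb = bytearray(16)
    cdb[0] = ATA_PASS_THROUGH_16
    cdb[1] = PROTO_PIO_DATA_OUT << 1   # EXTEND bit (bit 0) left clear
    # T_DIR=0 (to device), BYT_BLOK=1 (count in blocks), T_LENGTH=2 (sector count)
    cdb[2] = (1 << 2) | 2
    cdb[4] = 0xD6                      # features: SMART WRITE LOG
    cdb[6] = sectors & 0xFF            # sector count
    cdb[8] = log_address & 0xFF        # LBA low: log address
    cdb[10] = 0x4F                     # LBA mid: SMART signature
    cdb[12] = 0xC2                     # LBA high: SMART signature
    cdb[13] = 0x00                     # device
    cdb[14] = 0xB0                     # command: SMART
    return bytes(cdb)
```

The register values in `smart_write_log_cdb(0xE0)` correspond to the taskfile b0/d6:01:e0:4f:c2 seen in the kernel log.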
Comment 3 Andrew J. Schorr 2012-10-27 15:09:20 EDT
1. Yes, "smartctl -l scterc /dev/sdb" causes the hard resetting link error.

2. Yes, "smartctl -l scttemp /dev/sdb" causes the hard resetting link error.

3. Correct, "smartctl -l xerror -l xselftest -l directory -l devstat -l sataphy /dev/sdb" does not cause any errors.

Thanks,
Andy
Comment 4 Michal Hlavinka 2012-10-31 12:30:38 EDT
Hi Christian, thanks for looking at this.

David: I've been told that you are our expert on this part of the kernel. Could you look at it?
Comment 5 David Milburn 2012-10-31 14:24:11 EDT
Would you please attach the output of dmesg after bootup? This will show the controller, driver, and drives you are using.
Comment 6 Andrew J. Schorr 2012-10-31 15:41:14 EDT
Created attachment 636281
dmesg output from a system exhibiting the smartctl -x problem
Comment 7 David Milburn 2012-12-20 17:34:34 EST
Sorry for the delay. I was able to reproduce the problem on linux-3.6.6 using a Hitachi HDS721016CLA382, and I verified that this upstream commit fixes the issue:


commit 49bd665c5407a453736d3232ee58f2906b42e83c
Author: Maciej Patelczyk <maciej.patelczyk@intel.com>
Date:   Mon Oct 15 14:29:03 2012 +0200

    [SCSI] isci: copy fis 0x34 response into proper buffer
    
    SATA MICROCODE DOWNALOAD fails on isci driver. After receiving Register
    Device to Host (FIS 0x34) frame Initiator resets phy.
    In the frame handler routine response (FIS 0x34) was copied into wrong
    buffer and upper layer did not receive any answer which resulted in
    timeout and reset.
    This patch corrects this bug.
    
    Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com>
    Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: James Bottomley <JBottomley@Parallels.com>

diff --git a/drivers/scsi/isci/request.c b/drivers/scsi/isci/request.c
index c1bafc3..9594ab6 100644
--- a/drivers/scsi/isci/request.c
+++ b/drivers/scsi/isci/request.c
@@ -1972,7 +1972,7 @@ sci_io_request_frame_handler(struct isci_request *ireq,
                                                                      frame_index,
                                                                      (void **)&frame_buff
 
-                       sci_controller_copy_sata_response(&ireq->stp.req,
+                       sci_controller_copy_sata_response(&ireq->stp.rsp,
                                                               frame_header,
                                                               frame_buffer);
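
The effect described in the commit message can be illustrated with a toy model. This Python sketch does not mirror the actual isci data structures; the class, field, and function names are hypothetical stand-ins that only show why storing the response in the wrong buffer leaves the upper layer without an answer.

```python
class StpRequest:
    """Toy stand-in for per-request STP state (real isci layout not modelled)."""
    def __init__(self):
        self.req = b"request-side state"   # not where a response belongs
        self.rsp = None                    # response buffer read by the upper layer


def frame_handler(stp, frame, fixed):
    """Store a received Register Device to Host FIS (type 0x34)."""
    if fixed:
        stp.rsp = bytes(frame)             # patched behaviour: response buffer
    else:
        stp.req = bytes(frame)             # pre-patch behaviour: wrong buffer


def upper_layer_result(stp):
    """The upper layer only looks at rsp; no response models the timeout and reset."""
    return "completed" if stp.rsp is not None else "timeout -> reset"
```

With `fixed=False` the model reports "timeout -> reset", matching the hard resets in the original report; with `fixed=True` the request completes.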
Comment 8 Fedora End Of Life 2013-01-16 08:24:50 EST
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 9 Fedora End Of Life 2013-02-13 09:12:38 EST
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
