Bug 609134

Summary: mpt2sas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
Product: Red Hat Enterprise Linux 5
Reporter: starlight
Component: kernel
Assignee: Tomas Henzl <thenzl>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: low
Version: 5.5
CC: pasteur
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-03-04 15:13:24 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
'dmesg' errors captured in '/var/log/messages' | none
kernel messages from failure | none
kernel messages from corresponding boot | none
kernel messages from failure with logging_level=0x1F8 | none
boot-time messages with logging_level=0x1F8 | none
firmware events from boot and failure | none
miscellaneous information from 'lsiutil' | none
boot-time information from 'lsiutil' | none

Description starlight 2010-06-29 13:57:47 UTC
Created attachment 427690 [details]
'dmesg' errors captured in '/var/log/messages'

Description of problem:

The following kernel.org bug also appears in RHEL 5.5:

https://bugzilla.kernel.org/show_bug.cgi?id=14831

Version-Release number of selected component (if applicable):

RHEL 5.5 kernel 2.6.18-194.3.1.el5
MPT2BIOS 7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00

RHEL 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00

How reproducible:

Configure LSI 2008 in Supermicro 1026T-URF for JBOD
operation with eight drives.  Create an 'lvm2' striped
volume (RAID0).  Configure and activate 'smartd'.  Write
data to the volume at varying rates.  Slower write rates
seem more likely to produce the fault.

  
Actual results:

Controller resets and drops drive.  Reboot and
drive recovers.  Disable 'smartd' and it functions
correctly.

Expected results:

Should work perfectly.

Additional info:

'dmesg' errors in attachment.

Comment 1 starlight 2010-08-16 21:58:04 UTC
Created attachment 439020 [details]
kernel messages from failure

Happened again with 'smartd' disabled and with *latest*
kernel, LSI device driver and LSI IT (initiator target)
firmware.  Took 37 days of uptime for it to happen.
Failure was during moderate write activity rather than
light activity as with the 'smartd' pass-through
transactions.  Kernel messages attached.

RHEL 5.5 kernel 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00

Comment 2 starlight 2010-08-16 21:58:28 UTC
Created attachment 439021 [details]
kernel messages from corresponding boot

Comment 3 starlight 2010-08-28 15:21:52 UTC
Created attachment 441691 [details]
kernel messages from failure with logging_level=0x1F8

Another controller crash, this time with logging_level=0x1F8 set per upstream developer's instruction.
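For anyone reproducing this, a minimal sketch of how the logging level might be raised at runtime.  It assumes the 'mpt2sas' module exposes a writable 'logging_level' parameter under /sys/module on the running kernel; the function name and the optional second argument are hypothetical and exist only so the sketch can be exercised without the driver loaded.

```shell
#!/bin/sh
# Sketch only: write a logging level to the mpt2sas module parameter.
# Assumption: the parameter lives at the sysfs path below and is writable
# by root; check your kernel before relying on this.
set_logging_level() {
    level=$1
    # Second argument (hypothetical) overrides the target file for testing.
    param_file=${2:-/sys/module/mpt2sas/parameters/logging_level}
    if [ ! -w "$param_file" ]; then
        echo "cannot write $param_file (driver not loaded, or not root?)" >&2
        return 1
    fi
    echo "$level" > "$param_file"
}
```

Run as root, 'set_logging_level 0x1F8' would apply the value used in this report.  To capture boot-time messages the level would instead have to be given as a module option (on RHEL 5 presumably an 'options mpt2sas logging_level=0x1F8' line in /etc/modprobe.conf), which is likewise an assumption and not something confirmed in this bug.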

Comment 4 starlight 2010-08-28 15:33:33 UTC
Created attachment 441692 [details]
boot-time messages with logging_level=0x1F8

Comment 5 starlight 2010-08-28 15:36:24 UTC
Created attachment 441694 [details]
firmware events from boot and failure

seq 0001-0016 from boot; seq 0017 from failure

Comment 6 starlight 2010-08-28 15:37:23 UTC
Created attachment 441695 [details]
miscellaneous information from 'lsiutil'

Comment 7 starlight 2010-08-28 15:40:47 UTC
Created attachment 441697 [details]
boot-time information from 'lsiutil'

Comment 8 starlight 2011-10-16 01:48:04 UTC
This issue was determined to result from a
Seagate ST9500420AS drive firmware bug where
after a month or two of head unload/load
operations from aggressive power saving
the drive would seize up.  Solution was
to run

   hdparm -B 255 /dev/sd?
   hdparm -M 254 /dev/sd?

to disable the head load/unload behavior.
Commands must be run from a newer kernel
since 'hdparm' does not work in 2.6.18(rhel5)
due to bug 608981.  Once applied, boot back
to 2.6.18 *without* powering down the server.
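The two commands above can be applied to several drives in one pass.  The following is a rough sketch, not a tested procedure: the function name and the DRY_RUN switch are hypothetical, and by default it only prints what it would run so the device list can be reviewed first.

```shell
#!/bin/sh
# Sketch only: apply the two hdparm settings from this comment to each drive.
# DRY_RUN (hypothetical switch) defaults to 1, which prints the commands
# instead of executing them.  Set DRY_RUN=0 and run as root to apply.
apply_power_settings() {
    for dev in "$@"; do
        for args in "-B 255" "-M 254"; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "hdparm $args $dev"
            else
                # $args is left unquoted on purpose so it splits into
                # the flag and its value.
                hdparm $args "$dev" || echo "failed: hdparm $args $dev" >&2
            fi
        done
    done
}
```

For example, 'apply_power_settings /dev/sd?' lists the commands for every matching device; 'DRY_RUN=0 apply_power_settings /dev/sd?' would execute them.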

In addition, this adjustment should dramatically
extend the life of the drives.  The system
has been running flawlessly 24x7 now for 350
days since the last boot.

Overlooked closing the bug when the solution
was determined.

It should be closed now.

Comment 9 starlight 2011-10-16 01:54:49 UTC
Posted wrong bug citation.  The 'hdparm' issue is bug 548263.

Comment 10 Tomas Henzl 2013-03-04 13:08:52 UTC
Starlight,
RHEL 5.9 uses a 2.6.18-348... kernel; please test your issue with our latest kernel. It is highly likely that this was already fixed by a driver update. Please also update your mpt2sas firmware.
Thanks, Tomas

Comment 11 starlight 2013-03-04 15:02:16 UTC
As described above, the issue was largely the
result of a firmware bug in Seagate Momentus
drives and a viable work-around exists.

Just restarted 'smartd' and ran 'smartctl'
and the pass-thru seems to work ok.  No
'syslog' errors even though 'mpt2sas'
is running with logging_level=0x1F8.

Current 'mpt2sas' version is 09.101.00.00.
Kernel 2.6.18-308.24.el5.

Comment 12 Tomas Henzl 2013-03-04 15:13:24 UTC
(In reply to comment #11)
> As described above, the issue was largely the
> result of a firmware bug in Seagate Momentus
> drives and a viable work-around exists.
Oh, shame on me; I should have read the bz more thoroughly...

Closing the bz and thanks for the fast response.