Bug 609134 - mpt2sas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Tomas Henzl
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-06-29 13:57 UTC by starlight
Modified: 2013-03-04 15:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-04 15:13:24 UTC
Target Upstream Version:
Embargoed:


Attachments
'dmesg' errors captured in '/var/log/messages' (20.57 KB, text/plain)
2010-06-29 13:57 UTC, starlight
kernel messages from failure (59.13 KB, text/plain)
2010-08-16 21:58 UTC, starlight
kernel messages from corresponding boot (57.45 KB, text/plain)
2010-08-16 21:58 UTC, starlight
kernel messages from failure with logging_level=0x1F8 (38.33 KB, text/plain)
2010-08-28 15:21 UTC, starlight
boot-time messages with logging_level=0x1F8 (63.46 KB, text/plain)
2010-08-28 15:33 UTC, starlight
firmware events from boot and failure (1.96 KB, text/plain)
2010-08-28 15:36 UTC, starlight
miscellaneous information from 'lsiutil' (59.63 KB, text/plain)
2010-08-28 15:37 UTC, starlight
boot-time information from 'lsiutil' (4.64 KB, text/plain)
2010-08-28 15:40 UTC, starlight

Description starlight 2010-06-29 13:57:47 UTC
Created attachment 427690 [details]
'dmesg' errors captured in '/var/log/messages'

Description of problem:

Kernel.org bug also appears in RHEL 5.5

https://bugzilla.kernel.org/show_bug.cgi?id=14831

Version-Release number of selected component (if applicable):

RHEL 5.5 kernel 2.6.18-194.3.1.el5
MPT2BIOS 7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00

RHEL 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00

How reproducible:

Configure LSI 2008 in Supermicro 1026T-URF for JBOD
operation with eight drives.  Create an 'lvm2' striped
volume (RAID0).  Configure and activate 'smartd'.  Write
data to the volume at varying rates.  Slower rates seem
more likely to produce the fault.

  
Actual results:

Controller resets and drops drive.  Reboot and
drive recovers.  Disable 'smartd' and it functions
correctly.

Expected results:

Should work perfectly.

Additional info:

'dmesg' errors in attachment.

Comment 1 starlight 2010-08-16 21:58:04 UTC
Created attachment 439020 [details]
kernel messages from failure

Happened again with 'smartd' disabled and with *latest*
kernel, LSI device driver and LSI IT (initiator target)
firmware.  Took 37 days of uptime for it to happen.
Failure was during moderate write activity rather than
light activity as with the 'smartd' pass-through
transactions.  Kernel messages attached.

kernel 5.5 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00

Comment 2 starlight 2010-08-16 21:58:28 UTC
Created attachment 439021 [details]
kernel messages from corresponding boot

Comment 3 starlight 2010-08-28 15:21:52 UTC
Created attachment 441691 [details]
kernel messages from failure with logging_level=0x1F8

Another controller crash, this time with logging_level=0x1F8 set per upstream developer's instruction.
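
For readers unfamiliar with the mask: 0x1F8 is a bit field, and the sketch below only lists which bit positions it sets. The helper name `bits_of` is hypothetical, and the meaning of each bit is defined by the driver's debug header and is not reproduced here. The trailing comment assumes the value was passed as an mpt2sas module option, which is an assumption about how the setting was applied.

```shell
#!/bin/sh
# Hypothetical helper: print the positions of the bits set in a mask,
# lowest bit first. It does not interpret what each bit means to mpt2sas.
bits_of() {
    mask=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    bit=0
    out=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out }$bit"
        fi
        mask=$(( mask >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "$out"
}

# Assumption: the level is supplied when the module is loaded, e.g.
#   modprobe mpt2sas logging_level=0x1F8
```

Calling `bits_of 0x1F8` lists bits 3 through 8, so six logging categories are enabled at this level.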

Comment 4 starlight 2010-08-28 15:33:33 UTC
Created attachment 441692 [details]
boot-time messages with logging_level=0x1F8

Comment 5 starlight 2010-08-28 15:36:24 UTC
Created attachment 441694 [details]
firmware events from boot and failure

seq 0001-0016 from boot, 0017 from failure

Comment 6 starlight 2010-08-28 15:37:23 UTC
Created attachment 441695 [details]
miscellaneous information from 'lsiutil'

Comment 7 starlight 2010-08-28 15:40:47 UTC
Created attachment 441697 [details]
boot-time information from 'lsiutil'

Comment 8 starlight 2011-10-16 01:48:04 UTC
This issue was determined to result from a
Seagate ST9500420AS drive firmware bug where
after a month or two of head unload/load
operations from aggressive power saving
the drive would seize up.  Solution was
to run

   hdparm -B 255 /dev/sd?
   hdparm -M 254 /dev/sd?

to disable the head load/unload behavior.
Commands must be run from a newer kernel
since 'hdparm' does not work in 2.6.18(rhel5)
due to bug 608981.  Once applied, boot back
to 2.6.18 *without* powering down the server.
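
A minimal sketch of applying the two settings to several drives at once, assuming a kernel on which 'hdparm' works. The function name and the DRY_RUN switch are illustrative and not part of 'hdparm'; only the '-B 255' and '-M 254' options come from the commands above. By default the sketch prints the commands instead of running them.

```shell
#!/bin/sh
# Illustrative wrapper around the two hdparm commands from this comment.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 runs it as root.
DRY_RUN=${DRY_RUN:-1}

apply_head_park_workaround() {
    for dev in "$@"; do
        for opt in "-B 255" "-M 254"; do
            if [ "$DRY_RUN" = 1 ]; then
                echo "hdparm $opt $dev"
            else
                # $opt is left unquoted on purpose so it splits into flag and value
                hdparm $opt "$dev" || echo "hdparm $opt failed on $dev" >&2
            fi
        done
    done
}

# Example: apply_head_park_workaround /dev/sd?
```

Reviewing the dry-run output before setting DRY_RUN=0 is a cheap check that the '/dev/sd?' glob matched only the intended drives.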

In addition this adjustment will improve the 
life of the drives dramatically.  The system
has been running flawlessly 24x7 now for 350
days since the last boot.

Overlooked closing the bug when the solution
was determined.

It should be closed now.

Comment 9 starlight 2011-10-16 01:54:49 UTC
Posted wrong bug citation.  The 'hdparm' issue is bug 548263.

Comment 10 Tomas Henzl 2013-03-04 13:08:52 UTC
Starlight,
RHEL 5.9 uses a 2.6.18-348... kernel; please test your issue with our latest kernel. It is highly likely that this was already fixed by a driver update. Please also update your mpt2sas firmware.
Thanks, Tomas

Comment 11 starlight 2013-03-04 15:02:16 UTC
As described above, the issue was largely the
result of a firmware bug in Seagate Momentus
drives and a viable work-around exists.

Just restarted 'smartd' and ran 'smartctl'
and the pass-thru seems to work ok.  No
'syslog' errors even though 'mpt2sas'
is running with logging_level=0x1F8.

Current 'mpt2sas' version is 09.101.00.00.
Kernel 2.6.18-308.24.el5.

Comment 12 Tomas Henzl 2013-03-04 15:13:24 UTC
(In reply to comment #11)
> As described above, the issue was largely the
> result of a firmware bug in Seagate Momentus
> drives and a viable work-around exists.
Oh, shame on me, I should have read the bz more thoroughly...

Closing the bz and thanks for the fast response.

