609134 – mpt2sas - Use of ATA command pass-through results in unreliable operation - drive / controller resets

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Tomas Henzl
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-06-29 13:57 UTC by starlight
Modified:	2013-03-04 15:13 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-03-04 15:13:24 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
'dmesg' errors captured in '/var/log/messages' (20.57 KB, text/plain) 2010-06-29 13:57 UTC, starlight	no flags
kernel messages from failure (59.13 KB, text/plain) 2010-08-16 21:58 UTC, starlight	no flags
kernel messages from corresponding boot (57.45 KB, text/plain) 2010-08-16 21:58 UTC, starlight	no flags
kernel messages from failure with logging_level=0x1F8 (38.33 KB, text/plain) 2010-08-28 15:21 UTC, starlight	no flags
boot-time messages with logging_level=0x1F8 (63.46 KB, text/plain) 2010-08-28 15:33 UTC, starlight	no flags
firmware events from boot and failure (1.96 KB, text/plain) 2010-08-28 15:36 UTC, starlight	no flags
miscellaneous information from 'lsiutil' (59.63 KB, text/plain) 2010-08-28 15:37 UTC, starlight	no flags
boot-time information from 'lsiutil' (4.64 KB, text/plain) 2010-08-28 15:40 UTC, starlight	no flags

Description starlight 2010-06-29 13:57:47 UTC

Created attachment 427690 [details]
'dmesg' errors captured in '/var/log/messages'

Description of problem:

The following kernel.org bug also appears in RHEL 5.5:

https://bugzilla.kernel.org/show_bug.cgi?id=14831

Version-Release number of selected component (if applicable):

RHEL 5.5 kernel 2.6.18-194.3.1.el5
MPT2BIOS 7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00

RHEL 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00

How reproducible:

Configure an LSI 2008 controller in a Supermicro
1026T-URF for JBOD operation with eight drives.
Create an 'lvm2' striped volume (RAID0).  Configure
and activate 'smartd'.  Write data to the volume at
varying rates.  Slower rates seem more likely to
produce the fault.

  
Actual results:

The controller resets and drops a drive.  After a
reboot the drive recovers.  With 'smartd' disabled
the system functions correctly.

Expected results:

Should work perfectly.

Additional info:

'dmesg' errors in attachment.

Comment 1 starlight 2010-08-16 21:58:04 UTC

Created attachment 439020 [details]
kernel messages from failure

Happened again with 'smartd' disabled and with *latest*
kernel, LSI device driver and LSI IT (initiator target)
firmware.  Took 37 days of uptime for it to happen.
Failure was during moderate write activity rather than
light activity as with the 'smartd' pass-through
transactions.  Kernel messages attached.

RHEL 5.5 kernel 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00

Comment 2 starlight 2010-08-16 21:58:28 UTC

Created attachment 439021 [details]
kernel messages from corresponding boot

Comment 3 starlight 2010-08-28 15:21:52 UTC

Created attachment 441691 [details]
kernel messages from failure with logging_level=0x1F8

Another controller crash, this time with logging_level=0x1F8 set per upstream developer's instruction.

Comment 4 starlight 2010-08-28 15:33:33 UTC

Created attachment 441692 [details]
boot-time messages with logging_level=0x1F8

Comment 5 starlight 2010-08-28 15:36:24 UTC

Created attachment 441694 [details]
firmware events from boot and failure

seq 0001-0016 from boot, 0017 from failure

Comment 6 starlight 2010-08-28 15:37:23 UTC

Created attachment 441695 [details]
miscellaneous information from 'lsiutil'

Comment 7 starlight 2010-08-28 15:40:47 UTC

Created attachment 441697 [details]
boot-time information from 'lsiutil'

Comment 8 starlight 2011-10-16 01:48:04 UTC

This issue was determined to result from a
Seagate ST9500420AS drive firmware bug where,
after a month or two of head unload/load
operations from aggressive power saving,
the drive would seize up.  The solution was
to run

   hdparm -B 255 /dev/sd?
   hdparm -M 254 /dev/sd?

to disable the head load/unload behavior.
Commands must be run from a newer kernel
since 'hdparm' does not work in 2.6.18(rhel5)
due to bug 608981.  Once applied, boot back
to 2.6.18 *without* powering down the server.
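
A minimal sketch of applying the two 'hdparm' settings above to several drives in one pass. The function name apply_apm_workaround and the DRY_RUN variable are made up for illustration and are not part of 'hdparm'. Per the hdparm man page, -B 255 asks the drive to disable Advanced Power Management and -M 254 selects the fastest acoustic setting; not every drive accepts either. Unless DRY_RUN is set to 0, the function only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: apply the two hdparm settings to each device given.
# DRY_RUN defaults to 1, so nothing is changed unless DRY_RUN=0 is set.
apply_apm_workaround() {
    for dev in "$@"; do
        for args in "-B 255" "-M 254"; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                # Dry run: show the command instead of executing it.
                echo "hdparm $args $dev"
            else
                # $args is left unquoted on purpose so it splits into
                # the flag and its value.
                hdparm $args "$dev" || echo "hdparm $args failed on $dev" >&2
            fi
        done
    done
}

# Usage (as root, from a kernel where 'hdparm' works):
#   DRY_RUN=0; apply_apm_workaround /dev/sda /dev/sdb
```

Listing devices explicitly rather than using a /dev/sd? glob avoids touching drives that are not affected.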

In addition this adjustment will improve the 
life of the drives dramatically.  The system
has been running flawlessly 24x7 now for 350
days since the last boot.
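
One way to check that the head load/unload cycling has stopped is to watch the drive's load cycle counter over time. This is a sketch under assumptions not stated in this report: that the drive reports SMART attribute 193 (Load_Cycle_Count), and that 'smartctl -A' prints the raw value in the tenth column of its attribute table. The helper name load_cycle_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read 'smartctl -A' output on stdin and print the
# raw value of the Load_Cycle_Count attribute (assumed to be column 10).
load_cycle_count() {
    awk '$2 == "Load_Cycle_Count" { print $10 }'
}

# Usage (needs root and smartmontools):
#   smartctl -A /dev/sda | load_cycle_count
```

If the value printed stays roughly constant between two readings taken some hours apart, the drive is no longer parking its heads aggressively.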

Overlooked closing the bug when the solution
was determined.

It should be closed now.

Comment 9 starlight 2011-10-16 01:54:49 UTC

Posted wrong bug citation.  The 'hdparm' issue is bug 548263.

Comment 10 Tomas Henzl 2013-03-04 13:08:52 UTC

Starlight,
RHEL 5.9 uses a 2.6.18-348... kernel; please test your issue with our latest kernel. It is quite likely that this was already fixed by a driver update. Please also update your mpt2sas firmware.
Thanks, Tomas

Comment 11 starlight 2013-03-04 15:02:16 UTC

As described above, the issue was largely the
result of a firmware bug in Seagate Momentus
drives and a viable work-around exists.

Just restarted 'smartd' and ran 'smartctl'
and the pass-thru seems to work ok.  No
'syslog' errors even though 'mpt2sas'
is running with logging_level=0x1F8.

Current 'mpt2sas' version is 09.101.00.00.
Kernel 2.6.18-308.24.el5.

Comment 12 Tomas Henzl 2013-03-04 15:13:24 UTC

(In reply to comment #11)
> As described above, the issue was largely the
> result of a firmware bug in Seagate Momentus
> drives and a viable work-around exists.
Oh, shame on me, I should have read the bz more thoroughly...

Closing the bz and thanks for the fast response.
