Created attachment 427690 [details] 'dmesg' errors captured in '/var/log/messages'

Description of problem:
Kernel.org bug also appears in RHEL 5.5
https://bugzilla.kernel.org/show_bug.cgi?id=14831

Version-Release number of selected component (if applicable):
RHEL 5.5
kernel 2.6.18-194.3.1.el5
MPT2BIOS 7.05.01.00 (2010.09.09)
SAS2008-IT 5.00.00.00

RHEL 5.4
kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00

How reproducible:
Configure LSI 2008 in Supermicro 1026T-URF for JBOD operation with eight drives. 'lvm2' striped volume (RAID0). Configure and activate 'smartd'. Write data to volume at varying rates. Slower seems more likely to produce fault.

Actual results:
Controller resets and drops drive. Reboot and drive recovers. Disable 'smartd' and it functions correctly.

Expected results:
Should work perfectly.

Additional info:
'dmesg' errors in attachment.
Created attachment 439020 [details] kernel messages from failure

Happened again with 'smartd' disabled and with *latest* kernel, LSI device driver and LSI IT (initiator target) firmware. Took 37 days of uptime for it to happen. Failure was during moderate write activity rather than light activity as with the 'smartd' pass-through transactions. Kernel messages attached.

kernel 5.5 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00
Created attachment 439021 [details] kernel messages from corresponding boot
Created attachment 441691 [details] kernel messages from failure with logging_level=0x1F8

Another controller crash, this time with logging_level=0x1F8 set per upstream developer's instruction.
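The 'mpt2sas' logging_level parameter is a bitmask, so 0x1F8 turns on several logging categories at once. Which category each bit enables is defined in the driver source and is not stated in this report, so the sketch below (Python 3 assumed; 'set_bits' is a hypothetical helper name) only lists the bit positions that are set.

```python
def set_bits(mask):
    """Return the positions of the bits set in mask, lowest bit first."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


level = 0x1F8  # value used in the comment above
print(hex(level), "=", level, "-> bits", set_bits(level))
# prints: 0x1f8 = 504 -> bits [3, 4, 5, 6, 7, 8]
```

Bits 0 through 2 are left clear by this value.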
Created attachment 441692 [details] boot-time messages with logging_level=0x1F8
Created attachment 441694 [details] firmware events from boot and failure

seq 0001-0016 from boot, 0017 from failure
Created attachment 441695 [details] miscellaneous information from 'lsiutil'
Created attachment 441697 [details] boot-time information from 'lsiutil'
This issue was determined to result from a Seagate ST9500420AS drive firmware bug: after a month or two of head unload/load operations caused by aggressive power saving, the drive would seize up. The solution was to run

    hdparm -B 255 /dev/sd?
    hdparm -M 254 /dev/sd?

to disable the head load/unload behavior. The commands must be run from a newer kernel, since 'hdparm' does not work in 2.6.18 (rhel5) due to bug 608981. Once applied, boot back to 2.6.18 *without* powering down the server. In addition, this adjustment will improve the life of the drives dramatically.

The system has been running flawlessly 24x7 now for 350 days since the last boot.

Overlooked closing the bug when the solution was determined. It should be closed now.
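For anyone applying the same work-around to many drives, the two 'hdparm' invocations can be generated per device. This is only a rough sketch, not what was run on the affected system: it assumes Python 3, that the array members show up as /dev/sd? block devices, and the helper names 'build_commands' and 'apply_settings' are made up for illustration. It defaults to a dry run that prints the commands; actually executing them needs root and a kernel on which 'hdparm' works.

```python
import glob
import subprocess

# Flags and values copied from the comment above:
# -B 255 and -M 254, applied to every drive.
HDPARM_SETTINGS = [("-B", "255"), ("-M", "254")]


def build_commands(devices):
    """Return one hdparm argv list per (setting, device) pair."""
    return [["hdparm", flag, value, dev]
            for flag, value in HDPARM_SETTINGS
            for dev in devices]


def apply_settings(devices, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = build_commands(devices)
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if hdparm fails
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Assumption: drives are named /dev/sda, /dev/sdb, ... as implied
    # by the /dev/sd? pattern used in the comment.
    apply_settings(sorted(glob.glob("/dev/sd?")), dry_run=True)
```

Review the dry-run output first, then call apply_settings(..., dry_run=False) as root once the device list looks right.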
Posted wrong bug citation. The 'hdparm' issue is bug 548263.
Starlight,

RHEL 5.9 uses a 2.6.18-348... kernel; please test your issue with our latest kernel. It is quite likely that this was already fixed by a driver update. Also update your mpt2sas firmware.

Thanks,
Tomas
As described above, the issue was largely the result of a firmware bug in Seagate Momentus drives and a viable work-around exists. Just restarted 'smartd' and ran 'smartctl' and the pass-thru seems to work ok. No 'syslog' errors even though 'mpt2sas' is running with logging_level=0x1F8. Current 'mpt2sas' version is 09.101.00.00. Kernel 2.6.18-308.24.el5.
(In reply to comment #11)
> As described above, the issue was largely the
> result of a firmware bug in Seagate Momentus
> drives and a viable work-around exists.

Oh, shame on me I should have read the bz more thoroughly...

Closing the bz and thanks for the fast response.