Escalated to Bugzilla from IssueTracker
Description of problem:
When we execute the smartctl command against a hard disk connected through a MegaRAID-SAS controller, the disk goes off-line and becomes inoperable.

Version-Release number of selected component:
RHEL5 Server RC Snapshot3
kernel version: 2.6.18-1.2839.el5
smartctl version: 5.36-3.1.el5

How reproducible:
100%

Steps to Reproduce:
1. Execute the smartctl command against /dev/sda, which contains the Linux system.
   # smartctl -a /dev/sda
2. Wait for a while.

Actual results:
1. We get the following messages after executing the smartctl command.

   smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
   Home page is http://smartmontools.sourceforge.net/
   Device: LSI MegaRAID 8300XLP Version: 2.02
   Device type: disk
   Local Time is: Fri Dec 15 18:31:57 2006 JST
   Device does not support SMART
   Error Counter logging not supported
   Device does not support Self Test logging

2. After a while, the following messages appear continuously, and we cannot stop them.

   megasas: [ 0]waiting for 25 commands to complete
   megasas: [ 5]waiting for 25 commands to complete
   megasas: [10]waiting for 25 commands to complete
   megasas: [15]waiting for 25 commands to complete
   megasas: [20]waiting for 25 commands to complete
   megasas: [25]waiting for 25 commands to complete
   ...

3. If we execute the shutdown command, the following messages appear continuously, and we cannot stop them either.

   megasas: cannot recover from previous reset failures
   end_request: I/O error, dev sda, sector 46755821
   Buffer I/O error on device sda2, logical block 5818372
   lost page write due to I/O error on sda2
   Buffer I/O error on device sda2, logical block 5818373
   ...
   end_request: I/O error, dev sda, sector 7024621
   Aborting journal on device sda2.
   end_request: I/O error, dev sda, sector 7024653
   ...
   EXT3-fs error (device sda2) in ext3_dirty_inode: Journal has aborted
   sd 0:2:0:0: rejecting I/O to offline device
   ext3_abort called.
   EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
   Remounting filesystem read-only
   end_request: I/O error, dev sda, sector 7024861
   end_request: I/O error, dev sda, sector 7024893
   EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
   EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
   end_request: I/O error, dev sda, sector 7024965
   ...
   Buffer I/O error on device sda2, logical block 1
   lost page write due to I/O error on sda2
   sd 0:2:0:0: rejecting I/O to offline device
   sd 0:2:0:0: rejecting I/O to offline device

4. As a result, the hard disk goes off-line, we cannot execute even the ps command, and we get the following message.

   #ps
   -bash: ps: command not found

Expected results:
Smartctl may print error messages, but should not make the device inoperable.

Hardware info:
Reproduction of this problem is confirmed on the following platform.
Model  : Express5800/120Rg-1
Cpu    : Intel(R) Xeon(TM) CPU 3.20GHz x 1
Mem    : 2GB
kernel : 2.6.18-1.2839
HBA    : LSI Logic MegaRAID 8300XLP

Business impact:
If customers use the smartctl command on systems using MegaRAID SAS, all services on their system are stopped, which is a critical problem.

This event sent from IssueTracker by dmilburn [Support Engineering Group] issue 109733
When issuing "smartctl -a" we do see the error handler waking up and a bus device reset (BDR) being sent to the device; this is handled through megasas_generic_reset(). Looking at this function, the cmd is 0x37, which is the optional "read defect data" command.

<Mar/12 12:29 pm>nec-em17.rhts.boston.redhat.com login: Error handler scsi_eh_0 waking up
<Mar/12 12:29 pm>Total of 2 commands on 1 devices require eh work
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd:0xffff81012eaeb980
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd failed:0xffff81012eaeb980
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd:0xffff81012eaeb200
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd failed:0xffff81012eaeb200
<Mar/12 12:29 pm>scsi_eh_0: Sending BDR sdev: 0xffff81012f394800
<Mar/12 12:29 pm>sd 0:2:0:0: megasas: RESET -172560 cmd=37
<Mar/12 12:29 pm>megasas: [ 0]waiting for 2 commands to complete
I commented out the portion of the smartctl code that gets the defect data, but the system still gets into the bad state:

/*
if (SCSI_PT_DIRECT_ACCESS == peripheral_type) {
    scsiPrintGrownDefectListLen(fd);
    if (gSeagateCacheLPage)
        scsiPrintSeagateCacheLPage(fd);
    if (gSeagateFactoryLPage)
        scsiPrintSeagateFactoryLPage(fd);
}
*/

# ./smartctl -a /dev/sda
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: LSI MegaRAID 8300XLP Version: 2.02
Device type: disk
Local Time is: Thu Mar 15 14:57:39 2007 EDT
Device does not support SMART
Error Counter logging not supported
Device does not support Self Test logging

===Here is what should happen===

# smartctl -a /dev/sda
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: LSILOGIC 1030 IM IM Version: 1000
Device type: disk
Local Time is: Thu Mar 15 15:55:34 2007 EDT
Device does not support SMART
Error Counter logging not supported
Error Events logging not supported
Device does not support Self Test logging
Console output for Comment#10 shows a mode sense (0x1a) command:

login: sd 0:2:0:0: megasas: RESET -39422 cmd=1a
megasas: [ 0]waiting for 11 commands to complete
. . .
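The cmd= value in each megasas RESET line is the SCSI opcode of the command that timed out, in hex. Below is a minimal sketch for decoding those lines when comparing console captures. The function names are hypothetical helpers written for this note, not part of the megaraid_sas driver or smartmontools, and the table only covers opcodes that appear in this report plus LOG SENSE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Names for the SCSI opcodes seen in this report (plus LOG SENSE).
 * Hypothetical helper; deliberately not an exhaustive table. */
static const char *scsi_opcode_name(unsigned int opcode)
{
    switch (opcode) {
    case 0x03: return "REQUEST SENSE";
    case 0x1a: return "MODE SENSE(6)";
    case 0x2a: return "WRITE(10)";
    case 0x37: return "READ DEFECT DATA(10)";
    case 0x4d: return "LOG SENSE";
    default:   return "unknown";
    }
}

/* Extract the hex opcode that follows "cmd=" in a console line such as
 * "sd 0:2:0:0: megasas: RESET -172560 cmd=37".
 * Returns -1 when the line has no parsable "cmd=" field. */
static int megasas_reset_opcode(const char *line)
{
    const char *p;
    char *end;
    unsigned long value;

    assert(line != NULL);
    p = strstr(line, "cmd=");
    if (p == NULL)
        return -1;
    p += 4;                          /* skip past "cmd=" */
    value = strtoul(p, &end, 16);    /* the driver prints the opcode in hex */
    if (end == p || value > 0xff)    /* no digits, or not a one-byte opcode */
        return -1;
    return (int)value;
}
```

Passing the result of megasas_reset_opcode() to scsi_opcode_name() gives the command name for a RESET line. The opcode varies between runs (0x37, 0x1a, 0x2a), which suggests the reset is reported against whichever command happened to be outstanding rather than against one specific command.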
If con->checksmart is set to FALSE the system doesn't get into the bad state. With this option set, "smartctl -a" goes down the following path, which appears to be what puts the controller into the bad state:

scsiGetSmartData() -> scsiCheckIE() (read informational exception log page)

=======smartctl.c========
case 'a':
    con->driveinfo = TRUE;
    con->checksmart = FALSE;
    con->generalsmartvalues = TRUE;
    con->smartvendorattrib = TRUE;
    con->smarterrorlog = TRUE;
    con->smartselftestlog = TRUE;
    con->selectivetestlog = TRUE;
    break;
========================
After setting con->checksmart back to TRUE, I traced into scsiCheckIE(). Since hasIELogPage is 0, the code does not actually do the LOG SENSE to read the informational exception page.

scsiCheckIE: hasIELogPage 0 IE_LPAGE 0x2f <=====

The scsiCheckIE function will instead do a Request Sense, and this is what is causing the device to get into the weird state. If I comment out this portion there is no problem.

if (0 == sense_info.asc) {
    /* ties in with MRIE field of 6 in IEC mode page (0x1c) */
    if ((err = scsiRequestSense(device, &sense_info))) {
        pout("Request Sense failed, [%s]\n", scsiErrString(err));
        return err;
    }
}

scsiRequestSense: cdb[0] 0x3 cbd[4] 18
scsiRequestSense: io_hdr.cmnd_len 6 io_hdr.max_sense_len 32 io_hdr.timeout 6
scsiRequestSense: io_hdr.dxfer_dir 1 io_hdr.dxfer_len 18
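For reference, the trace above corresponds to a 6-byte REQUEST SENSE command (opcode 0x03) with an allocation length of 18 bytes. Below is a minimal sketch of that CDB and of reading the ASC field back out of fixed-format sense data. The function and struct names are hypothetical and written for this note; this is not the smartmontools implementation, and the code only builds and parses buffers without sending anything to a device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REQUEST_SENSE_OPCODE  0x03
#define REQUEST_SENSE_CDB_LEN 6

/* Fill a 6-byte CDB with REQUEST SENSE, asking for alloc_len bytes of
 * sense data (18 in the trace above). Hypothetical helper. */
static void build_request_sense_cdb(unsigned char cdb[REQUEST_SENSE_CDB_LEN],
                                    unsigned char alloc_len)
{
    assert(cdb != NULL);
    memset(cdb, 0, REQUEST_SENSE_CDB_LEN);
    cdb[0] = REQUEST_SENSE_OPCODE;
    cdb[4] = alloc_len;              /* allocation length */
}

/* Fields of interest from fixed-format sense data (hypothetical struct). */
struct sense_summary {
    unsigned char response_code;     /* 0x70 current, 0x71 deferred */
    unsigned char sense_key;
    unsigned char asc;               /* additional sense code */
    unsigned char ascq;              /* additional sense code qualifier */
};

/* Parse fixed-format sense data. Returns 0 on success, -1 if the buffer
 * is too short or is not in fixed format. */
static int parse_fixed_sense(const unsigned char *buf, size_t len,
                             struct sense_summary *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 14)                    /* need bytes 0..13 to reach ASC/ASCQ */
        return -1;
    out->response_code = buf[0] & 0x7f;
    if (out->response_code != 0x70 && out->response_code != 0x71)
        return -1;
    out->sense_key = buf[2] & 0x0f;
    out->asc = buf[12];
    out->ascq = buf[13];
    return 0;
}
```

An 18-byte allocation length is enough to cover the ASC and ASCQ bytes at offsets 12 and 13, which is the field the quoted condition (0 == sense_info.asc) tests before issuing the Request Sense.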
Kindly let me know the firmware version of the adapter. Could you also check whether a newer FW is available for the adapter and try with it?
Sumant,

We are seeing a RHEL5 (2.6.18-8.el5) system with an LSI Logic MegaRAID 8300XLP run without problems until "smartctl -a /dev/sda" is executed. Then the system becomes inoperable: we see a reset, then waiting for commands to complete, and eventually filesystem errors, and the device is marked off-line (see Comment#1).

megasas: RESET -9459 cmd=2a
megasas: [ 0]waiting for 25 commands to complete
megasas: [ 5]waiting for 25 commands to complete

This is also reproducible on an i686 system running 2.6.9-42.ELsmp. If I comment out the above smartctl code (Comment#13) to prevent the request sense, then the system doesn't get into this bad state. Do you have any thoughts on this?

Thank you.
I suspect the FW is not handling the cmd correctly and that the problem is not OS dependent. The latest FW may have already fixed it. Please let me know the FW version you are trying with and I will verify and get back to you.
Sumant,

Here is the firmware version:

-------------------------------------------------------------------------
LSI MegaRAID SAS-MFI BIOS Version MT25 (Build March 06, 2006)
Copyright(c) 2006 LSI Logic Corporation
HA -0 (Bus 8 Dev 3) MegaRAID SAS 8300XLP
FW package: 5.0.1-0032
-------------------------------------------------------------------------

It does look like there is an updated firmware, 5.1.1-0039.
I could not locate the package 5.0.1.0032 today. I verified with package 5.1.1.0033. Did not see the issue. So, kindly test with the package 5.1.1-0039.
Sumant,

We found that updating the firmware to 5.1.1-0020 corrected the problem. After issuing "smartctl -a /dev/sda" the system is still functional. Thanks a lot for looking into this problem.
Closing bug, as updating the firmware solves this issue, and is not a Red Hat issue, AFAIK.