220368 – [RHEL5 RC Snapshot3] MegaRAID SAS becomes unoperational after smartctl access

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Tom Coughlan
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	217103 227613 228988 243319

Reported:	2006-12-20 18:43 UTC by Issue Tracker
Modified:	2007-11-30 22:07 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-03-27 16:40:36 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Issue Tracker 2006-12-20 18:43:48 UTC

Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2006-12-20 18:44:01 UTC

Description of problem:
When we run the smartctl command against a hard disk connected to a MegaRAID SAS controller, the disk goes offline and becomes inoperable.

Version-Release number of selected component:
RHEL5 Server RC Snapshot3
kernel version: 2.6.18-1.2839.el5
smartctl version: 5.36-3.1.el5

How reproducible:
100%

Steps to Reproduce:
1. Run smartctl against /dev/sda, which contains the Linux system.
# smartctl -a /dev/sda

2. Wait for a while.

Actual results:
1. We get the following messages after running the smartctl command.

smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: LSI MegaRAID 8300XLP Version: 2.02
Device type: disk
Local Time is: Fri Dec 15 18:31:57 2006 JST
Device does not support SMART

Error Counter logging not supported
Device does not support Self Test logging

2. After a while, the following messages keep appearing, and we cannot stop them.

megasas: [ 0]waiting for 25 commands to complete
megasas: [ 5]waiting for 25 commands to complete
megasas: [10]waiting for 25 commands to complete
megasas: [15]waiting for 25 commands to complete
megasas: [20]waiting for 25 commands to complete
megasas: [25]waiting for 25 commands to complete
...

3. If we run the shutdown command, the following messages keep appearing, and we cannot stop these either.

megasas: cannot recover from previous reset failures
end_request: I/O error, dev sda, sector 46755821
Buffer I/O error on device sda2, logical block 5818372
lost page write due to I/O error on sda2
Buffer I/O error on device sda2, logical block 5818373
...
end_request: I/O error, dev sda, sector 7024621
Aborting journal on device sda2.
end_request: I/O error, dev sda, sector 7024653
...
EXT3-fs error (device sda2) in ext3_dirty_inode: Journal has aborted
sd 0:2:0:0: rejecting I/O to offline device
ext3_abort called.
EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
end_request: I/O error, dev sda, sector 7024861
end_request: I/O error, dev sda, sector 7024893
EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
end_request: I/O error, dev sda, sector 7024965
...
Buffer I/O error on device sda2, logical block 1
lost page write due to I/O error on sda2
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device

4. As a result the hard disk goes offline, we cannot run even the ps command, and we get the following message.
# ps
-bash: ps: command not found

Expected results:
Smartctl may print error messages, but should not make the device unoperational.

Hardware info:
The problem has been reproduced on the following platform.

Model : Express5800/120Rg-1
Cpu : Intel(R) Xeon(TM) CPU 3.20GHz x 1
Mem : 2GB
kernel : 2.6.18-1.2839
HBA : LSI Logic MegaRAID 8300XLP

Business impact:
If customers use the smartctl command on systems using MegaRAID SAS, all services on their system stop, which is a critical problem.

This event sent from IssueTracker by dmilburn [Support Engineering Group]
issue 109733

Comment 9 David Milburn 2007-03-12 18:16:28 UTC

When issuing "smartctl -a" we do see the error handler waking up and a bus
device reset being sent to the device; this is handled through
megasas_generic_reset(). Looking at this function, the cmd is 0x37, which is an
optional "read defect data" command.

<Mar/12 12:29 pm>nec-em17.rhts.boston.redhat.com login: Error handler scsi_eh_0 waking up
<Mar/12 12:29 pm>Total of 2 commands on 1 devices require eh work
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd:0xffff81012eaeb980
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd failed:0xffff81012eaeb980
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd:0xffff81012eaeb200
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd failed:0xffff81012eaeb200
<Mar/12 12:29 pm>scsi_eh_0: Sending BDR sdev: 0xffff81012f394800
<Mar/12 12:29 pm>sd 0:2:0:0: megasas: RESET -172560 cmd=37
<Mar/12 12:29 pm>megasas: [ 0]waiting for 2 commands to complete
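The "cmd=" value in the megasas RESET lines in this report is the SCSI opcode in hex (0x37 above, 1a and 2a in later comments). As a reading aid, here is a minimal sketch that pulls the opcode out of such a line and names it. The function names are hypothetical and the table only covers opcodes mentioned in this report; the opcode names are the standard SCSI ones, not something taken from the driver.

```c
#include <stdio.h>
#include <string.h>

/* Map a SCSI opcode (first CDB byte) to its standard command name.
 * Only opcodes that come up in this bug report are listed. */
const char *scsi_opcode_name(unsigned int opcode)
{
    switch (opcode) {
    case 0x03: return "REQUEST SENSE";
    case 0x1a: return "MODE SENSE(6)";
    case 0x2a: return "WRITE(10)";
    case 0x37: return "READ DEFECT DATA(10)";
    case 0x4d: return "LOG SENSE";
    default:   return "unknown";
    }
}

/* Extract the hex opcode from a console line such as
 * "sd 0:2:0:0: megasas: RESET -172560 cmd=37".
 * Returns 0 on success, -1 if the line has no parsable "cmd=" field. */
int megasas_reset_opcode(const char *line, unsigned int *opcode)
{
    const char *p = strstr(line, "cmd=");

    if (p == NULL || sscanf(p + 4, "%x", opcode) != 1)
        return -1;
    return 0;
}
```

Feeding it the RESET line above yields 0x37, which the table names READ DEFECT DATA(10); lines without a "cmd=" field, such as the "waiting for N commands" messages, are rejected.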

Comment 10 David Milburn 2007-03-15 19:13:09 UTC

I commented out the portion of the smartctl code that gets the defect data, but the
system still gets into the bad state:

        /*
        if (SCSI_PT_DIRECT_ACCESS == peripheral_type) {
            scsiPrintGrownDefectListLen(fd);
            if (gSeagateCacheLPage)
                scsiPrintSeagateCacheLPage(fd);
            if (gSeagateFactoryLPage)
                scsiPrintSeagateFactoryLPage(fd);
        }
        */

# ./smartctl -a /dev/sda
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: LSI      MegaRAID 8300XLP Version: 2.02
Device type: disk
Local Time is: Thu Mar 15 14:57:39 2007 EDT
Device does not support SMART

Error Counter logging not supported
Device does not support Self Test logging

===Here is what should happen===
# smartctl -a /dev/sda
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: LSILOGIC 1030 IM       IM Version: 1000
Device type: disk
Local Time is: Thu Mar 15 15:55:34 2007 EDT
Device does not support SMART

Error Counter logging not supported

Error Events logging not supported
Device does not support Self Test logging

Comment 11 David Milburn 2007-03-15 19:28:23 UTC

Console output for Comment#10 shows a mode sense (0x1a) command:

login: sd 0:2:0:0: megasas: RESET -39422 cmd=1a
megasas: [ 0]waiting for 11 commands to complete
  .
  .
  .

Comment 12 David Milburn 2007-03-15 21:44:53 UTC

If con->checksmart is set to FALSE the system doesn't get into the bad state.

With this option set, "smartctl -a" should be going down the following path,
which is what causes the controller to get into the bad state:

scsiGetSmartData() -> scsiCheckIE() (read informational exception log page)

=======smartctl.c========
    case 'a':
      con->driveinfo          = TRUE;
      con->checksmart         = FALSE;  /* changed from TRUE for this test */
      con->generalsmartvalues = TRUE;
      con->smartvendorattrib  = TRUE;
      con->smarterrorlog      = TRUE;
      con->smartselftestlog   = TRUE;
      con->selectivetestlog   = TRUE;
      break;
========================

Comment 13 David Milburn 2007-03-16 20:29:57 UTC

After setting con->checksmart back to TRUE, I traced into scsiCheckIE(). Since
hasIELogPage is 0, the code does not actually do the LOG SENSE to read the
informational exception page.

scsiCheckIE: hasIELogPage 0 IE_LPAGE 0x2f <=====

The scsiCheckIE function instead does a Request Sense, and this is what is
causing the device to get into the weird state. If I comment out this portion
there is no problem.

    if (0 == sense_info.asc) {
        /* ties in with MRIE field of 6 in IEC mode page (0x1c) */
        if ((err = scsiRequestSense(device, &sense_info))) {
            pout("Request Sense failed, [%s]\n", scsiErrString(err));
            return err;
        }
    }

scsiRequestSense: cdb[0] 0x3 cbd[4] 18
scsiRequestSense: io_hdr.cmnd_len 6 io_hdr.max_sense_len 32 io_hdr.timeout 6
scsiRequestSense: io_hdr.dxfer_dir 1 io_hdr.dxfer_len 18

Comment 17 Sumant Patro 2007-03-23 17:13:52 UTC

Kindly let me know the firmware version of the adapter.

Could you also check whether a newer FW is available for the adapter and try
it?

Comment 18 David Milburn 2007-03-23 17:15:32 UTC

Sumant,

We are seeing a RHEL5 (2.6.18-8.el5) system with an LSI Logic MegaRAID 8300XLP
run without problems until "smartctl -a /dev/sda" is executed. Then the system
becomes unoperational: we see a reset, then waiting for commands to complete,
and eventually filesystem errors, and the device is marked off-line (see Comment#1).

megasas: RESET -9459 cmd=2a
megasas: [ 0]waiting for 25 commands to complete
megasas: [ 5]waiting for 25 commands to complete

This is also reproducible on an i686 system running 2.6.9-42.ELsmp.

If I comment out the above smartctl code (Comment#13) to prevent the request
sense then the system doesn't get into this bad state. Do you have any thoughts
on this? Thank you.

Comment 19 Sumant Patro 2007-03-23 17:23:00 UTC

I suspect the FW is not handling the cmd correctly and that the problem is not OS dependent.

The latest FW may have already fixed it. Please let me know the FW version you
are trying with and I will verify and get back to you.

Comment 20 David Milburn 2007-03-23 18:20:44 UTC

Sumant,

Here is the firmware version:
-------------------------------------------------------------------------
LSI MegaRAID SAS-MFI BIOS Version MT25 (Build March 06, 2006)
Copyright(c) 2006 LSI Logic Corporation

HA -0 (Bus 8 Dev 3) MegaRAID SAS 8300XLP
FW package: 5.0.1-0032
-------------------------------------------------------------------------

It does look like there is an updated firmware 5.1.1-0039.

Comment 21 Sumant Patro 2007-03-24 00:07:31 UTC


I could not locate the package 5.0.1.0032 today.

I verified with package 5.1.1.0033. Did not see the issue.

So, kindly test with the package 5.1.1-0039.

Comment 22 David Milburn 2007-03-27 15:24:11 UTC

Sumant,

We found that updating the firmware to 5.1.1-0020 corrected the problem. After
issuing the "smartctl -a /dev/sda" the system is still functional.

Thanks a lot for looking into this problem.
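The firmware package versions in this thread are written both with a hyphen (5.0.1-0032) and with a dot (5.1.1.0033) before the build number. Below is a minimal sketch for parsing and ordering such strings when checking which package an adapter is running. The struct and function names are hypothetical, and it assumes the four fields order numerically from left to right, which the thread does not confirm.

```c
#include <stdio.h>

/* Four numeric fields of a package version such as "5.1.1-0020". */
struct fw_version {
    int major;
    int minor;
    int patch;
    int build;
};

/* Parse "A.B.C-DDDD" or "A.B.C.DDDD".
 * Returns 0 on success, -1 if the string does not match. */
int parse_fw_version(const char *s, struct fw_version *v)
{
    /* %*1[.-] consumes exactly one separator, '.' or '-', without storing it. */
    int n = sscanf(s, "%d.%d.%d%*1[.-]%d",
                   &v->major, &v->minor, &v->patch, &v->build);

    return n == 4 ? 0 : -1;
}

/* Field-by-field comparison (assumed ordering, see note above).
 * Returns <0, 0 or >0 when a is older than, equal to or newer than b. */
int compare_fw_version(const struct fw_version *a, const struct fw_version *b)
{
    if (a->major != b->major) return a->major < b->major ? -1 : 1;
    if (a->minor != b->minor) return a->minor < b->minor ? -1 : 1;
    if (a->patch != b->patch) return a->patch < b->patch ? -1 : 1;
    if (a->build != b->build) return a->build < b->build ? -1 : 1;
    return 0;
}
```

Under that assumption, 5.0.1-0032 (the failing package) compares as older than 5.1.1-0020 (the package reported to work in Comment 22).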

Comment 23 Andrius Benokraitis 2007-03-27 16:40:36 UTC

Closing bug, as updating the firmware solves this issue, and this is not a Red Hat
issue, AFAIK.
