Bug 220368 - [RHEL5 RC Snapshot3] MegaRAID SAS becomes unoperational after smartctl access
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All Linux
Priority: high
Severity: high
Assigned To: Tom Coughlan
QA Contact: Brian Brock
Depends On:
Blocks: 228988 243319 217103 227613
Reported: 2006-12-20 13:43 EST by Issue Tracker
Modified: 2007-11-30 17:07 EST
Last Closed: 2007-03-27 12:40:36 EDT
Description Issue Tracker 2006-12-20 13:43:48 EST
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2006-12-20 13:44:01 EST
Description of problem:
  When we run smartctl against a hard disk attached to a MegaRAID SAS controller, the disk goes offline and becomes inoperable.

Version-Release number of selected component:
  RHEL5 Server RC Snapshot3
    kernel version: 2.6.18-1.2839.el5
    smartctl version: 5.36-3.1.el5

How reproducible:
  100%

Steps to Reproduce:
  1. Run smartctl against /dev/sda, which contains the Linux system.
        # smartctl -a /dev/sda

  2. Wait for a while.

Actual results:
  1. We get the following messages after running smartctl.

     smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
     Home page is http://smartmontools.sourceforge.net/

     Device: LSI      MegaRAID 8300XLP Version: 2.02
     Device type: disk
     Local Time is: Fri Dec 15 18:31:57 2006 JST
     Device does not support SMART

     Error Counter logging not supported
     Device does not support Self Test logging

  2. After waiting for a while, we keep getting the following messages, and we cannot stop them.

     megasas: [ 0]waiting for 25 commands to complete
     megasas: [ 5]waiting for 25 commands to complete
     megasas: [10]waiting for 25 commands to complete
     megasas: [15]waiting for 25 commands to complete
     megasas: [20]waiting for 25 commands to complete
     megasas: [25]waiting for 25 commands to complete
                      ...

  3. If we run the shutdown command, we keep getting the following messages, and we cannot stop these either.

     megasas: cannot recover from previous reset failures
     end_request: I/O error, dev sda, sector 46755821
     Buffer I/O error on device sda2, logical block 5818372
     lost page write due to I/O error on sda2
     Buffer I/O error on device sda2, logical block 5818373
                     ...
     end_request: I/O error, dev sda, sector 7024621
     Aborting journal on device sda2.
     end_request: I/O error, dev sda, sector 7024653
                     ...
     EXT3-fs error (device sda2) in ext3_dirty_inode: Journal has aborted
     sd 0:2:0:0: rejecting I/O to offline device
     ext3_abort called.
     EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
     Remounting filesystem read-only
     end_request: I/O error, dev sda, sector 7024861
     end_request: I/O error, dev sda, sector 7024893
     EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
     EXT3-fs error (device sda2) in ext3_reserve_inode_write: Journal has aborted
     end_request: I/O error, dev sda, sector 7024965
                   ...
     Buffer I/O error on device sda2, logical block 1
     lost page write due to I/O error on sda2
     sd 0:2:0:0: rejecting I/O to offline device
     sd 0:2:0:0: rejecting I/O to offline device

  4. As a result, the hard disk goes offline. We cannot run even the ps command, and we get the following message.
    #ps
    -bash: ps: command not found
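
The repeating megasas wait messages in item 2 can be checked mechanically. The following is a minimal sketch, not part of the driver or of smartmontools: the names `megasas_wait`, `parse_megasas_wait` and `megasas_looks_stuck` are hypothetical, and reading the bracketed number as an elapsed-time counter is an assumption based only on how it increases in the log above.

```c
#include <stdio.h>
#include <string.h>

/* One parsed "megasas: [ N]waiting for M commands to complete" line.
 * Hypothetical helper type, for illustration only. */
struct megasas_wait {
    int elapsed;      /* bracketed counter as printed (assumed: elapsed time) */
    int outstanding;  /* commands the driver is still waiting for */
};

/* Returns 1 and fills *out if the line contains a wait message, else 0. */
int parse_megasas_wait(const char *line, struct megasas_wait *out)
{
    const char *p = strstr(line, "megasas: [");
    if (p == NULL)
        return 0;
    /* %d skips the padding blank in "[ 0]". */
    return sscanf(p, "megasas: [%d]waiting for %d commands to complete",
                  &out->elapsed, &out->outstanding) == 2;
}

/* Returns 1 if the outstanding count never drops across n samples (n >= 2),
 * which is the pattern shown in the report; returns 0 otherwise. */
int megasas_looks_stuck(const struct megasas_wait *w, int n)
{
    if (n < 2)
        return 0;
    for (int i = 1; i < n; i++) {
        if (w[i].outstanding < w[i - 1].outstanding)
            return 0;
    }
    return 1;
}
```

Feeding the lines from item 2 through `parse_megasas_wait` gives a constant outstanding count of 25, so `megasas_looks_stuck` would report the controller as not draining its commands.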

Expected results:
 Smartctl may print error messages, but should not make the device inoperable.

Hardware info:
  Reproduction of this problem is confirmed on the following platform.

   Model   : Express5800/120Rg-1
   Cpu     : Intel(R) Xeon(TM) CPU 3.20GHz x 1
   Mem     : 2GB
   kernel  : 2.6.18-1.2839
   HBA    : LSI Logic MegaRAID 8300XLP

Business impact:
  If customers run smartctl on systems using MegaRAID SAS, all services on the system stop, which is a critical problem.

This event sent from IssueTracker by dmilburn  [Support Engineering Group]
 issue 109733
Comment 9 David Milburn 2007-03-12 14:16:28 EDT
After issuing "smartctl -a", we see the SCSI error handler wake up and send a
bus device reset (BDR) to the device; the reset is handled by
megasas_generic_reset(). The message printed by this function shows cmd=37,
i.e. opcode 0x37, which is the optional "read defect data" command.

<Mar/12 12:29 pm>nec-em17.rhts.boston.redhat.com login: Error handler scsi_eh_0
waking up
<Mar/12 12:29 pm>Total of 2 commands on 1 devices require eh work
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd:0xffff81012eaeb980
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd failed:0xffff81012eaeb980
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd:0xffff81012eaeb200
<Mar/12 12:29 pm>scsi_eh_0: aborting cmd failed:0xffff81012eaeb200
<Mar/12 12:29 pm>scsi_eh_0: Sending BDR sdev: 0xffff81012f394800
<Mar/12 12:29 pm>sd 0:2:0:0: megasas: RESET -172560 cmd=37
<Mar/12 12:29 pm>megasas: [ 0]waiting for 2 commands to complete
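
To read the "cmd=" field of these RESET lines more quickly, a small lookup can help. This is only a sketch: the function names are hypothetical, the table covers just the opcodes that come up in this report plus LOG SENSE (0x4d), and it assumes the driver prints the opcode in hex without a prefix, as the 0x37 reading above suggests.

```c
#include <stdio.h>
#include <string.h>

/* Map a SCSI opcode to its standard command name.
 * Deliberately tiny: only opcodes relevant to this report. */
const char *scsi_opcode_name(unsigned int opcode)
{
    switch (opcode) {
    case 0x03: return "REQUEST SENSE";
    case 0x1a: return "MODE SENSE(6)";
    case 0x2a: return "WRITE(10)";
    case 0x37: return "READ DEFECT DATA(10)";
    case 0x4d: return "LOG SENSE";
    default:   return "unknown";
    }
}

/* Extract the opcode from a line such as
 * "sd 0:2:0:0: megasas: RESET -172560 cmd=37".
 * Returns the opcode, or -1 if the line has no "cmd=" field. */
int megasas_reset_opcode(const char *line)
{
    const char *p = strstr(line, "cmd=");
    unsigned int op;

    if (p == NULL || sscanf(p, "cmd=%x", &op) != 1)
        return -1;
    return (int)op;
}
```

For the line above, `megasas_reset_opcode` yields 0x37, which `scsi_opcode_name` maps to READ DEFECT DATA(10).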
Comment 10 David Milburn 2007-03-15 15:13:09 EDT
I commented out the portion of smartctl code to get the defect data, but system
still gets into the bad state:

        /*
        if (SCSI_PT_DIRECT_ACCESS == peripheral_type) {
            scsiPrintGrownDefectListLen(fd);
            if (gSeagateCacheLPage)
                scsiPrintSeagateCacheLPage(fd);
            if (gSeagateFactoryLPage)
                scsiPrintSeagateFactoryLPage(fd);
        }
        */

# ./smartctl -a /dev/sda
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: LSI      MegaRAID 8300XLP Version: 2.02
Device type: disk
Local Time is: Thu Mar 15 14:57:39 2007 EDT
Device does not support SMART

Error Counter logging not supported
Device does not support Self Test logging

===Here is what should happen===
# smartctl -a /dev/sda
smartctl version 5.33 [x86_64-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: LSILOGIC 1030 IM       IM Version: 1000
Device type: disk
Local Time is: Thu Mar 15 15:55:34 2007 EDT
Device does not support SMART

Error Counter logging not supported

Error Events logging not supported
Device does not support Self Test logging

Comment 11 David Milburn 2007-03-15 15:28:23 EDT
Console output for Comment#10 shows a mode sense (0x1a) command:

login: sd 0:2:0:0: megasas: RESET -39422 cmd=1a
megasas: [ 0]waiting for 11 commands to complete
  .
  .
  .
Comment 12 David Milburn 2007-03-15 17:44:53 EDT
If con->checksmart is set to FALSE the system doesn't get into the bad state.

When set, this option should send "smartctl -a" down the following path, which
is what gets the controller into the bad state.

scsiGetSmartData() -> scsiCheckIE() (read informational exception log page) 

=======smartctl.c========
    case 'a':
      con->driveinfo          = TRUE;
      con->checksmart         = FALSE;
      con->generalsmartvalues = TRUE;
      con->smartvendorattrib  = TRUE;
      con->smarterrorlog      = TRUE;
      con->smartselftestlog   = TRUE;
      con->selectivetestlog   = TRUE;
      break;
========================
Comment 13 David Milburn 2007-03-16 16:29:57 EDT
After setting con->checksmart back to TRUE, I traced into scsiCheckIE(). Since
hasIELogPage is 0, the code does not actually issue the LOG SENSE to read the
informational exception page.

scsiCheckIE: hasIELogPage 0 IE_LPAGE 0x2f <=====

Instead, the scsiCheckIE function issues a Request Sense, and this is what
puts the device into the weird state. If I comment out this portion,
there is no problem.

    if (0 == sense_info.asc) {
        /* ties in with MRIE field of 6 in IEC mode page (0x1c) */
        if ((err = scsiRequestSense(device, &sense_info))) {
            pout("Request Sense failed, [%s]\n", scsiErrString(err));
            return err;
        }
    }

scsiRequestSense: cdb[0] 0x3 cbd[4] 18
scsiRequestSense: io_hdr.cmnd_len 6 io_hdr.max_sense_len 32 io_hdr.timeout 6
scsiRequestSense: io_hdr.dxfer_dir 1 io_hdr.dxfer_len 18
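
For reference, the command traced above is small enough to write out by hand. The sketch below builds a 6-byte REQUEST SENSE CDB matching the trace (opcode 0x03 in byte 0, allocation length 18 in byte 4). It is not the smartmontools code; `build_request_sense_cdb` is a hypothetical name, and treating the traced 18 as a decimal byte count is an assumption. It only fills a buffer and does not send anything to a device.

```c
#include <stdint.h>
#include <string.h>

#define REQUEST_SENSE_OPCODE  0x03
#define REQUEST_SENSE_CDB_LEN 6

/* Fill a 6-byte REQUEST SENSE CDB and return its length.
 * alloc_len is the number of sense bytes the initiator is prepared to
 * receive; the trace in this comment shows 18. */
size_t build_request_sense_cdb(uint8_t cdb[REQUEST_SENSE_CDB_LEN],
                               uint8_t alloc_len)
{
    memset(cdb, 0, REQUEST_SENSE_CDB_LEN);  /* reserved and control bytes = 0 */
    cdb[0] = REQUEST_SENSE_OPCODE;          /* operation code */
    cdb[4] = alloc_len;                     /* allocation length */
    return REQUEST_SENSE_CDB_LEN;
}
```

Calling `build_request_sense_cdb(cdb, 18)` gives the bytes 03 00 00 00 12 00, which lines up with the cmnd_len of 6 and dxfer_len of 18 in the trace.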

Comment 17 Sumant Patro 2007-03-23 13:13:52 EDT
Kindly let me know the Firmware version of the adapter.

Could you also check whether a newer FW is available for the adapter and try
with it?

Comment 18 David Milburn 2007-03-23 13:15:32 EDT
Sumant,

We are seeing a RHEL5 (2.6.18-8.el5) system with an LSI Logic MegaRAID 8300XLP
run without problems until "smartctl -a /dev/sda" is executed. The system then
becomes inoperable: we see a reset, then the driver waiting for commands to
complete, and eventually filesystem errors, and the device is marked offline
(see Comment#1).

megasas: RESET -9459 cmd=2a
megasas: [ 0]waiting for 25 commands to complete
megasas: [ 5]waiting for 25 commands to complete

This is also reproducible on an i686 system running 2.6.9-42.ELsmp.

If I comment out the above smartctl code (Comment#13) to prevent the request
sense then the system doesn't get into this bad state. Do you have any thoughts
on this? Thank you.
 
Comment 19 Sumant Patro 2007-03-23 13:23:00 EDT
I suspect the FW is not handling the cmd correctly, and that the issue is not
OS dependent.

The latest FW may have already fixed it. Please let me know the FW version you
are trying with and I will verify and get back to you.
Comment 20 David Milburn 2007-03-23 14:20:44 EDT
Sumant,

Here is the firmware version:
-------------------------------------------------------------------------
LSI MegaRAID SAS-MFI BIOS Version MT25 (Build March 06, 2006)
Copyright(c) 2006 LSI Logic Corporation

HA -0 (Bus 8 Dev 3) MegaRAID SAS 8300XLP
FW package: 5.0.1-0032
-------------------------------------------------------------------------

It does look like there is an updated firmware 5.1.1-0039.
Comment 21 Sumant Patro 2007-03-23 20:07:31 EDT

I could not locate the package 5.0.1.0032 today.

I verified with package 5.1.1.0033. Did not see the issue.

So, kindly test with the package 5.1.1-0039.
Comment 22 David Milburn 2007-03-27 11:24:11 EDT
Sumant,

We found that updating the firmware to 5.1.1-0020 corrected the problem. After
issuing the "smartctl -a /dev/sda" the system is still functional.

Thanks a lot for looking into this problem.
Comment 23 Andrius Benokraitis 2007-03-27 12:40:36 EDT
Closing bug, as updating the firmware solves this issue, and is not a Red Hat
issue, AFAIK.
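
Since the resolution depends on the firmware package level, a version comparison can be sketched as follows. This is an illustration only: the helper names are hypothetical, the assumption that package versions order numerically field by field is not confirmed anywhere in this report, and nothing here establishes the earliest package that contains the fix. The parser accepts both the "5.0.1-0032" and "5.0.1.0032" spellings used in the comments.

```c
#include <stdio.h>

/* Firmware package version split into numeric fields (hypothetical type). */
struct fw_version {
    int major, minor, patch, build;
};

/* Parse "A.B.C-DDDD" or "A.B.C.DDDD". Returns 1 on success, 0 otherwise. */
int parse_fw_version(const char *s, struct fw_version *v)
{
    /* %*1[-.] consumes exactly one '-' or '.' without storing it. */
    return sscanf(s, "%d.%d.%d%*1[-.]%d",
                  &v->major, &v->minor, &v->patch, &v->build) == 4;
}

/* Returns -1, 0 or 1 if a is older than, equal to, or newer than b,
 * assuming plain numeric ordering of the four fields. */
int compare_fw_version(const struct fw_version *a, const struct fw_version *b)
{
    const int fa[4] = { a->major, a->minor, a->patch, a->build };
    const int fb[4] = { b->major, b->minor, b->patch, b->build };

    for (int i = 0; i < 4; i++) {
        if (fa[i] != fb[i])
            return fa[i] < fb[i] ? -1 : 1;
    }
    return 0;
}
```

Under these assumptions, the failing package 5.0.1-0032 compares as older than 5.1.1-0020, the package reported to work in Comment 22.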
