Bug 195027 - Onboard LSI Megaraid in FSC Primergy TX600S2 hangs about once a month
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium, Severity: high
Assigned To: Chip Coldwell
QA Contact: Brian Brock
Depends On:
Blocks: 217099
Reported: 2006-06-14 10:05 EDT by Wojciech Wrobel
Modified: 2007-11-30 17:07 EST
CC List: 8 users

Doc Type: Bug Fix
Last Closed: 2007-03-09 14:57:22 EST

Attachments
Message log containing the messages related to the problem (418.22 KB, image/png)
2006-06-14 10:05 EDT, Wojciech Wrobel

Description Wojciech Wrobel 2006-06-14 10:05:11 EDT
Description of problem:

Onboard LSI Megaraid in FSC Primergy TX600S2 (4 dual-core Xeons, 32GB of RAM)
hangs about once a month with the following messages in the message log:

Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094823:52[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094868 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094868:20[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094869 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094869:19[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094870 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094870:38[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: resetting the host...
Jun 11 17:58:42 fin2 kernel: megaraid: 18 outstanding commands. Max wait 180 sec
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:180
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:175
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:170
Jun 11 17:58:42 fin2 kernel: megaraid mbox: reset sequence completed successfully
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094959 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094959:38[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094960 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094960:19[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094961 cmd=2a <c=2 t=0 l=0>

After that the server is almost dead: every disk operation, even a very simple
one, takes several seconds, although it seems that at least the logs are still
successfully written to the disk.

The problem is resolved by resetting the machine.
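
For anyone triaging similar reports, the abort/reset pattern in the excerpt above can be counted mechanically. The following is a minimal sketch, not part of the original report: the regular expressions are modelled only on the message shapes quoted here (other driver versions may word them differently), and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Patterns follow the message shapes quoted in this report only;
# treating them as stable across driver versions is an assumption.
ABORT_RE = re.compile(r"megaraid abort: (\d+):(\d+)\[(\d+):(\d+)\], (\w+) owner")
RESET_RE = re.compile(r"megaraid: resetting the host")
WAIT_RE = re.compile(r"megaraid: (\d+) outstanding commands\. Max wait (\d+)\s*sec")


def summarize(lines):
    """Count abort and reset events in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        if ABORT_RE.search(line):
            counts["aborts"] += 1
        elif RESET_RE.search(line):
            counts["resets"] += 1
        else:
            m = WAIT_RE.search(line)
            if m:
                # Track the largest number of commands left outstanding at a reset.
                counts["max_outstanding"] = max(counts["max_outstanding"], int(m.group(1)))
    return dict(counts)
```

It could be run over a log file with something like summarize(open("/var/log/messages")), assuming the messages are logged there one per line.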


Version-Release number of selected component (if applicable):

Kernels: 2.6.9-22.ELhugemem & 2.6.9-34.ELhugemem
ROMB fw version:[516D] bios version:[H430]


How reproducible:

Hard. The problem occurs about once a month, with no identifiable trigger.



Additional info:

Similar problems are reported on various servers using this driver.
Comment 1 Wojciech Wrobel 2006-06-14 10:05:12 EDT
Created attachment 130851
Message log containing the messages related to the problem
Comment 2 Tom Coughlan 2006-06-15 09:27:17 EDT
I am checking with LSI Logic on this. 
Comment 4 David Milburn 2007-01-23 18:21:34 EST
Sumant,

This same error has been seen on 2.6.9-42.ELsmp. Do you think this problem may be
related specifically to this hardware/firmware?

===>RAID bus controller: LSI Logic / Symbios Logic MegaRAID (rev 07) 
    Subsystem: Fujitsu Siemens Computer GmbH FSC MegaRAID PCI Express ROMB

===>megaraid: fw version:[516D] bios version:[H430]
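
When checking several hosts for the same firmware, the version banner quoted above can be pulled out of a log line. This is a small sketch under the assumption that the banner always has the "fw version:[...] bios version:[...]" shape shown in this report; parse_versions is a hypothetical helper name.

```python
import re

# Matches the version banner quoted in this report, e.g.
# "megaraid: fw version:[516D] bios version:[H430]".
VERSION_RE = re.compile(r"fw version:\s*\[(\w+)\]\s+bios version:\s*\[(\w+)\]")


def parse_versions(line):
    """Return (firmware, bios) strings, or None if the line has no banner."""
    m = VERSION_RE.search(line)
    return (m.group(1), m.group(2)) if m else None
```

The sketch only extracts the strings for equality checks; it makes no assumption about how the vendor orders firmware revisions.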
Comment 5 Sumant Patro 2007-01-23 22:47:56 EST
From the information provided it is difficult to tell whether this is specific to the HW/FW.
From the log it seems the FW is regularly unable to complete commands within the
timeout period.
Is this seen on only one setup or on more than one?
Are there any background FW operations (for example, Check Consistency) running on
the adapter when this happens?
Can we get more details on the RAID configuration of the system?

Comment 6 David Milburn 2007-01-24 16:24:56 EST
Sumant,

Thanks for the feedback. The problem has been seen on this system running -22.EL
and -34.EL with 32GB RAM, and we have another report of this occurring on -42.EL
with 8GB RAM. I don't have hands-on access to the system, but I will gather more
details on the RAID configuration and debug further.
Comment 7 Gerhard Hagn 2007-01-25 07:15:10 EST
Below is our RAID configuration. We have seen this problem on an RX600S2 with
4 CPUs and 8GB RAM with all kernels mentioned above.

#/usr/local/bin/megarc -dispCfg -a0


        **********************************************************************
              MEGARC MegaRAID Configuration Utility(LINUX)-1.11(12-07-2004)
              By LSI Logic Corp.,USA
        **********************************************************************
          [Note: For SATA-2, 4 and 6 channel controllers, please specify
          Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]

        Type ? as command line arg for help


        Finding Devices On Each MegaRAID Adapter...
        Scanning Ha 0, Chnl 1 Target 15


        **********************************************************************
              Existing Logical Drive Information
              By LSI Logic Corp.,USA
        **********************************************************************
          [Note: For SATA-2, 4 and 6 channel controllers, please specify
          Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]


          Logical Drive : 0( Adapter: 0 ):  Status: OPTIMAL
        ---------------------------------------------------
        SpanDepth :01     RaidLevel: 5  RdAhead : Adaptive  Cache: CachedIo
        StripSz   :064KB   Stripes  : 3  WrPolicy: WriteBack

        Logical Drive 0 : SpanLevel_0 Disks
        Chnl  Target  StartBlock   Blocks      Physical Target Status
        ----  ------  ----------   ------      ----------------------
        0      00    0x00000000   0x0878c000   ONLINE
        0      01    0x00000000   0x0878c000   ONLINE
        1      00    0x00000000   0x0878c000   ONLINE
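
As a sanity check on the listing above, the Blocks column can be converted into a capacity figure. This is a rough sketch that assumes 512-byte blocks, which the megarc output does not state; the function name raid5_usable_bytes is hypothetical.

```python
BLOCK_SIZE = 512  # bytes per block; an assumption, not stated in the megarc output


def raid5_usable_bytes(blocks_per_disk, disks, block_size=BLOCK_SIZE):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disks - 1) * blocks_per_disk * block_size


blocks = 0x0878C000  # Blocks column from the listing above
per_disk = blocks * BLOCK_SIZE
usable = raid5_usable_bytes(blocks, disks=3)
print(f"per disk: {per_disk / 1e9:.1f} GB, usable: {usable / 1e9:.1f} GB")
```

Under that assumption each of the three members contributes about 72.8 GB, giving roughly 145.5 GB usable for the RAID 5 logical drive.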
Comment 8 Andrius Benokraitis 2007-02-22 15:52:37 EST
Sumant, any comments? The door is closing for RHEL 4.5 on this, and we need to take
a closer look at it.
Comment 10 Sumant Patro 2007-02-22 19:08:51 EST
I could not duplicate the issue.

From the log we see that the FW is not completing commands within the timeout period.
Can we ask the user to upgrade the firmware in the system?


Comment 11 Andrius Benokraitis 2007-03-02 01:29:49 EST
At this point there isn't time to address this in RHEL 4.5 since the problem
cannot be replicated by LSI. 

David, from Comment #6, did you end up finding out more details regarding this
issue?
Comment 12 David Milburn 2007-03-02 10:45:44 EST
Andrius, no, the customer never responded. I see Gerhard Hagn reported his
RAID configuration in Comment #7. Gerhard, can you verify that you are running
the latest firmware?
Comment 13 Gerhard Hagn 2007-03-05 03:12:14 EST
We upgraded to the current firmware from Siemens-Fujitsu (megaraid: fw version:[516G]
bios version:[H430]) about 3 weeks ago. So far things look good. I hope
it remains that way.
