Bug 195027 - Onboard LSI Megaraid in FSC Primergy TX600S2 hangs about once a month
Summary: Onboard LSI Megaraid in FSC Primergy TX600S2 hangs about once a month
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Chip Coldwell
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 217099
 
Reported: 2006-06-14 14:05 UTC by Wojciech Wrobel
Modified: 2007-11-30 22:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-03-09 19:57:22 UTC
Target Upstream Version:
Embargoed:


Attachments
Message log containing the messages related to the problem (418.22 KB, image/png)
2006-06-14 14:05 UTC, Wojciech Wrobel

Description Wojciech Wrobel 2006-06-14 14:05:11 UTC
Description of problem:

Onboard LSI Megaraid in FSC Primergy TX600S2 (4 dualcore Xeons, 32GB of RAM) 
hangs about once a month with following messages in the message log:

Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094823:52[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094868 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094868:20[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094869 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094869:19[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094870 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094870:38[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: resetting the host...
Jun 11 17:58:42 fin2 kernel: megaraid: 18 outstanding commands. Max wait 180 sec
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:180
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:175
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:170
Jun 11 17:58:42 fin2 kernel: megaraid mbox: reset sequence completed successfully
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094959 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094959:38[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094960 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094960:19[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094961 cmd=2a <c=2 t=0 l=0>

After that the server is almost dead: every disk operation, even a very simple
one, takes several seconds, although it seems that at least the logs are still
written to disk successfully.

The problem goes away only after resetting the machine.
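
For anyone triaging a similar message log, the following is a minimal sketch (not part of the original report) that counts the abort and reset events shown above. The regular expressions are modeled only on the lines quoted in this description; other driver versions may log differently, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this bug (an assumption:
# other megaraid driver versions may format these messages differently).
ABORT_RE = re.compile(r"megaraid abort: (\d+):(\d+)\[(\d+):(\d+)\], (\w+) owner")
RESET_RE = re.compile(r"megaraid: resetting the host")
OUTSTANDING_RE = re.compile(r"megaraid: (\d+) outstanding commands")


def summarize(lines):
    """Return (event counts, largest 'outstanding commands' value seen)."""
    counts = Counter()
    max_outstanding = 0
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts["abort"] += 1
            # Track who owned the command when the abort was attempted.
            counts["owner:" + match.group(5)] += 1
            continue
        if RESET_RE.search(line):
            counts["reset"] += 1
            continue
        match = OUTSTANDING_RE.search(line)
        if match:
            max_outstanding = max(max_outstanding, int(match.group(1)))
    return dict(counts), max_outstanding


# Lines copied from the description above.
SAMPLE = [
    "Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094823:52[255:0], fw owner",
    "Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094868 cmd=2a <c=2 t=0 l=0>",
    "Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094868:20[255:0], fw owner",
    "Jun 11 17:58:42 fin2 kernel: megaraid: resetting the host...",
    "Jun 11 17:58:42 fin2 kernel: megaraid: 18 outstanding commands. Max wait 180 sec",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on the system's syslog configuration.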


Version-Release number of selected component (if applicable):

Kernels: 2.6.9-22.ELhugemem & 2.6.9-34.ELhugemem
ROMB fw version:[516D] bios version:[H430]


How reproducible:

Hard - it occurs about once a month with no identifiable trigger.



Additional info:

Similar problems have been reported on various servers using this driver.

Comment 1 Wojciech Wrobel 2006-06-14 14:05:12 UTC
Created attachment 130851
Message log containing the messages related to the problem

Comment 2 Tom Coughlan 2006-06-15 13:27:17 UTC
I am checking with LSI Logic on this. 

Comment 4 David Milburn 2007-01-23 23:21:34 UTC
Sumant,

This same error has been seen on 2.6.9-42.ELsmp. Do you think this problem may be
related specifically to this hardware/firmware?

===>RAID bus controller: LSI Logic / Symbios Logic MegaRAID (rev 07) 
    Subsystem: Fujitsu Siemens Computer GmbH FSC MegaRAID PCI Express ROMB

===>megaraid: fw version:[516D] bios version:[H430]


Comment 5 Sumant Patro 2007-01-24 03:47:56 UTC
From the information provided, it is difficult to tell whether this is specific
to the HW/FW.
From the log, it seems the FW does not reliably complete commands within the
timeout period.
Is this seen on only one setup or on more than one?
Are there any background FW operations (for example, Check Consistency) running
on the adapter when this happens?
Can we get more details on the RAID configuration of the system?



Comment 6 David Milburn 2007-01-24 21:24:56 UTC
Sumant,

Thanks for the feedback. The problem has been seen on this system running -22.EL
and -34.EL with 32GB RAM, and we have another report of it occurring on -42.EL
with 8GB RAM. I don't have hands-on access to the system, but I will gather more
details on the RAID configuration and debug further.

Comment 7 Gerhard Hagn 2007-01-25 12:15:10 UTC
Below is our RAID configuration. We have seen this problem on an RX600S2 with
4 CPUs and 8GB RAM with all kernels mentioned above.

#/usr/local/bin/megarc -dispCfg -a0


        **********************************************************************
              MEGARC MegaRAID Configuration Utility(LINUX)-1.11(12-07-2004)
              By LSI Logic Corp.,USA
        **********************************************************************
          [Note: For SATA-2, 4 and 6 channel controllers, please specify
          Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]

        Type ? as command line arg for help


        Finding Devices On Each MegaRAID Adapter...
        Scanning Ha 0, Chnl 1 Target 15


        **********************************************************************
              Existing Logical Drive Information
              By LSI Logic Corp.,USA
        **********************************************************************
          [Note: For SATA-2, 4 and 6 channel controllers, please specify
          Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]


          Logical Drive : 0( Adapter: 0 ):  Status: OPTIMAL
        ---------------------------------------------------
        SpanDepth :01     RaidLevel: 5  RdAhead : Adaptive  Cache: CachedIo
        StripSz   :064KB   Stripes  : 3  WrPolicy: WriteBack

        Logical Drive 0 : SpanLevel_0 Disks
        Chnl  Target  StartBlock   Blocks      Physical Target Status
        ----  ------  ----------   ------      ----------------------
        0      00    0x00000000   0x0878c000   ONLINE
        0      01    0x00000000   0x0878c000   ONLINE
        1      00    0x00000000   0x0878c000   ONLINE
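
As a reading aid for the table above, here is a short sketch of the capacity arithmetic. It assumes the Blocks column counts 512-byte sectors and that the three listed disks form the whole RAID 5 set; neither is stated in the megarc output, and the helper name `raid5_usable_bytes` is made up for illustration.

```python
SECTOR_BYTES = 512            # assumption: one "block" is a 512-byte sector
BLOCKS_PER_DISK = 0x0878C000  # Blocks column from the table above
DISKS = 3                     # three ONLINE members listed for logical drive 0


def raid5_usable_bytes(blocks, n_disks, sector_bytes=SECTOR_BYTES):
    """RAID 5 spends one disk's worth of space on parity."""
    if n_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (n_disks - 1) * blocks * sector_bytes


if __name__ == "__main__":
    per_disk = BLOCKS_PER_DISK * SECTOR_BYTES
    usable = raid5_usable_bytes(BLOCKS_PER_DISK, DISKS)
    print(f"per disk: {per_disk / 1e9:.2f} GB")
    print(f"usable:   {usable / 1e9:.2f} GB")
```

Under those assumptions each member contributes about 72.8 GB and the logical drive offers about 145.5 GB of usable space.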


Comment 8 Andrius Benokraitis 2007-02-22 20:52:37 UTC
Sumant, any comments? The door is closing for RHEL 4.5 on this, and we need to
take a closer look at it.

Comment 10 Sumant Patro 2007-02-23 00:08:51 UTC
I could not duplicate the issue.

From the log we see that the FW is not completing commands within the timeout
period.
Can we ask the user to upgrade the firmware on the system?




Comment 11 Andrius Benokraitis 2007-03-02 06:29:49 UTC
At this point there isn't time to address this in RHEL 4.5 since the problem
cannot be replicated by LSI. 

David, from Comment #6, did you end up finding out more details regarding this
issue?

Comment 12 David Milburn 2007-03-02 15:45:44 UTC
Andrius, no, the customer never responded. I see Gerhard Hagn reported his
RAID configuration in Comment #7. Gerhard, can you verify that you are running
the latest firmware?

Comment 13 Gerhard Hagn 2007-03-05 08:12:14 UTC
We upgraded to the current firmware from Fujitsu Siemens
(megaraid: fw version:[516G] bios version:[H430]) about 3 weeks ago. So far
things look good. I hope it remains that way.
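
For checking other machines against the firmware levels mentioned in this thread, a minimal sketch follows. It parses the `fw version:[...] bios version:[...]` string quoted in the comments above. Treating 516D as suspect is an assumption drawn only from this bug (hangs reported on 516D, none so far on 516G), not a vendor statement, and the names `parse_versions` and `is_suspect` are made up for illustration.

```python
import re

# Matches the version string as quoted in this bug, e.g.
# "megaraid: fw version:[516D] bios version:[H430]".
FW_RE = re.compile(r"fw version:\s*\[(\w+)\]\s*bios version:\s*\[(\w+)\]")

# Assumption from this thread only: hangs were reported on 516D.
SUSPECT_FW = {"516D"}


def parse_versions(line):
    """Return (firmware, bios) from a driver banner line, or None."""
    match = FW_RE.search(line)
    return (match.group(1), match.group(2)) if match else None


def is_suspect(line):
    """True if the line reports a firmware level listed in SUSPECT_FW."""
    versions = parse_versions(line)
    return versions is not None and versions[0] in SUSPECT_FW


if __name__ == "__main__":
    for banner in (
        "megaraid: fw version:[516D] bios version:[H430]",
        "megaraid: fw version:[516G] bios version:[H430]",
    ):
        print(parse_versions(banner), is_suspect(banner))
```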

