Description of problem:
Onboard LSI MegaRAID in FSC Primergy TX600S2 (4 dual-core Xeons, 32GB of RAM) hangs about once a month with the following messages in the message log:

Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094823:52[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094868 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094868:20[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094869 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094869:19[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094870 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094870:38[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: resetting the host...
Jun 11 17:58:42 fin2 kernel: megaraid: 18 outstanding commands. Max wait 180 sec
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:180
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:175
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:170
Jun 11 17:58:42 fin2 kernel: megaraid mbox: reset sequence completed successfully
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094959 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094959:38[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094960 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094960:19[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094961 cmd=2a <c=2 t=0 l=0>

After that the server is almost dead: every disk operation, even a very simple one, takes several seconds, but it seems that at least the logs are successfully written to the disk. The problem is solved by resetting the machine.

Version-Release number of selected component (if applicable):
Kernels: 2.6.9-22.ELhugemem & 2.6.9-34.ELhugemem
ROMB fw version:[516D] bios version:[H430]

How reproducible:
Hard. Without any specific reason this occurs about once a month.

Additional info:
Similar problems are reported on various servers using this driver.
Created attachment 130851
Message log containing the messages related to the problem
I am checking with LSI Logic on this.
Sumant,

This same error has been seen on 2.6.9-42.ELsmp. Do you think this problem may be related specifically to this hardware/firmware?

===>RAID bus controller: LSI Logic / Symbios Logic MegaRAID (rev 07)
    Subsystem: Fujitsu Siemens Computer GmbH FSC MegaRAID PCI Express ROMB
===>megaraid: fw version:[516D] bios version:[H430]
From the information it is difficult to tell whether this is specific to HW/FW. From the log it seems the FW is repeatedly unable to complete commands within the timeout period. Is this seen in only one setup or in more than one? Are there any background FW operations (for example, Check Consistency) running on the adapter when this happens? Can we get more details on the RAID configuration of the system?
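To check how often the firmware misses the timeout, the megaraid messages in a message log can be tallied by type. The following is only a minimal sketch: it assumes syslog lines in the same format as the excerpt in the description, and the category names (abort_request, abort_result, host_reset, reset_wait, reset_done) are made up for illustration and are not part of the driver.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094868:20[255:0], fw owner
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<msg>megaraid.*)$"
)


def classify(msg):
    """Map a megaraid kernel message to an illustrative category name."""
    if msg.startswith("megaraid: aborting-"):
        return "abort_request"
    if msg.startswith("megaraid abort:"):
        return "abort_result"
    if "resetting the host" in msg:
        return "host_reset"
    if "reset sequence completed" in msg:
        return "reset_done"
    if "Wait for" in msg or "outstanding commands" in msg:
        return "reset_wait"
    return "other"


def summarize(lines):
    """Count megaraid messages per category; non-megaraid lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[classify(match.group("msg"))] += 1
    return counts


# A few lines copied from the log in the description.
SAMPLE = """\
Jun 11 17:58:42 fin2 kernel: megaraid: aborting-241094868 cmd=2a <c=2 t=0 l=0>
Jun 11 17:58:42 fin2 kernel: megaraid abort: 241094868:20[255:0], fw owner
Jun 11 17:58:42 fin2 kernel: megaraid: resetting the host...
Jun 11 17:58:42 fin2 kernel: megaraid: 18 outstanding commands. Max wait 180 sec
Jun 11 17:58:42 fin2 kernel: megaraid mbox: Wait for 18 commands to complete:180
Jun 11 17:58:42 fin2 kernel: megaraid mbox: reset sequence completed successfully
"""

if __name__ == "__main__":
    for category, count in sorted(summarize(SAMPLE.splitlines()).items()):
        print(category, count)
```

Running summarize() over the full message log instead of SAMPLE would show whether aborts and host resets cluster around particular times, for example during a background firmware operation.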
Sumant,

Thanks for the feedback. The problem has been seen on this system running -22.EL and -34.EL with 32GB RAM, and we have another report of this occurring on -42.EL with 8GB RAM. I don't have hands-on access to the system, but I will gather more details on the RAID configuration and debug further.
Below you find our RAID configuration. We have seen this problem on RX600S2 with 4 CPUs and 8GB RAM with all kernels mentioned above.

#/usr/local/bin/megarc -dispCfg -a0

**********************************************************************
MEGARC MegaRAID Configuration Utility(LINUX)-1.11(12-07-2004)
By LSI Logic Corp.,USA
**********************************************************************
[Note: For SATA-2, 4 and 6 channel controllers, please specify
Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]

Type ? as command line arg for help

Finding Devices On Each MegaRAID Adapter...
Scanning Ha 0, Chnl 1 Target 15

**********************************************************************
Existing Logical Drive Information
By LSI Logic Corp.,USA
**********************************************************************
[Note: For SATA-2, 4 and 6 channel controllers, please specify
Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]

Logical Drive : 0( Adapter: 0 ): Status: OPTIMAL
---------------------------------------------------
SpanDepth :01    RaidLevel: 5    RdAhead : Adaptive    Cache: CachedIo
StripSz   :064KB Stripes  : 3    WrPolicy: WriteBack

Logical Drive 0 : SpanLevel_0 Disks
Chnl  Target  StartBlock  Blocks      Physical Target Status
----  ------  ----------  ------      ----------------------
0     00      0x00000000  0x0878c000  ONLINE
0     01      0x00000000  0x0878c000  ONLINE
1     00      0x00000000  0x0878c000  ONLINE
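As a rough sanity check of the configuration above, the Blocks column can be converted into capacities. This is only a sketch under two assumptions that the megarc output does not state: that a block is 512 bytes, and that the RAID 5 logical drive uses the usual (n - 1) disks' worth of space for data.

```python
# Values taken from the megarc -dispCfg output above.
BLOCKS_PER_DISK = 0x0878C000  # "Blocks" column, identical for all three disks
NUM_DISKS = 3                 # three physical targets listed for logical drive 0

# Assumption: megarc reports 512-byte blocks (not stated in the output).
BLOCK_SIZE_BYTES = 512


def raid5_usable_bytes(num_disks, blocks_per_disk, block_size=BLOCK_SIZE_BYTES):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (num_disks - 1) * blocks_per_disk * block_size


per_disk_bytes = BLOCKS_PER_DISK * BLOCK_SIZE_BYTES
usable_bytes = raid5_usable_bytes(NUM_DISKS, BLOCKS_PER_DISK)

if __name__ == "__main__":
    print(f"per disk: {per_disk_bytes / 1e9:.2f} GB")
    print(f"usable:   {usable_bytes / 1e9:.2f} GB ({usable_bytes / 2**30:.2f} GiB)")
```

Under those assumptions each disk contributes roughly 72.8 GB, which would be consistent with 73 GB class drives, and the logical drive offers roughly 145.5 GB.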
Sumant, any comments? The door is closing for RHEL 4.5 on this, and we need to take a closer look at it.
I could not duplicate the issue. From the log we see that the FW is not completing commands within the timeout period. Can we ask the user to upgrade the firmware in the system?
At this point there isn't time to address this in RHEL 4.5 since the problem cannot be replicated by LSI. David, from Comment #6, did you end up finding out more details regarding this issue?
Andrius, no, the customer never responded. I see Gerhard Hagn reported his RAID configuration in Comment #7. Gerhard, can you verify that you are running the latest firmware?
We upgraded to the current Firmware from Siemens-Fujitsu (megaraid: fw version: [516G] bios version:[H430]) about 3 weeks ago. So far things look good. I hope it remains that way.