Bug 167166

Summary: megaraid2 kernel panic
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: ia32e
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: David Kostal <david.kostal>
Assignee: Tom Coughlan <coughlan>
QA Contact: Brian Brock <bbrock>
CC: petrides
Doc Type: Bug Fix
Last Closed: 2005-10-11 20:57:34 UTC
Attachments (flags: none on all):
* kernel-2.4.21-35.EL+megaraid2-v2.10.10.1-1dkms crash
* kernel-2.4.21-32.0.1.EL + RH megaraid v2.10.8.2-RH1 I/O error
* kernel 2.4.21-35.EL + RH megaraid driver
* Kernel-2.4.21-32.0.1.EL with dell's megaraid2-v2.10.10.1-1dkms crash dump
* 2.4.21-32.0.1.EL error with Patrol Read disabled

Description David Kostal 2005-08-31 09:10:17 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050524 CentOS/1.0.4-1.4.1.centos3 Firefox/1.0.4

Description of problem:
After running bonnie++ in a loop for several hours, the kernel starts to report disk I/O errors or panics.

Version-Release number of selected component (if applicable):
 2.4.21-35.EL, 2.4.21-32.0.1.EL and kernel-2.4.21-27.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. Run bonnie++ in a loop and wait a few hours
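
One way to run the benchmark in a loop is sketched below. This is only a minimal illustration: the helper name run_in_loop, the pass count, and the directory and user in the example are assumptions and do not come from the report. Only the bonnie++ options -d (test directory) and -u (user to run as) are standard.

```shell
# run_in_loop PASSES COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly and prints a timestamp
# before each pass, so a later crash can be placed in time.
run_in_loop() {
    passes=$1
    shift
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "pass $i started at $(date -u '+%Y-%m-%d %H:%M:%S') UTC"
        # Stop at the first failing pass and report its exit status.
        "$@" || { echo "pass $i failed with status $?"; return 1; }
        i=$((i + 1))
    done
}

# Example (directory and user are assumptions):
#   run_in_loop 100 bonnie++ -d /mnt/sda1/bench -u nobody
```

Starting one such loop per logical drive matches the "one on sda, one on sdb" setup described under Additional info.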

  

Actual Results:  I/O errors on disk or kernel panic

Expected Results:  the system should keep running under sustained disk load

Additional info:

Host, OS info:
--------------
Dell PE 2850, 2x Xeon, 12 GB RAM, 1x PERC 4i/DC (1 logical drive, RAID 10, scsi1, sdb), 1x PERC 4e/DC (1 logical drive, RAID 10, scsi0, sda), 1x QLA2312 (no drives attached, scsi2)
BIOS A2

Red Hat Enterprise Linux ES 3 U5 x86_64, latest updates

DELL PERC drivers:
megaraid2-v2.10.10.1-1dkms
dkms-2.0.5-1

Kernel from RHEL ES Beta:
kernel-2.4.21-35.EL

Filesystems: ext3, no sw raid, no lvm

Reproducible: yes
With one or two bonnie++ instances running in a loop (one on sda, one on sdb), the kernel crashed after a few hours of benchmarking.
I also tried kernels 2.4.21-35.EL, 2.4.21-32.0.1.EL and kernel-2.4.21-27.EL with the default megaraid2 driver from RH.
With the RH driver, running bonnie++ causes SCSI I/O errors after some time.
Other symptoms:
Write throughput can drop to 100 blocks/sec (according to iostat) even when many dirty buffers are waiting to be written; bonnie++ then takes more than a night to finish. Reading from the same disk at the same time (dd if=/dev/sda1) seems to improve write performance again.
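
The write-throughput drop could be caught automatically by filtering iostat output. The sketch below assumes the "iostat -d" column layout of sysstat releases from that period (Device, tps, Blk_read/s, Blk_wrtn/s, ...), so that field 4 is blocks written per second. The function name flag_slow_writes and the threshold are hypothetical.

```shell
# flag_slow_writes DEVICE MIN_BLOCKS_PER_SEC
# Reads "iostat -d" style lines on stdin and prints a warning for every
# sample where DEVICE writes fewer blocks per second than the minimum.
# Assumption: field 4 is Blk_wrtn/s; newer sysstat versions may differ.
flag_slow_writes() {
    awk -v dev="$1" -v min="$2" '
        $1 == dev && ($4 + 0) < (min + 0) {
            printf "%s: %s blocks written/s is below %s\n", dev, $4, min
        }'
}

# Example (threshold is an assumption):
#   iostat -d 1 | flag_slow_writes sda 1000
```

The filter also fires while the disk is idle or only being read, so it is only meaningful while a write phase of the benchmark is running.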

Comment 1 Ernie Petrides 2005-08-31 18:48:41 UTC
Please attach console panic or oops output (capturing it with a serial
line if necessary).  Thanks in advance.

Comment 2 Tom Coughlan 2005-08-31 19:18:41 UTC
Also please post the I/O error messages you got while running the default
megaraid2 driver, as shipped by RH. This would be most helpful if you use
your most recent RH kernel (2.4.21-35.EL?). Thanks.

Comment 3 Ernie Petrides 2005-08-31 19:23:49 UTC
*** Bug 167167 has been marked as a duplicate of this bug. ***

Comment 4 David Kostal 2005-09-01 07:36:48 UTC
Created attachment 118334 [details]
kernel-2.4.21-35.EL+megaraid2-v2.10.10.1-1dkms crash

(partial) oops trace and messages from 2.4.21-35.EL+megaraid2-v2.10.10.1-1dkms

Comment 5 David Kostal 2005-09-01 07:39:47 UTC
kernel-2.4.21-32.0.1.EL+megaraid2-v2.10.10.1-1dkms has now been running for
23 hours without a crash. I'll boot with the RH megaraid2 driver and try to
reproduce a crash.

Comment 6 David Kostal 2005-09-01 07:51:17 UTC
Created attachment 118335 [details]
kernel-2.4.21-32.0.1.EL + RH megaraid v2.10.8.2-RH1 I/O error

Here is an I/O error from running kernel-2.4.21-32.0.1.EL + default RH
megaraid2 (v2.10.8.2-RH1). I have only one I/O error here, because at that time
I didn't have remote syslog set up yet. Also, all errors with the default
megaraid2 driver, as far as I have noticed, were only on sdb/scsi1, the
internal PERC 4i/DC.

Comment 7 David Kostal 2005-09-02 09:46:38 UTC
I wasn't able to reproduce the crash or the I/O error this time, even after
about 23 hours of testing with kernel-2.4.21-35.EL + RH megaraid driver. I am
continuing to test (now after a fresh reboot again).

However, what I see every time is a slowdown of disk operations.
After a reboot, both sda (on the PERC 4e/DC) and sdb (on the PERC 4i/DC) are
very fast: when running bonnie++, sda does about 50-110k blocks per second and
sdb about 40-50k bps (iostat 1). But if I run it first on one disk and then on
the second one, OR on both at the same time, performance on both drops to at
most 3-4k bps, and sometimes they do nothing for a very long time. Doing
'dd if=/dev/sda1 | cat - > /dev/null' (or sdb...) helps a bit for a short time
(e.g. from 100 bps to 5k bps), but throughput soon drops to almost nothing
again. If one disk becomes slow, the other becomes slow too.

When running bonnie++ on both disks simultaneously (after a reboot),
performance is not at the highest level either, but it is "acceptable" :-/ at
about 30-40k bps for both disks (counted together). After about 30 minutes it
drops to at most 3-4k bps anyway.

For example, last night I had 2 bonnie++ instances running, and both were still
running (one was already reading...) after 16 hours.

There is no background task running on the disks. Both raid controllers 

It might or might not be related to the crashing issue but this slow-down also
makes the machine unusable:(

The other problem I've seen is that interactive responsiveness is really
horrible with these disk benchmarks running. The machine is not swapping, but
shells might take up to 10 minutes to become responsive. However (tested only
once) it became much better when I executed
echo 1 > /proc/sys/vm/skip_mapped_pages
echo 1 10 15 > /proc/sys/vm/pagecache
echo "30 500 0 0 500 3000 80 50 0" > /proc/sys/vm/bdflush
as was advised in some other bugs here.
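
Because these writes replace the running kernel's VM settings, it can help to record the previous values first so they can be put back after the test. The sketch below does that for the three files named above. The function names and the backup directory are hypothetical, and the /proc directory is passed as an argument only to keep the helpers testable.

```shell
# save_tunables VM_DIR BACKUP_DIR
# Copies the current values of the three tunables into BACKUP_DIR.
# Files that do not exist on the running kernel are skipped.
save_tunables() {
    vm_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for name in skip_mapped_pages pagecache bdflush; do
        if [ -r "$vm_dir/$name" ]; then
            cat "$vm_dir/$name" > "$backup_dir/$name"
        fi
    done
}

# restore_tunables VM_DIR BACKUP_DIR
# Writes every saved value back to the matching file in VM_DIR.
restore_tunables() {
    vm_dir=$1
    backup_dir=$2
    for name in skip_mapped_pages pagecache bdflush; do
        if [ -r "$backup_dir/$name" ]; then
            cat "$backup_dir/$name" > "$vm_dir/$name"
        fi
    done
}

# Example (backup path is an assumption; needs root):
#   save_tunables /proc/sys/vm /root/vm-backup
#   ... apply the echo commands and run the benchmark ...
#   restore_tunables /proc/sys/vm /root/vm-backup
```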



Comment 8 David Kostal 2005-09-09 12:54:29 UTC
Hi again. Unfortunately I wasn't able to reproduce the crash in the last week.
Before I reported this, it crashed with every kernel I tried except one
(2.4.21-32.0.1.EL + megaraid2 from Dell). The only difference is that I now run
without hyperthreading, but in the past I had one crash without HT as well.

What remains are the performance problems. If I  boot with mem=3G there is no
performance degradation at all.

If I boot with mem=6G the speed goes down to 5-15k bps on sda (from 50-110k)
after short time.

However, this is with the controller set to CachedIO+WRTHRU. The machine also
spends 100% of its time in IOwait.

If I switch to CachedIO+WRBACK, there are no performance problems at all with
either 3G or 6G. I'll test it with 12G later. The machine also takes about
30-70 in System+irq-iowait time.



Comment 9 David Kostal 2005-09-11 11:15:25 UTC
Created attachment 118690 [details]
kernel 2.4.21-35.EL + RH megaraid driver

This is a console dump from a crash of 2.4.21-35.EL with the original RH
megaraid driver. The crash happened just a few hours after boot (2 bonnie++
running), and the only difference from the previous attempts was that both
logical drives (sda and sdb) were now set to WRBACK + CachedIO in the RAID BIOS.

Comment 10 David Kostal 2005-09-14 08:24:01 UTC
Created attachment 118791 [details]
Kernel-2.4.21-32.0.1.EL with dell's megaraid2-v2.10.10.1-1dkms crash dump

Kernel crash dump for kernel-2.4.21-32.0.1.EL with Dell's 
megaraid2-v2.10.10.1-1dkms

Comment 11 David Kostal 2005-09-15 12:11:52 UTC
Created attachment 118842 [details]
2.4.21-32.0.1.EL error with Patrol Read disabled

I turned off Patrol Read (as recommended by Dell support) and the original RH
kernel 2.4.21-32.0.1.EL froze (I wasn't able to type anything on the console,
the serial line or ssh) within 10 minutes and generated a lot of error messages
(see attachment). Shortly before it froze completely, I was able to observe
that:
* bonnie++ on sdb was running without any problem
* bonnie++ on sda was running but frozen
* on scsi0 (sda) there were 4 pending commands (/proc/megaraid/0/stat)
* the first few error messages appeared on the console
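
Since the machine froze within minutes, observations like the pending-command count are easy to miss. A minimal sketch for capturing them is to append timestamped copies of the driver's stat file to a log. The helper name snapshot_file, the log path and the interval are assumptions, and the sketch makes no assumption about the layout of /proc/megaraid/0/stat; it copies the file verbatim.

```shell
# snapshot_file SOURCE LOG
# Appends a UTC timestamp header followed by the current contents of
# SOURCE to LOG. Nothing is parsed; the file is copied as-is.
snapshot_file() {
    src=$1
    log=$2
    {
        echo "=== $(date -u '+%Y-%m-%d %H:%M:%S') UTC $src"
        cat "$src"
    } >> "$log"
}

# Example (log path and interval are assumptions):
#   while sleep 5; do
#       snapshot_file /proc/megaraid/0/stat /mnt/sdb1/megaraid0-stat.log
#   done
```

The log is more likely to survive if it lives on a disk other than the one being tested, or is shipped to a remote host.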

I'm now running 2.4.21-32.0.1.EL with Dell's megaraid2...

Comment 12 David Kostal 2005-10-11 07:56:13 UTC
The problem is solved by the new BIOS A02 for the PE2850. From the release notes:

* Added workaround for lockup resulting from the systems with 8GB RAM or more
and RAID storage controller potentially claiming inappropriate addresses.
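
Whether another machine falls into the "8GB RAM or more" range named in the release note can be checked from /proc/meminfo. This is a minimal sketch with hypothetical helper names. It assumes the MemTotal line is reported in kB, and because MemTotal is usually somewhat lower than the installed RAM, the threshold is left to the caller.

```shell
# mem_total_kb [FILE]
# Prints the MemTotal value from a meminfo-style file (default /proc/meminfo).
mem_total_kb() {
    awk '$1 == "MemTotal:" { print $2; exit }' "${1:-/proc/meminfo}"
}

# mem_at_least MIN_KB [FILE]
# Succeeds if MemTotal is at least MIN_KB, fails otherwise or if the
# MemTotal line is missing.
mem_at_least() {
    total=$(mem_total_kb "${2:-/proc/meminfo}")
    [ -n "$total" ] && [ "$total" -ge "$1" ]
}

# Example (threshold is an assumption, set a little below 8 GB):
#   mem_at_least 8000000 && echo "system is in the affected memory range"
```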

Thanks for help, you can close this bug now.

David

Comment 13 Ernie Petrides 2005-10-11 20:57:34 UTC
Closing as NOTABUG based on last comment.