Red Hat Bugzilla – Bug 167166
megaraid2 kernel panic
Last modified: 2007-11-30 17:07:08 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050524 CentOS/1.0.4-1.4.1.centos3 Firefox/1.0.4
Description of problem:
After running bonnie++ in a loop for an extended period, the kernel starts to report disk I/O errors or panics.
Version-Release number of selected component (if applicable):
2.4.21-35.EL, 2.4.21-32.0.1.EL and kernel-2.4.21-27.EL
Steps to Reproduce:
1. Run bonnie++ and wait for a few hours
Actual Results: I/O errors on disk or kernel panic
Expected Results: system should run...
Host, OS info:
Dell PE 2850, 2x Xeon, 12GB RAM, 1x PERC 4i/DC (1 logical drive, RAID 10, scsi1, sdb), 1x PERC 4e/DC (1 logical drive, RAID 10, scsi0, sda), 1x QLA2312 (no drives attached, scsi2)
RedHat Enterprise Linux ES 3 U5 x86_64, latest updates
DELL PERC drivers:
Kernel from RHEL ES Beta:
Filesystems: ext3, no sw raid, no lvm
After running 1 or 2 bonnie++ instances (one on sda, one on sdb), the kernel crashed after a few hours of benchmarking in a loop.
I also tried kernels 2.4.21-35.EL, 2.4.21-32.0.1.EL and kernel-2.4.21-27.EL with the default megaraid2 from RH.
Running bonnie++ on the RH driver causes SCSI I/O errors after some time of testing.
Write throughput may also drop to 100 blocks/sec (according to iostat) even when there are many dirty buffers to be written; bonnie++ then takes more than a night to finish. Reading from the same disk (dd if=/dev/sda1) at the same time seems to improve write performance again.
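The benchmark loop described above could be sketched roughly as follows. This is only an illustration, not the exact procedure used for this report: the helper name run_in_loop, the run count and the test directory are assumptions; -d (test directory) and -u (user to run as) are standard bonnie++ options.

```shell
#!/bin/sh
# run_in_loop MAX_RUNS COMMAND [ARGS...]
# Runs COMMAND up to MAX_RUNS times, printing one status line per run, and
# stops at the first failure so the failing iteration is easy to find.
run_in_loop() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        if "$@"; then
            echo "run $i: ok"
        else
            # $? here is the exit status of the command that just failed
            echo "run $i: failed with status $?"
            return 1
        fi
        i=$((i + 1))
    done
}

# Example use (not run here; /mnt/sda-test is a hypothetical mount point):
#   run_in_loop 100 bonnie++ -d /mnt/sda-test -u root
```

Running one such loop per disk, each in its own shell, corresponds to the "one on sda, one on sdb" setup.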
Please attach console panic or oops output (capturing it with a serial
line if necessary). Thanks in advance.
Also, please post the I/O error messages you got while running the default
megaraid2 driver, as shipped by RH. It would be most helpful if you used
your most recent RH kernel (2.4.21-35.EL?). Thanks.
*** Bug 167167 has been marked as a duplicate of this bug. ***
Created attachment 118334
(partial) oops trace and messages from 2.4.21-35.EL+megaraid2-v18.104.22.168-1dkms
I've now been running kernel-2.4.21-32.0.1.EL+megaraid2-v22.214.171.124-1dkms for 23 hours
without a crash. I'll boot with the RH megaraid2 driver and try to reproduce a crash.
Created attachment 118335
kernel-2.4.21-32.0.1.EL + RH megaraid v126.96.36.199-RH1 I/O error
Here is an I/O error from running kernel-2.4.21-32.0.1.EL + default RH
megaraid2 (v188.8.131.52-RH1). I have only one I/O error here, because at that time
I didn't have remote syslog set up yet. Also, all errors with the default megaraid2
driver, as far as I have noticed, were only on sdb/scsi1 (the internal PERC).
I wasn't able to reproduce the crash or I/O error now, even after about 23 hours of
testing with kernel-2.4.21-35.EL + RH megaraid driver. I am continuing to test (now
with a fresh reboot again).
However, what I experience every time is a slowdown of disk operations.
After a reboot, both sda (on PERC 4e/DC) and sdb (on PERC 4i/DC) are very fast:
when running bonnie++, sda can do about 50-110k blocks per second and sdb about
40-50k blocks per second (iostat 1). But if I run it first on one disk and then on
the second one, OR on both at the same time, the performance on both drops to at
most 3-4k blocks per second, and sometimes they do nothing for a very long time.
Although running 'dd if=/dev/sda1 | cat - > /dev/null' (or sdb...) helps a bit for a
short time (e.g. from 100 to 5k blocks per second), it soon drops back to nearly no
performance. If one disk becomes slow, the other one becomes slow too.
When running bonnie++ on both disks simultaneously (after a reboot), the performance
of both disks is not at the highest level either, but it is "acceptable" :-/ as
they do about 30-40k blocks per second (both disks counted together). After about
30 minutes it drops to at most 3-4k blocks per second anyway.
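A slowdown like this can be caught automatically by filtering the iostat output. Here is a minimal sketch, assuming the classic sysstat device report layout (Device, tps, Blk_read/s, Blk_wrtn/s, Blk_read, Blk_wrtn). The function name slow_samples and the 5000 blocks/s threshold are made up for illustration, and the column positions may differ between sysstat versions.

```shell
#!/bin/sh
# slow_samples DEVICE MIN_BLOCKS
# Reads iostat device-report lines on stdin and prints the samples for
# DEVICE whose combined read+write rate (columns 3 and 4, in blocks/s) is
# below MIN_BLOCKS. The column positions are an assumption about the layout.
slow_samples() {
    awk -v dev="$1" -v min="$2" '
        $1 == dev && NF >= 4 {
            rate = $3 + $4
            if (rate < min) printf "%s %.0f\n", $1, rate
        }'
}

# Example use (not run here):
#   iostat -d 1 | slow_samples sda 5000
```

Each printed line is one sample in which the named device fell below the threshold, so a steady stream of lines indicates the degraded state.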
For example, last night I had 2 bonnie++ instances running, and both were still
running (one was already reading...) after 16 hours.
There is no background task running on the disks. Both raid controllers
It might or might not be related to the crashing issue but this slow-down also
makes the machine unusable:(
The other problem I've seen is that interactive responsiveness is really horrible
with these disk benchmarks running. The machine is not swapping, but the shells
might take up to 10 minutes to become responsive. However (tested only once), it
became much better when I executed
echo 1 > /proc/sys/vm/skip_mapped_pages
echo 1 10 15 > /proc/sys/vm/pagecache
echo "30 500 0 0 500 3000 80 50 0" > /proc/sys/vm/bdflush
as was advised in some other bugs here.
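Since these echo commands overwrite the current settings, it may be worth saving the old values first. Below is a minimal sketch of a save/restore helper. The names save_tunables and restore_tunables and the backup directory are hypothetical, and it assumes each file accepts its own output written back unchanged.

```shell
#!/bin/sh
# save_tunables BACKUP_DIR FILE...
# Copies the current contents of each tunable file into BACKUP_DIR so the
# settings can be put back later. Files that cannot be read are skipped.
save_tunables() {
    backup_dir=$1
    shift
    mkdir -p "$backup_dir" || return 1
    for f in "$@"; do
        [ -r "$f" ] || continue
        cat "$f" > "$backup_dir/$(basename "$f")"
    done
}

# restore_tunables BACKUP_DIR FILE...
# Writes each saved value back to its original file; files without a
# saved copy are left untouched.
restore_tunables() {
    backup_dir=$1
    shift
    for f in "$@"; do
        saved="$backup_dir/$(basename "$f")"
        [ -r "$saved" ] || continue
        cat "$saved" > "$f"
    done
}

# Example use (as root, not run here; /root/vm-backup is hypothetical):
#   save_tunables /root/vm-backup /proc/sys/vm/skip_mapped_pages \
#       /proc/sys/vm/pagecache /proc/sys/vm/bdflush
```

Calling restore_tunables with the same arguments after the test run should put the previous values back.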
Hi again. Unfortunately I wasn't able to reproduce the crash in the last week
(before I reported this, it crashed with all but one of the kernels I tried; the
exception was 2.4.21-32.0.1.EL + megaraid2 from Dell). The only difference is that I
now run without hyperthreading, but in the past I had one crash without HT as well.
What remains are the performance problems. If I boot with mem=3G there is no
performance degradation at all.
If I boot with mem=6G, the speed on sda drops to 5-15k blocks per second (from
50-110k) after a short time.
However, this is with controller settings CachedIO+WRTHRU. The machine also
spends 100% in IOwait.
If I switch to CachedIO+WRBACK, it has no performance problems at all with either
3G or 6G. I'll test it with 12G later. The machine also takes about 30-70 in
Created attachment 118690
kernel 2.4.21-35.EL + RH megaraid driver
This is a console dump from a crash of 2.4.21-35.EL with the original RH megaraid
driver. The crash happened just a few hours after boot (2 bonnie++ running), and
the only difference from the previous attempts was that both logical
drives (sda and sdb) were now set up in the RAID BIOS as WRBACK + CachedIO.
Created attachment 118791
Kernel-2.4.21-32.0.1.EL with dell's megaraid2-v184.108.40.206-1dkms crash dump
Kernel crash dump for kernel-2.4.21-32.0.1.EL with Dell's
Created attachment 118842
2.4.21-32.0.1.EL error with Patrol Read disabled
I turned off Patrol Read (as recommended by Dell support) and the original RH
kernel 2.4.21-32.0.1.EL froze (I wasn't able to type anything on the console,
serial line or ssh) within 10 minutes and generated a lot of error messages
(see attachment). Shortly before it froze completely, I was able to observe:
* bonnie++ on sdb was running without any problem
* bonnie++ on sda was still running but had frozen
* there were 4 pending commands on scsi0 (sda) (/proc/megaraid/0/stat)
* the first few error messages appeared on the console
I'm now running 2.4.21-32.0.1.EL with Dell's megaraid2...
The problem is solved by the new BIOS A02 for the PE2850. From the release notes:
* Added workaround for lockup resulting from the systems with 8GB RAM or more
and RAID storage controller potentially claiming inappropriate addresses.
Thanks for help, you can close this bug now.
Closing as NOTABUG based on last comment.