Red Hat Bugzilla – Bug 204628
kernel: Megaraid diskdump painfully slow on ES7000/ONE
Last modified: 2007-11-16 20:14:53 EST
Description of problem:
We're having a problem with the amount of time it takes to crash dump via the
MegaRaid 320-2X card. I tested this on an ES7000/ONE with 16GB of memory and a
full memory dump took 1 hour 12 minutes and 28 seconds. After many different
tests with various system configurations and memory sizes between 8 and 32GB, we
determined that the problem was that we were quite often falling into the
megaraid_diskdump_sleep() function, which has a 100-microsecond sleep inside it
(line 3963). By tuning this value, I found that a 50 microsecond delay made the
dump time acceptable.
Version-Release number of selected component (if applicable):
This happens on RHEL 4 update 3 with the 2.6.9 kernel.
Steps to Reproduce:
Configure a diskdump to go to a partition on the LSI MegaRaid 320-2X card. Crash
the system using the manual crash trigger (echo c > /proc/sysrq-trigger).
The reason for the problem is that the ES7000/ONE has a small amount of
additional hardware between the PCI buses and the processor, which slightly
increases the amount of time each transaction takes. The additional delay is
long enough that we end up in the sleep code more often than we'd like.
I believe that changing the delay is important for our ES7000 system
performance, so I would like it to be put into the release kernel.
Also, there is a bugzilla entry described in the kernel source RPM's changelog
that may be related to this issue (#151517), but I don't have access to it.
Created attachment 135652 [details]
a patch to change the megaraid sleep time from 100 to 50 usec.
Kimball, do you agree with the delay change proposed in comment #1?
Please confirm that reducing the megaraid sleep time from 100 to 50 usec is
safe. Was there any specific reason that 100 usec was chosen initially?
The change looks safe. The function is called after a system panic, and it just
spin-loops until it receives the I/O completion result from the megaraid
adapter. A 50 usec delay would simply make the driver poll the megaraid
adapter's register more frequently. I have also confirmed this patch with
Fujitsu Japan.
In local testing, I didn't see any problems from reducing the delay to 50 usec.
As an additional comment, the slowness of diskdump caused by the delay in
this polling can be dramatically improved by tuning the block_order value
(for example, by adding "options diskdump block_order=8" to /etc/modprobe.conf).
The value affects the size of the buffer that is written in one go
(buffer size = page size * 2 ^ block_order), so the larger the block_order,
the smaller the overhead of the check becomes.
If you haven't tried it yet, I think it's worth a try.
Unfortunately, Stratus does not use this card, so I may not be of much help.
I am curious about the purpose of the udelay. I assume that if it were removed
altogether, the hardware would become annoyed at all the polling? If so, then
how do we know for sure what the right delay is? And if not, then why not remove
it altogether?
While testing different delay values I found that (at least on the ES7000) any
delay less than 50 resulted in increasingly slower times, with a delay of 20 or
less giving a speed equal to that with the delay set to 100 or more.
Have you tried tuning the block_order parameter?
FYI, this is what I got by full dump on a machine with 6GB RAM
using the original driver with udelay(100):
- block_order=2 (default) 46 minutes
- block_order=6 14 minutes
- block_order=8 8 minutes
- block_order=10 10 minutes
(The megaraid is write-through mode.)
I'll try some of those values right away. :) As a side question, is there a way
to determine what this value should be for any given card, or is it driver
dependent? Is there an easy way to tell what the maximum usable value is?
The optimal value of block_order depends on the system;
you have to try a few values to determine the best one.
So far in my experience (aic79xx, megaraid, mptscsi),
megaraid with write-through cache mode is the only card
that is extremely slow with the default block_order value.
An easy way to find the maximum usable value is to set it in modprobe.conf,
run "service diskdump restart", and then check /proc/diskdump to see whether
the value appears there.
If the value is too large, the diskdump driver will reduce it to the maximum
it supports. For RHEL4/x86_64, the max is 10.
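The check described above can be run as the following commands; the file paths and the service name are the ones mentioned in this report, and the appended options line is just an example value.

```shell
# Request block_order=8; if it exceeds the driver's limit,
# diskdump will clamp it to the maximum it supports.
echo "options diskdump block_order=8" >> /etc/modprobe.conf

# Restart diskdump so the new option takes effect.
service diskdump restart

# Check that the block_order value in effect appears here.
cat /proc/diskdump
```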
I would prefer that you use the existing block_order parameter rather than
changing the sleep time in the megaraid driver; it carries less risk. I am
currently not planning to include this patch in 4.5.
Ben, please confirm that block_order parameter works for you. I will add a
release note about this.
Making this bug block the 4.5 release notes, for tracking purposes.
Waiting for Tom Coughlan to provide text for the release note.
I can confirm that this does improve our times. On our 32GB system I'm getting:
block_order=4, 41m 45s
block_order=6, 18m 29s
block_order=8, 15m 18s
block_order=10, 13m 38s
It definitely helps a lot. :)
RHEL 4 Release Note
(This probably fits best under "General Information".)
Slow disk dump performance may be improved using the "block_order" parameter.
The disk dump facility provides a parameter called "block_order". This parameter
specifies the I/O block size to be used when writing the dump. We have found
that the default value (2) works well for most adapters and system
configurations. An exception to this has been observed with the Megaraid
hardware in certain system platforms and configurations. This problem can be
solved by increasing the block_order parameter. In one case, the time to dump 6
GB of RAM was reduced from 45 minutes to 10 minutes.
Larger block_order values consume more module memory. Refer to
/usr/share/doc/diskdumputils-version/README for more information on the
block_order parameter.
Verified in release notes for next update release.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.