Bug 204628

Summary: kernel: Megaraid diskdump painfully slow on ES7000/ONE
Product: Red Hat Enterprise Linux 4
Component: redhat-release
Version: 4.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Ben Romer <benjamin.romer>
Assignee: Don Domingo <ddomingo>
QA Contact: Brian Brock <bbrock>
CC: bruce.vessey, charles.sluder, coughlan, ddomingo, jnomura, ntachino
Fixed In Version: RHBA-2007-0196
Doc Type: Bug Fix
Last Closed: 2007-05-01 22:54:08 UTC
Bug Blocks: 211071
Attachments:
  a patch to change the megaraid sleep time from 100 to 50 usec. (flags: none)

Description Ben Romer 2006-08-30 14:53:53 UTC
Description of problem:

We're having a problem with the amount of time it takes to crash dump via the
MegaRaid 320-2X card. I tested this on an ES7000/ONE with 16GB of memory, and a
full memory dump took 1 hour, 12 minutes, and 28 seconds. After many different
tests with various system configurations and memory sizes between 8 and 32GB, we
determined that the dump path was quite often falling into the
megaraid_diskdump_sleep() function, which has a 100-microsecond sleep inside it
(line 3963). By tuning this value, I found that a 50-microsecond delay brought
the dump time down to an acceptable level.
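
For illustration, the change I'm proposing amounts to a one-line tweak to that
delay; the file path and surrounding context below are paraphrased, not the
exact driver source:

    --- a/drivers/scsi/megaraid2.c    (illustrative path)
    +++ b/drivers/scsi/megaraid2.c
    @@ megaraid_diskdump_sleep() @@
    -    udelay(100);    /* poll interval while waiting for adapter I/O completion */
    +    udelay(50);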

Version-Release number of selected component (if applicable):

This happens on RHEL 4 update 3 with the 2.6.9 kernel.

How reproducible:

Configure a diskdump to go to a partition on the LSI MegaRaid 320-2X card. Crash
the system using the manual crashdump (echo c > /proc/sysrq-trigger).
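
In shell terms, the reproduction is roughly as follows (the device name is an
example, and the DEVICE= line assumes the usual diskdumputils sysconfig
interface):

    # point diskdump at a partition on the MegaRaid 320-2X (example device)
    echo "DEVICE=/dev/sdb1" >> /etc/sysconfig/diskdump
    service diskdump restart
    # force a crash dump
    echo c > /proc/sysrq-trigger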

Additional Info:

The reason for the problem is that the ES7000/ONE has a small amount of
additional hardware between the PCI buses and the processor, which slightly
increases the amount of time each transaction takes. The additional delay is 
long enough that we end up in the sleep code more often than we'd like.

I believe that changing the delay is important for our ES7000 system
performance, so I would like it to be put into the release kernel.

Also, there is a bugzilla entry described in the kernel source RPM's changelog
that may be related to this issue (#151517), but I don't have access to it.

Comment 1 Ben Romer 2006-09-06 14:08:47 UTC
Created attachment 135652 [details]
a patch to change the megaraid sleep time from 100 to 50 usec.

Comment 3 Peter Martuccelli 2006-12-08 13:42:16 UTC
Kimball, do you agree with the delay change proposed in comment #1?

Comment 4 Tom Coughlan 2006-12-08 14:01:51 UTC
Nick, Nobuhiro,

Please confirm that reducing the megaraid sleep time from 100 to 50 usec is
safe. Was there any specific reason that 100 usec was chosen initially?

Thanks.

Tom 

Comment 5 Nobuhiro Tachino 2006-12-08 15:02:48 UTC
The change looks safe. The function is called after a system panic, and it
just spin-loops until it receives the I/O completion result from the megaraid
adapter. A 50 usec delay would simply make the driver poll the megaraid
adapter's register more frequently. I will also confirm this patch with
Fujitsu Japan.


Comment 6 Jun'ichi Nomura (Red Hat) 2006-12-12 14:33:36 UTC
In local testing, I didn't see any problems from reducing the delay to 50 usec.

As an additional comment: slow diskdumps caused by the delay in this kind of
completion polling can often be dramatically improved by tuning the
block_order value. (For example, add "options diskdump block_order=8" to
/etc/modprobe.conf.)

The value sets the size of the buffer that is written in one go
(buffer size = page size * 2^block_order), so the larger the block_order,
the smaller the relative overhead of the completion check becomes.
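(With 4 KiB pages, block_order=2 (the default) means 16 KiB per write, while
block_order=8 means 4 KiB * 2^8 = 1 MiB per write.)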

If you haven't already, I think it's worth trying.


Comment 7 Kimball Murray 2006-12-12 18:03:31 UTC
Unfortunately, Stratus does not use this card, so I may not be of much help.

I am curious about the purpose of the udelay. I assume that if it were removed
altogether, the hardware would become annoyed at all the polling? If so, then
how do we know for sure what the right delay is? And if not, then why not
remove the udelay entirely?


Comment 8 Ben Romer 2006-12-12 19:15:10 UTC
While testing different delay values, I found that (at least on the ES7000)
any delay below 50 usec resulted in progressively slower dumps, and a delay of
20 usec or less was as slow as a delay of 100 usec or more.

Comment 9 Jun'ichi Nomura (Red Hat) 2006-12-13 19:13:59 UTC
Ben,

Have you tried tuning the block_order parameter?
FYI, this is what I got for a full dump on a machine with 6GB RAM,
using the original driver with udelay(100):
  - block_order=2 (default):  46 minutes
  - block_order=6:            14 minutes
  - block_order=8:             8 minutes
  - block_order=10:           10 minutes
(The megaraid is in write-through mode.)


Comment 10 Ben Romer 2006-12-13 19:42:39 UTC
I'll try some of those values right away. :) As a side question, is there a way 
to determine what this value should be for any given card, or is it driver 
dependent? Is there an easy way to tell what the maximum usable value is?

Comment 11 Jun'ichi Nomura (Red Hat) 2006-12-13 20:45:32 UTC
RE: comment #10,

The optimal block_order value depends on the system;
you have to try a few values to determine the best one.
In my experience so far (aic79xx, megaraid, mptscsi),
megaraid in write-through cache mode is the only card
that is extremely slow with the default block_order value.

An easy way to find the maximum usable value is to set it in modprobe.conf,
run "service diskdump restart", and then check whether the value you set
appears in /proc/diskdump.
If the value is too large, the diskdump driver will reduce it to an
acceptable size.
For RHEL4/x86_64, the max is 10.
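
For example (using the RHEL4/x86_64 maximum):

    echo "options diskdump block_order=10" >> /etc/modprobe.conf
    service diskdump restart
    cat /proc/diskdump    # check whether the value you set appears here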


Comment 12 Tom Coughlan 2006-12-15 15:00:00 UTC
I prefer that you use the existing block_order parameter rather than changing
the sleep time in the megaraid driver; less risk. I am not currently planning
to include this patch in 4.5.

Ben, please confirm that the block_order parameter works for you. I will add a
release note about this.

Comment 13 Don Domingo 2006-12-18 00:45:16 UTC
Making this bug block the 4.5 release notes, for tracking purposes.

Waiting for Tom Coughlan to provide text for the release note.

Thanks!

Comment 14 Ben Romer 2006-12-20 14:14:51 UTC
I can confirm that this does improve our times. On our 32GB system I'm getting:

block_order=4, 41m 45s
block_order=6, 18m 29s
block_order=8, 15m 18s
block_order=10, 13m 38s

It definitely helps a lot. :)


Comment 17 Tom Coughlan 2007-01-18 15:32:44 UTC
RHEL 4 Release Note
-------------------

(This probably fits best under "General Information".)

Slow disk dump performance may be improved using the "block_order" parameter. 

The disk dump facility provides a parameter called "block_order". This parameter
specifies the I/O block size to be used when writing the dump. We have found
that the default value (2) works well for most adapters and system
configurations. An exception to this has been observed with the Megaraid
hardware in certain system platforms and configurations. This problem can be
solved by increasing the block_order parameter. In one case, the time to dump 6
GB of RAM was reduced from 45 minutes to 10 minutes. 
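
For example, adding the line "options diskdump block_order=8" to
/etc/modprobe.conf increases the write size from 16 KB (the default on systems
with 4 KB pages) to 1 MB.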

Larger block_order values consume more module memory. Refer to
/usr/share/doc/diskdumputils-version/README for more information on the
block_order parameter. 

Comment 24 David Lawrence 2007-04-19 19:27:17 UTC
Verified in release notes for next update release.

Comment 26 Red Hat Bugzilla 2007-05-01 22:54:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0196.html