Red Hat Bugzilla – Bug 204628
kernel: Megaraid diskdump painfully slow on ES7000/ONE
Last modified: 2007-11-16 20:14:53 EST
Description of problem:
We're having a problem with the amount of time it takes to crash dump via the
MegaRaid 320-2X card. I tested this on an ES7000/ONE with 16GB of memory and a
full memory dump took 1 hour 12 minutes and 28 seconds. After many different
tests with various system configurations and memory sizes between 8 and 32GB, we
determined that the problem was that we were quite often falling into the
megaraid_diskdump_sleep() function, which has a 100-microsecond sleep inside it
(line 3963). By tuning this value, I found that a 50 microsecond delay made the
dump time acceptable.
Version-Release number of selected component (if applicable):
This happens on RHEL 4 update 3 with the 2.6.9 kernel.
Steps to Reproduce:
Configure a diskdump to go to a partition on the LSI MegaRaid 320-2X card. Crash
the system using the manual crash trigger (echo c > /proc/sysrq-trigger).
The reason for the problem is that the ES7000/ONE has a small amount of
additional hardware between the PCI buses and the processor, which slightly
increases the amount of time each transaction takes. The additional delay is
long enough that we end up in the sleep code more often than we'd like.
I believe that changing the delay is important for our ES7000 system
performance, so I would like it to be put into the release kernel.
Also, there is a bugzilla entry described in the kernel source RPM's changelog
that may be related to this issue (#151517), but I don't have access to it.
Created attachment 135652 [details]
a patch to change the megaraid sleep time from 100 to 50 usec.
Kimball, do you agree with the delay change proposed in comment #1?
Please confirm that reducing the megaraid sleep time from 100 to 50 usec is
safe. Was there any specific reason that 100 usec was chosen initially?
The change looks safe. The function is called after a system panic, and it just
spin-loops until it receives the I/O completion result from the megaraid
adapter. A 50 usec delay would simply make the driver poll the megaraid
adapter's register more frequently. I have also confirmed this patch with
Fujitsu Japan.
In local testing, I didn't see any problems from reducing the delay to 50 usec.
As an additional comment, the slowness of diskdump caused by the delay in
this polling can be dramatically improved by tuning the block_order value
(for example, by adding "options diskdump block_order=8" to /etc/modprobe.conf).
The value affects the size of the buffer that is written in one go
(buffer size = page size * 2 ^ block_order), so the larger the block_order,
the smaller the overhead of the check becomes.
If you haven't tried it yet, I think it's worth a try.
Unfortunately, Stratus does not use this card, so I may not be of much help.
I am curious about the purpose of the udelay. I assume that if it were removed
altogether, the hardware would become annoyed at all the polling? If so, then
how do we know for sure what the right delay is? And if not, then why not remove
it altogether?
While testing different delay values I found that (at least on the ES7000) any
delay less than 50 resulted in increasingly slower times, with a delay of 20 or
less giving a speed equal to that with the delay set to 100 or more.
Have you tried tuning the block_order parameter?
FYI, this is what I got by full dump on a machine with 6GB RAM
using the original driver with udelay(100):
- block_order=2 (default) 46 minutes
- block_order=6 14 minutes
- block_order=8 8 minutes
- block_order=10 10 minutes
(The megaraid is write-through mode.)
I'll try some of those values right away. :) As a side question, is there a way
to determine what this value should be for any given card, or is it driver
dependent? Is there an easy way to tell what the maximum usable value is?
The optimal value of block_order depends on the system;
you have to try a few values to determine the best one.
So far in my experience (aic79xx, megaraid, mptscsi),
megaraid with write-through cache mode is the only card
that is extremely slow with the default block_order value.
An easy way to find the maximum usable value is to set it in modprobe.conf,
run "service diskdump restart", and then check /proc/diskdump to see whether
the value appears there.
If the value is too large, the diskdump driver will reduce it to the maximum
it supports. For RHEL4/x86_64, the max is 10.
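The check described above can be run as the following commands; the file paths and the service name are the ones mentioned in this report, and the appended options line is just an example value.

```shell
# Request block_order=8; if it exceeds the driver's limit,
# diskdump will clamp it to the maximum it supports.
echo "options diskdump block_order=8" >> /etc/modprobe.conf

# Restart diskdump so the new option takes effect.
service diskdump restart

# Check that the block_order value in effect appears here.
cat /proc/diskdump
```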
I would prefer that you use the existing block_order parameter rather than
changing the sleep time in the megaraid driver; it carries less risk. I am
currently not planning to include this patch in 4.5.
Ben, please confirm that block_order parameter works for you. I will add a
release note about this.
Making this bug block the 4.5 release notes, for tracking purposes.
Waiting for Tom Coughlan to provide text for the release note.
I can confirm that this does improve our times. On our 32GB system I'm getting:
block_order=4, 41m 45s
block_order=6, 18m 29s
block_order=8, 15m 18s
block_order=10, 13m 38s
It definitely helps a lot. :)
RHEL 4 Release Note
(This probably fits best under "General Information".)
Slow disk dump performance may be improved using the "block_order" parameter.
The disk dump facility provides a parameter called "block_order". This parameter
specifies the I/O block size to be used when writing the dump. We have found
that the default value (2) works well for most adapters and system
configurations. An exception to this has been observed with the Megaraid
hardware in certain system platforms and configurations. This problem can be
solved by increasing the block_order parameter. In one case, the time to dump 6
GB of RAM was reduced from 45 minutes to 10 minutes.
Larger block_order values consume more module memory. Refer to
/usr/share/doc/diskdumputils-version/README for more information on the
block_order parameter.
Verified in release notes for next update release.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.