Description of problem:
Customer is running into performance problems when running with
Informix. They have reported that the RHEL 3.0 (v2.4.21-4.EL and
v2.4.21-9.0.1.EL) host slows down and pauses when running Informix
I/O over FC attach, but when running the app on local disk, the
performance is acceptable. They have tried this with both the
v6.06.00b11 and v6.07.02 QLogic drivers and both with and without
PowerPath. The summary is as follows:
When running PP w/AS 2.1, when Informix archiving was started
it had a permanent impact on the performance of the host.
When running PP w/3.0, system performance returns to normal
once the archiving process finishes.
Lastly, and most importantly, the customer removed PP from
the 3.0 environment and the SAME THING HAPPENED: system performance
returns to normal once the archiving process finishes.
This is very readily reproducible at the customer site.
Does this appear to Red Hat to be related to the __make_request bug
that is being fixed in the RHEL 3.0 U2?
Hello, Heather. I'm a bit confused about this bug report. It seems
that once archiving completes, performance returns to normal. Isn't
that the expected behavior?
Also, I don't have any record of a "__make_request bug" that is
being fixed in RHEL3 U2. Could you please refer to a specific
bugzilla id or changelog entry to identify this?
(I'm tentatively reassigning this to Doug Ledford.)
Created attachment 99603 [details]
diff file between ll_rw_blk.c in v2.4.21-9.0.1.EL vs. v2.4.21-13.EL
The changes referred to are between the ll_rw_blk.c file in v2.4.21-9.0.1.EL
vs. v2.4.21-13.EL, specifically in the areas of the request queue CPU masks and
when the queue gets plugged/unplugged.
The changelog refers to:
disable linux-2.4.21-scsi-affine-queue patch
Doug/Ernie: On the EMC status call today, EMC is looking for any insights
from us on whether the changes made to ll_rw_blk.c between
2.4.21-9.0.1.EL and 2.4.21-13.EL could be related to this issue.
Per the customer, they have downloaded and installed the RPMs for the
RHEL 3.0 U2. Unfortunately, the new kernel does not appear to have
resolved the problem. Do you have any idea on what the issue may be
and how to resolve it?
Please let me know if additional information is required.
Any suggestions/comments on what may be the issue at this customer
site would be welcome. Any feedback is appreciated.
OK, the problem statement here is not clear. So let me ask for
clarification. Is the problem that AS2.1 has permanent poor
performance after starting an Informix archive operation and is
therefore unusable and that RHEL3 U2 has temporary poor performance
during an Informix archive operation and is therefore unusable? If
so, then am I correct in assuming that RHEL3 U2 is closer to what they
want, aka no performance impact noticeable during an archive operation,
but not close enough? Finally, you say that with local disks the
problem doesn't exist, but can you tell me what local disks means? Is
it actual SCSI disks in the machine, and if so attached to what SCSI
controller, what class of disks, etc. For the FC attachment, is it
going through a fiber switch? If so, have they tried direct FC
attachment to see if it behaves similar to direct SCSI attached disks?
Since they are using the QLogic driver, can you get the output of
dmesg that shows the QLogic driver's configuration, including things
like the tagged queue depth on the attached device(s) and attach it to
this bugzilla? The contents of the /proc/scsi/qla*/* files might be
useful as well.
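For reference, the requested information could be gathered along these
lines (a sketch only; the qla2300 directory name is an assumption and
varies with the driver and HBA model actually in use):

```shell
# Capture the driver's boot/probe messages, including tagged queue depth.
dmesg > dmesg.out

# Copy the QLogic proc entries; match whatever appears under /proc/scsi/.
mkdir qla-proc
cp -r /proc/scsi/qla* qla-proc/

# Bundle everything for attachment to the bugzilla.
tar czf qla-info.tar.gz dmesg.out qla-proc/
```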
Created attachment 99759 [details]
requested files in tarball are attached
RHEL 2.1 is known to have performance issues in general and is used
here as a comparison to RHEL 3.0. RHEL 3.0 does perform better than
RHEL 2.1 for the customer, but the performance is still not
acceptable.
There are no noticeable improvements between RHEL 3.0 and RHEL 3.0 U2.
The local disks are attached to the internal SCSI controller which in
this case is a MegaRAID controller.
Yes, the fibre attach is via a switch. They do not have the time,
nor at this point the willingness, to try a direct attach.
I do not have the most recent output from the customer's system using
the U2. I will attach the files from the prior v2.4.21-4.0.1 config.
Heather, could you please ask the customer to try decreasing the read
latency with elvtune?
It may be possible that the IO queues are simply too big for the
fibrechannel disks, leading to high latencies and the bad read
performance that goes with long queues.
Would you please provide recommendation as to what value the read
latency should be decreased to?
It would be worth trying some different values, to get the right
latency/throughput tradeoff for your configuration. I would probably
try 8, 32 and 128, maybe even a higher value.
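A minimal sketch of that tuning pass, assuming the FC LUN shows up as
/dev/sdb (adjust to the real device; elvtune ships with util-linux on
RHEL 3):

```shell
# Show the current elevator settings for the device.
elvtune /dev/sdb

# Lower the max read latency step by step, re-running the Informix
# archive workload after each setting to compare behavior.
elvtune -r 128 /dev/sdb
elvtune -r 32  /dev/sdb
elvtune -r 8   /dev/sdb
```

The -r flag sets the maximum read latency only; write latency (-w) can
be left alone while isolating the read-side behavior.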
I looked over the files and didn't see anything overly odd except the
EMC Powerpath stuff is loaded. We've got lots of reports of poor
performance when that's in use. I'm not positive if it's related or
not. However, as of RHEL3 U2, there is a new service that is
available to admins, service mdmpd (the MD MultiPath Daemon), which
works similarly to EMC's Powerpath but without being a kernel module. It's
part of the mdadm-1.4.0-1 and later mdadm packages. If they are
willing to test with it instead of Powerpath, it would be useful. Let
me know if they'll give it a shot, and if so I'll post directions on
how to get it set up.
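As a rough sketch of what such a setup might look like (device names
are assumptions; the actual directions would come from Doug), a
multipath MD set over two paths to the same LUN could be created with:

```shell
# Assume /dev/sda1 and /dev/sdb1 are two paths to the same LUN.
mdadm --create /dev/md0 --level=multipath --raid-devices=2 \
      /dev/sda1 /dev/sdb1

# Start the multipath monitoring daemon shipped in RHEL3 U2,
# and enable it across reboots.
service mdmpd start
chkconfig mdmpd on
```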
Heather, I should note that I'm not trying to step on EMC's toes with
this suggestion. I don't know if you guys sell Powerpath or just give
it away with your external boxes. But, we do have a generic solution
for the multipath problem now (although still a bit young, the 2.6
version of the solution is nicer), and it doesn't require any
additional kernel modules besides the MD multipath that's been in the
kernel for a long time. The main thing is that it also doesn't sit
in between I/Os at all, so it would help identify the possible source
of the high latency should the processing in Powerpath be involved.
We have not tested the mdmpd and therefore won't recommend that a
customer attempt to use it with our storage as we are not prepared to
appropriately support the customer.
Please note that the problem being reported by the customer occurs
both with and without PowerPath.
OK, so the general problem description right now is that it happens
with fiber channel through a switch, with and without powerpath, and
does not happen on locally connected disks. Basically, this should
exonerate the scsi mid layer, the block layer, the sd driver. It
leaves in question issues related to the qlogic driver vs. the
megaraid driver and it leaves the question of direct attach vs.
through a fiber switch. The __make_request changes could impact this.
Any kernel version later than 2.4.21-12.EL has those changes in it.
If these later kernels still have the same slowdown problem, then my
first suspect would be some interaction between the switch fabric and
the load patterns coming out of the qlogic driver.
I believe v2.4.21-15.EL has resolved the problem at the customer site
so I am closing this issue.
Thanks for your suggestions and help.