Description of problem:
Sue Denham requested that I open a bugzilla entry for this performance issue.
Basically, the issue appears when performing a random mix of reads and writes
to the same LUN: the request service times against that device creep up. When
performing mostly reads or mostly writes to the same LUN, this behavior is not
seen. Based on the analyses so far, the cause may be a spinlock issue in the
kernel, a QLogic driver issue, or an OCFS issue.
Dell, QLogic, and Oracle are working on this as well.
The configuration used by Dell is as follows:
RHEL 2.1 AS v2.4.9-e.24
both v6.04.01 and v6.05.00 drivers have been used
2 x 6650s with 4GB memory
PowerPath v3.0.3 b065
Oracle9iRAC & OCFS
CX600 Release 2.02.1.x
When the v6.05.00 driver was used, the following parameters were modified:
execution throttle = 256
In the QLogic driver, the highmem_io parameter doesn't compile into the
v6.05.00/v2.4.9-e.25 driver/OS combo so this was left at the default.
sg_segments (128) and max_sectors (1024) were increased in the v6.05.00 driver.
Please note that this is with OCFS, not raw devices.
There are 4 paths to the devices and the devices are split between the 2 SPs.
There are also two switches.
PowerPath is set up so that the CLARiiON Optimize policy is being used so that
there is load balancing across the SPs. The load balancing is operating.
VisualSAN is also being used and shows the traffic across the switches.
On all reads or all writes, the performance has been okay, but when trying to
do both reads and writes, the performance suffers. The service wait times go
way up from the Linux point of view, according to iostat.
With the same system set up, they tried v2.4.9-e.25 and the v6.05.00 driver,
but it made no difference.
Dell has been using iostat, which is reporting a wait time of 80ms and a
service time of 25ms on both of the data LUNs with a mixed workload of reads
and writes. In monitoring /proc/scsi/qla2300/1, the total number of active and
queued commands is usually low (0 or 2).
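The /proc entry above can be sampled with a small wrapper while the workload runs. A rough sketch: the exact counter names vary between qla2x00 driver versions, so the grep pattern here is an assumption to be adjusted against your own /proc output.

```shell
# show_qla_counters: print queue-related counters from a QLogic proc file.
# Default path assumes HBA instance 1 of the qla2300 driver.
show_qla_counters() {
    f=${1:-/proc/scsi/qla2300/1}
    if [ -r "$f" ]; then
        # counter names differ across driver versions; widen as needed
        grep -iE 'queue|pending|outstanding|active' "$f"
    else
        echo "cannot read $f" >&2
        return 1
    fi
}

# sample every 5 seconds while the workload runs, e.g.:
#   while sleep 5; do show_qla_counters; done
show_qla_counters 2>/dev/null || true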
8/15/03 4:59:26 PM Heather Conway:
Feedback from QLogic:
The device_queue_depth is set to 64, but from the above data (initial entry in
the OPT) it looks like the devices are starving. The processes running are not
generating enough I/Os to hit the device queue_depth of 64.
The driver is just a pass-through: it takes the I/O, sends it out on the wire,
and then posts the completion to the OS when it's done. If it's not receiving
enough I/Os to keep all the devices busy, then the performance is going to
suffer. So my suggestion would be to take whatever application is being used
to generate the I/Os and increase the number of processes spawned to
something higher, like 100 or 128. In other words, most of the time the
"pending request" count in /proc should be somewhat close to the
max_queue_depth.
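QLogic's suggestion above (spawn on the order of 100 concurrent I/O processes so the per-device queue actually fills) can be sketched with plain dd workers. The target path and sizes here are illustrative assumptions; against a real LUN you would point TARGET at a test device, never a live one, and raise NPROCS toward 100-128.

```shell
# Spawn NPROCS concurrent writers, each at its own offset, so
# requests overlap at the device instead of arriving one at a time.
TARGET=${1:-/tmp/ioload.dat}
NPROCS=${2:-8}       # QLogic suggested 100-128 against a real LUN
BLOCKS=${3:-64}      # 4k blocks written per worker

i=0
while [ "$i" -lt "$NPROCS" ]; do
    # conv=notrunc keeps workers from truncating each other's data
    dd if=/dev/zero of="$TARGET" bs=4k count="$BLOCKS" \
       seek=$((i * BLOCKS)) conv=notrunc 2>/dev/null &
    i=$((i + 1))
done
wait
```

While this runs, the "pending request" count in the driver's /proc entry should climb toward the queue depth if the driver really is being fed enough work.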
The wait time is from the "iostat" perspective, so it's difficult to say where
in the stack the time is being spent. As a matter of fact, the locking in the
v2.4.x kernel's SCSI subsystem is not very granular: the io_request_lock is
shared by many layers of the stack. One experiment worth trying is to disable
the tasklet in the driver and see if that makes any difference. You can
disable it by turning off the following flag in the qla_settings.h file:
#define QLA2X_PERFORMANCE 0 /* By default this is enabled */
8/15/03 5:01:14 PM Heather Conway:
QLogic also made the recommendation that Dell modify their driver further:
Try disabling the "use_clustering" support in the qla2x00.h file and also
increasing max_sectors from 1024 to 8192.
I do not know yet if Dell has made these changes and, if they have, what the
results were.
8/15/03 5:19:50 PM Heather Conway:
Per Bob at EMC:
According to Dell, they changed their Oracle testing by putting the Oracle DB
files on SCSI disks directly attached to the server, as opposed to going
through QLogic to get at the CLARiiON LUNs. In the process of doing so, iostat
showed 3 times the activity, and yet a much lower average I/O service time.
From that, they concluded that in theory it is NOT the Linux generic SCSI I/O
layer that is the culprit (the app is still doing the same I/O, and the OS is
still going through the file system code, etc.). They are claiming that it has
to be in QLogic.
From what I gathered from QLogic's response, their claim is also that the
device driver layer is not being fed enough I/O to pass down. So, they are
also maintaining that the long wait and service delay is NOT in their layer,
but in some other layer above them.
Bob ran his own testing and I have enclosed the output from his testing.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
The issue appears when performing random I/O mix to same LUN. The request
service times against that device will creep up. When performing mostly reads
or mostly writes to the same LUN, this behavior is not seen.
Sue should know that we close bugs that involve binary only modules that can't
be reproduced without those modules being involved. So the question is: can you
reproduce this without binary only kernel modules being involved ?
The problem can still be replicated with or without PowerPath.
Removing PowerPath from the bug subject, then.
I assume EMC has tried using elvtune to counterbalance the AS2.1 default
tuning that favours writes at the cost of reads?
Has this been tried with an earlier RHEL 2.1 kernel with the original QLogic
driver we shipped?
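For reference, a minimal sketch of the elvtune experiment being suggested. The latency values below are illustrative assumptions, not recommendations, and elvtune only exists with the 2.4-era elevator (util-linux on RHEL 2.1); lowering the read latency bound relative to the write bound keeps queued writes from starving reads.

```shell
# Rebalance the 2.4 elevator so reads are not starved by writes.
# Values are illustrative; requires root and a 2.4 kernel.
tune_elevator() {
    dev=$1
    if command -v elvtune >/dev/null 2>&1; then
        elvtune "$dev"                  # show current read/write latencies
        elvtune -r 1024 -w 2048 "$dev"  # tighten both, favouring reads
    else
        echo "elvtune not available (2.4-era util-linux only)"
    fi
}

tune_elevator "${1:-/dev/sda}"
```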
Mo was going to explore the performance on e.16. Mo, any word on this?
This testing was performed at Dell and at EMC in California. It was done with
RHEL 2.1 v2.4.9-e.24 and v2.4.9-e.25 with the default v6.04.01 driver as well
as the EMC approved v6.05.00 driver. RHEL v2.4.9-e.12 was also used, but with
the v6.05.00 driver. The test was run with Emulex LP9802 HBAs and the v1.22e
driver as well.
The Qlogic drivers were used in their default configurations as well as
modified in an attempt to enhance performance.
I do not know if elvtune was used.
Both the CLARiiON array and the QLogic driver reported that I/O was trickling
down and there was no queue build-up whatsoever.
Apparently, the performance problem is more readily seen with a heavy mix of
random I/O against multiple LUNs, not with high random I/O against a single
LUN. With the random I/O mix, the request service time against the devices
builds up, and with the increased service time, the wait queue grows.
One very interesting datapoint would be the version 5 driver as RHEL ships it
(and used to ship as the default).
Also, is it possible to obtain a (relatively small) test program for this so
that we can try it in our own labs?
The v5.31 driver doesn't work properly in a fabric; the host hangs when
attached to a fabric while trying to load the v5.x series driver.
I will check to see if the test program may be distributed.
Created attachment 94177 [details]
test_io_response.csh script enclosed
From Dell Engineering (David Mar, Mohammed Kahn, et al)----------
So here are a few more details. Basically this is based upon discussions with
EMC and our tests.
We have a program that is writing to different offsets on an sd device in a
random fashion. We see that the backend storage does not see the queue, and it
seems that the backend SAN is servicing requests as fast as it receives them.
To eliminate the possibility of a bottleneck in the driver, we tried both
QLogic and Emulex. To verify this, we looked in /proc/scsi/scsi; the queue
doesn't seem to be building up there, and there is no queue.
Of interest, however, is that we see the number of merges per second is
similar to the number of requests, implying that a significant portion is
merged. Why? Is it because the blocks being read are smaller than 4k?
We are speculating that there is a problem with the I/O within the kernel such
that the sorting algorithm on the queue causes the average wait time to
increase. This in turn builds up the queue, and as the sort algorithm
attempts to sort a larger queue, it causes the wait time to escalate.
It may make sense to sort on an sd device with a single head so that the head
doesn't have to move back and forth on the disk. But on a SAN, where there are
a number of spindles and most requests are cache hits, sorting may actually
slow performance.
Since we don't see the problem with OCFS on direct-attached SCSI, yet we see
the problem when attaching to the SAN, it further implies that the sort
algorithm is not well suited to a SAN device.
Our question: does it seem plausible that the sort algorithm slows down I/O
against SAN devices?
Has this issue been resolved? Any workarounds?
I am currently having the same problem. Hardware is
3x Dell 2650, each with 2x qla2300. SAN is an EMC CLARiiON.
Software is 2.1AS, Oracle 9iRAC, and OCFS. I tried a bunch of kernels
and qla drivers, also with and without PowerPath.
I have a staging setup with the same software on 2x Dell 360 with qla2300 and
a cheap Infotrend Fibre ATA SCSI RAID system.
This had the same problems until I installed driver version
6.06.10 from QLogic. The same driver does not help the other setup.
I am wondering if we have run into the same problem. We are using HP DL580
G2 with an HP MSA1000 as the storage array. Using iostat we do not see much
I/O happening (it's not reporting anything close to 100%), while
utilities such as HP OV Performance Agents together with Glance Plus
report 100% busy. We are currently using a single path on two nodes with a
QLA2312 HBA, using the driver included in the e.27 kernel (6.04.01), and I
see that QLogic has a version for HP posted (6.06.50).
I am still trying to determine which utilities I can trust with regard to
ad-hoc stats (such as iostat/Glance Plus) and long-term statistics
(e.g., something based on rrdtool that collects data at regular intervals).
1. Could you please provide a clearer explanation of the impact
of this performance issue? Does the performance problem that you are
acknowledging affect only external FC storage through HBAs, as opposed
to internal drives on other types of I/O controllers?
2. Some customers may not be able to requalify and redeploy their
applications to a new OS environment. Is Red Hat working with other
vendors such as SAP, PeopleSoft, to have those vendors' applications
requalified/recertified on RHEL 3.0?
3. Does Red Hat have any type of "official" statement that is ready
for distribution to the general public on this topic? If so, could
you please share it?
I'm still looking into this issue, although I have been rather
distracted by other issues while helping get GFS pushed out. My
testing so far has not indicated a performance problem, so I will have
to rethink my approach to the problem.
The test script that is attached to this bug concerns me in that it is
looking at the numbers in /proc/partitions (i.e. iostat -x). It seems
to me that a more accurate approach to analyzing the performance
problem is to look at the time that it takes writes to go through the
filesystem using O_SYNC.
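A minimal sketch of that approach, assuming GNU dd is available (oflag=sync opens the output file O_SYNC, so each write completes at the device before the next is issued). The path and request count are illustrative; in practice FILE would sit on the OCFS filesystem under test, and elapsed time divided by COUNT approximates per-request service time without going through /proc/partitions.

```shell
# Time O_SYNC writes through the filesystem itself.
FILE=${1:-/tmp/osync_test.dat}
COUNT=${2:-100}

# dd's summary line (stderr) reports bytes copied, elapsed time,
# and throughput for the synchronous writes.
dd if=/dev/zero of="$FILE" bs=4k count="$COUNT" oflag=sync 2>&1 | tail -n 1
```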
(Note to Heather: You mentioned to me that you had a version of the
script that wasn't corrupted. Could you please repost that to this bug?)
Another concern that I have is the driver version that is being used.
Is the performance problem observed when using the qlogic driver
that ships with RHEL 3 or RHEL 2.1? Or is it observed only with
the EMC-approved qlogic driver?
Point of information: Heather indicated in today's call that the issue
was driver-independent. The problem was observed with both the 6.04.01 and
6.05 drivers.
The recent patches in comment #152 bug #121434 might help this bug,
although they probably won't provide the complete fix to the problem.
It would be interesting to see if this provided any performance increase.
I'm changing the status of this bugzilla. At this point in time, the
recommendation for customers with this issue is to migrate to RHEL 3.0 or RHEL