Bug 103300 - Performance issues with RHEL 2.1 AS/Oracle9iRAC/OCFS/CLARiiON arrays
Summary: Performance issues with RHEL 2.1 AS/Oracle9iRAC/OCFS/CLARiiON arrays
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Don Howard
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-08-28 16:39 UTC by Heather Conway
Modified: 2007-11-30 22:06 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-01-04 14:57:23 UTC
Target Upstream Version:
Embargoed:


Attachments
test_io_response.csh script enclosed (2.56 KB, text/plain)
2003-09-03 19:17 UTC, Heather Conway

Description Heather Conway 2003-08-28 16:39:23 UTC
Description of problem:
Sue Denham requested that I open a bugzilla entry for this performance issue.
Basically, the issue appears when performing a random I/O mix to the same LUN.  The 
request service times against that device will creep up.  When performing 
mostly reads or mostly writes to the same LUN, this behavior is not seen.
Based on the analyses so far, there may be a spinlock 
issue in the kernel, a QLogic driver issue, or an OCFS issue.
 
Dell, QLogic, and Oracle are working on this as well.  

The configuration used by Dell is as follows:
RHEL 2.1 AS v2.4.9-e.24
2xQLA2340s 
both v6.04.01 and v6.05.00 drivers have been used
2 x 6650s with 4GB memory
PowerPath v3.0.3 b065
Oracle9iRAC & OCFS
CX600 Release 2.02.1.x
When the v6.05.00 driver was used, the following parameters were modified:
max_queue_depth = 64
execution throttle = 256
In the QLogic driver, the highmem_io parameter doesn't compile into the 
v6.05.00/v2.4.9-e.25 driver/OS combo so this was left at the default.
sg_segments (128) and max_sectors (1024) were increased in the v6.05.00 driver. 
Please note that this is with OCFS, not raw devices.  

There are 4 paths to the devices and the devices are split between the 2 SPs. 
There are also two switches.
PowerPath is set up so that the CLARiiON Optimize policy is being used so that 
there is load balancing across the SPs. The load balancing is operating 
properly. 
VisualSAN is also being used and shows the traffic across the switches.

On all reads or all writes, the performance has been okay, but when trying to 
do both reads and writes, the performance suffers. The service wait times go 
way up from the Linux point of view according to iostat.

With the same system set up, they tried v2.4.9-e.25 and the v6.05.00 driver,
but it made no difference.

Dell has been using iostat, which is reporting a wait time of 80ms and a service 
time of 25ms on both the data LUNs with a mixed workload of reads and writes. 
In monitoring /proc/scsi/qla2300/1, the total number of active and queued 
commands is usually low (0 or 2).
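
The relationship between the two iostat figures quoted above can be sketched with a little arithmetic. This is a minimal illustration only: it assumes two snapshots of cumulative per-device counters, and the field names (ios_completed, wait_ms, busy_ms) are hypothetical, not taken from the kernel or from the iostat source.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Hypothetical cumulative per-device counters (names are assumptions)."""
    ios_completed: int  # reads + writes completed so far
    wait_ms: int        # total ms requests spent queued plus in service
    busy_ms: int        # total ms the device had at least one request active


def wait_and_service(prev: DiskSample, curr: DiskSample):
    """Return (await_ms, svctm_ms, queued_ms) for the interval prev -> curr."""
    ios = curr.ios_completed - prev.ios_completed
    if ios <= 0:
        return 0.0, 0.0, 0.0
    await_ms = (curr.wait_ms - prev.wait_ms) / ios
    svctm_ms = (curr.busy_ms - prev.busy_ms) / ios
    # Whatever is not service time was spent waiting in a queue.
    return await_ms, svctm_ms, await_ms - svctm_ms


# Counter values chosen so the result matches the 80ms / 25ms figures above.
before = DiskSample(ios_completed=0, wait_ms=0, busy_ms=0)
after = DiskSample(ios_completed=40, wait_ms=3200, busy_ms=1000)
print(wait_and_service(before, after))
```

Read this way, an 80ms wait with a 25ms service time would leave roughly 55ms per request spent queued somewhere above the device, which fits the observation that few commands are active or queued at the HBA.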

8/15/03 4:59:26 PM Heather Conway:
Feedback from QLogic:
The device_queue_depth is set to 64, but from the above data (initial entry in 
the OPT) it looks like the devices are starving. The number of processes 
running are not generating enough I/O's to hit the device queue_depth of 64. 
The driver is just a pass through so it takes the I/O and sends it out on the 
wire and then posts the completion to the OS when it's done. If it's not receiving 
enough I/Os to keep all the devices busy, then the performance is going to 
suffer. So my suggestion would be to use whatever application is generating 
the I/Os to increase the number of processes spawned to 
something higher like 100/128. In other words, most of the time in the /proc, 
one should see "pending request" somewhat close to the max_queue_depth. 
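
QLogic's point about keeping the queue full can be made concrete with Little's law (average requests in flight = request rate x time per request). The sketch below is only an illustration: the function names are hypothetical, the 64 and 25ms inputs are the queue depth and service time quoted in this report, and the 80 requests/s figure is an invented example.

```python
def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Little's law: average requests in flight at a given rate and latency."""
    return iops * latency_ms / 1000.0


def iops_needed(queue_depth: int, latency_ms: float) -> float:
    """Request rate required to keep queue_depth requests in flight on average."""
    return queue_depth * 1000.0 / latency_ms


# Rate needed to keep 64 commands pending at a 25 ms service time.
print(iops_needed(64, 25.0))
# Commands in flight if the host only issues 80 requests/s (invented figure).
print(outstanding_ios(80.0, 25.0))
```

Under these assumptions, a handful of processes each issuing one synchronous request at a time would stay far below the configured queue depth, which is why QLogic suggests spawning many more I/O generators.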

The wait time is from "iostat" perspective so it's difficult to say where in 
the stack it's spending the time. As a matter of fact, the locking mechanism in 
v2.4.x kernel from SCSI subsystem perspective is not very granular. The 
io_request_lock is used by a lot of guys on the stack, but what's worth the 
experiment is to try disabling the tasklet in the driver and see if that makes 
any difference. The way you can disable it is by turning off the following flag 
in the qla_settings.h file:

#define QLA2X_PERFORMANCE 0 /* By default this is enabled */

8/15/03 5:01:14 PM Heather Conway:
QLogic also made the recommendation that Dell modify their driver further:

Try after disabling the "use_clustering" support in qla2x00.h file and also 
increase the max_sectors from (1024) to 8192.

I do not know yet if Dell has made these changes and if they have, what the 
results are.

8/15/03 5:19:50 PM Heather Conway:
Per Bob at EMC:
According to Dell, they changed their Oracle testing by putting the Oracle DB 
files on SCSI disks directly attached to the server 
as opposed to going through QLOGIC to get at the CLARiiON LUNs. In the process 
of doing so, iostat showed 3 times the activity, and yet much lower average I/O 
service time. From that, they concluded that it was in theory NOT the Linux 
generic SCSI I/O layer that is the culprit (the app is still doing the same 
I/O. The OS is still going through the file system code, etc.). They are 
claiming that it has to be in QLOGIC. 

From what I gathered from QLOGIC's response, their claim is also that the 
device driver layer is not getting fed enough I/O to pass down. So, they are 
also maintaining that the long wait and service delay is NOT in their layer, 
but some other layer above them. 

Bob ran his own testing and I have enclosed the output from his testing.



Version-Release number of selected component (if applicable):
kernel-enterprise-2.4.9-e.24, kernel-smp-2.4.9-e.24

How reproducible:
every time

Steps to Reproduce:
The issue appears when performing a random I/O mix to the same LUN.  The request 
service times against that device will creep up.  When performing mostly reads 
or mostly writes to the same LUN, this behavior is not seen.
    
Actual results:


Expected results:


Additional info:

Comment 1 Arjan van de Ven 2003-08-28 16:44:51 UTC
Sue should know that we close bugs that involve binary only modules that can't
be reproduced without those modules being involved. So the question is: can you
reproduce this without binary only kernel modules being involved ?

Comment 2 Heather Conway 2003-08-28 17:43:34 UTC
The problem can still be replicated with or without PowerPath.

Comment 3 Arjan van de Ven 2003-08-28 17:47:31 UTC
removing powerpath from the bug subject then.


Comment 4 Arjan van de Ven 2003-08-28 17:48:54 UTC
I assume EMC has tried using elvtune to counterbalance the AS2.1 default tuning
that favours writes at the cost of reads ?

Has this been tried with an earlier RHEL2.1 kernel with the original qlogic
driver we shipped ?


Comment 5 Gary Lerhaupt 2003-08-28 18:05:37 UTC
Mo was going to explore the performance on e.16.  Mo, any word on this?

Comment 6 Heather Conway 2003-08-28 18:11:03 UTC
This testing was performed at Dell and at EMC in California.  It was done with 
RHEL 2.1 v2.4.9-e.24 and v2.4.9-e.25 with the default v6.04.01 driver as well 
as the EMC approved v6.05.00 driver.  RHEL v2.4.9-e.12 was also used, but with 
the v6.05.00 driver.  The test was run with Emulex LP9802 HBAs and the v1.22e 
driver as well.
The QLogic drivers were used in their default configurations as well as 
modified in an attempt to enhance performance.
I do not know if elvtune was used.
Both the CLARiiON array and the QLogic driver reported that I/O was trickling 
down and there was no queue build-up whatsoever.   
Apparently, the performance problem is more readily seen with a heavy mix of 
random I/O against multiple LUNs, not with high random I/O against a single 
LUN.  With the random I/O mix, the request service time against devices builds 
up and with the increased service time, the wait queue increases.

Comment 7 Arjan van de Ven 2003-08-28 20:11:51 UTC
One very interesting datapoint would be the version 5 driver as RHEL ships (and
used to ship as default)

also is it possible to obtain a (relatively small) test program for this so that
we can try this in our own labs ?

Comment 8 Heather Conway 2003-09-02 13:08:05 UTC
The v5.31 driver doesn't work properly in a fabric and, subsequently, the host 
hangs when attached to a fabric and trying to load the v5.x series driver.
I will check to see if the test program may be distributed.  

Comment 9 Heather Conway 2003-09-03 19:17:44 UTC
Created attachment 94177
test_io_response.csh script enclosed


Comment 10 Erich Morisse 2003-09-05 18:49:21 UTC
From Dell Engineering (David Mar, Mohammed Kahn, et al)----------

So here's a few more details.  Basically this is based upon discussions with
EMC and our tests.  

We have a program that is writing to different offsets on an sd in a random
fashion.  We see that the backend storage does not see the queue and it
seems that the backend SAN is servicing requests as fast as it receives
them.  

To eliminate the possibility of a bottleneck in the driver, we tried both
QLogic and Emulex.  To verify this we looked in /proc/scsi/scsi: the queue
doesn't seem to be building up there, and there is no queue.

Of interest, however, is that we see the number of merges per second is
similar to the number of requests, implying that a significant portion is
merged.  Why?  Is it because the blocks read must be smaller than 4k blocks?
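
One possible reading of the merge figures is that each application I/O reaches the request queue as several smaller pieces, which are then merged back together. The toy model below only illustrates that idea under this assumption; it is not the 2.4 elevator code, and the sector numbers are made up.

```python
def merge_contiguous(requests):
    """Merge requests that touch or overlap in sector space.

    requests: iterable of (start_sector, sector_count) tuples.
    Returns (merged_requests, merge_count).
    """
    merged = []
    merges = 0
    for start, count in sorted(requests):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This request begins where the previous one ends (or inside it),
            # so extend the previous request instead of queueing a new one.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
            merges += 1
        else:
            merged.append((start, count))
    return merged, merges


# A hypothetical 4 KB write submitted as eight 512-byte sectors, plus one
# unrelated request elsewhere on the device.
single_4k_write = [(1000 + i, 1) for i in range(8)]
print(merge_contiguous(single_4k_write + [(5000, 8)]))
```

In this model the eight small pieces collapse into one request after seven merges, so a merge count close to the request count would not by itself mean that unrelated random I/Os are being combined.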

We are speculating there is a problem with the IO within the kernel such
that the sorting algorithm on the queue causes the average wait time to
increase.  This in turn builds up the queue, and as the sort algorithm
attempts to sort a larger queue it causes the wait time to escalate.  This
snowballs.

It may make sense to sort on a sd device with a single head so that the head
doesn't have to move back and forth on the disk.  But on a SAN, where there
are a number of spindles and most requests are cache hits, it seems that
sorting may actually slow performance.

Since we don't see the problem on ocfs with SCSI yet we see the problem when
attaching to SAN, it further implies that it is the sort algorithm not being
applicable to a SAN device.  

Our question: does it seem plausible that the sort algorithm slows down
the transactions? 

Comment 11 Heather Conway 2003-10-10 16:57:12 UTC
Any update/suggestions?

Comment 12 Terje Malmedal 2003-12-08 10:52:15 UTC
Has this issue been resolved? Any workarounds?

I am currently having the same problem. Hardware is
3x Dell 2650, each with 2xqla2300. SAN is an EMC CLARiiON.

Software is 2.1AS. Oracle 9RAC, and OCFS. Tried a bunch of kernels
and qla-drivers. Also with and without powerpath.

I have a staging setup with same software on 2x Dell 360 qla2300 and
a cheap Infotrend Fibre ATA SCSI RAID system.

This had the same problems until I installed driver version
6.06.10 from qlogic. The same driver does not help the other
setup. 





Comment 16 Ulf Zimmermann 2004-01-14 01:46:55 UTC
I am wondering if we have run into the same problem. We are using HP DL580
g2 with HP MSA1000 as storage array. Using iostat we do not see much
io happening (it's not reporting anything close to 100%) while
utilities such as HP OV Performance Agents together with Glance Plus
report 100% busy. We are currently using a single path on two nodes with a
QLA2312 HBA, using the included driver in e.27 kernel (6.04.01) and I
see that Qlogic has a version for HP posted (6.06.50).

I am still trying to determine what utils I can trust with regard to
ad-hoc stats (such as iostat/glance plus) and long term statistics
(e.g. looking for something based on rrdtool which collects data once a
minute).


Comment 17 Heather Conway 2004-03-26 16:25:12 UTC
1.  Could you please provide a clearer explanation of the impact 
of this performance issue?  Does the performance problem that you are 
acknowledging affect only external FC storage through HBAs as opposed 
to internal drivers for other types of I/O controllers?
2.  Some customers may not be able to requalify and redeploy their 
applications to a new OS environment.  Is Red Hat working with other 
vendors such as SAP, PeopleSoft, to have those vendors' applications 
requalified/recertified on RHEL 3.0?  
3.  Does Red Hat have any type of "official" statement that is ready 
for distribution to the general public on this topic?  If so, could 
you please share it? 

Comment 19 Adam "mantis" Manthei 2004-05-21 16:25:55 UTC
I'm still looking into this issue, although I have been rather
distracted by other issues while helping get GFS pushed out.  My
testing so far has not indicated a performance problem, so I will have
to rethink my approach to the problem.  

The test script that is attached to this bug concerns me in that it is
looking at the numbers in /proc/partitions (i.e. iostat -x).  It seems
to me that a more accurate approach to analyzing the performance
problem is to look at the time that it takes the writes to go through the
filesystem using O_SYNC. 
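
A minimal sketch of that kind of measurement, assuming a scratch file on the filesystem under test: it times individual O_SYNC writes at random block-aligned offsets. The function name, block size, and counts are hypothetical choices, not part of the attached script.

```python
import os
import random
import time


def time_sync_writes(path, file_size, block_size=4096, count=100, seed=0):
    """Write `count` blocks at random aligned offsets using O_SYNC.

    Returns the per-write latencies in milliseconds.
    """
    rng = random.Random(seed)
    block = b"\0" * block_size
    slots = max(1, file_size // block_size)
    latencies = []
    # With O_SYNC each write returns only after the data has been
    # handed to stable storage, so the elapsed time covers the full stack.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        for _ in range(count):
            offset = rng.randrange(slots) * block_size
            start = time.perf_counter()
            os.pwrite(fd, block, offset)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies
```

Calling time_sync_writes() on a file that lives on the OCFS volume, and again on a file on local SCSI disks, would give per-write latencies that can be compared directly instead of relying on the counters in /proc/partitions. Interleaving reads from a second process would be needed to reproduce the mixed workload described in this bug.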

(Note to Heather:  You mentioned to me that you had a version of the
script that wasn't corrupted.  Could you please repost that to this bug?)

Another concern that I have is the driver version that is being used.
   Is the performance problem observed when using the qlogic driver
that ships with RHEL 3 or RHEL 2.1?  Or is it observed only with
the EMC approved qlogic driver?

Comment 20 Greg Nuss 2004-05-21 18:39:04 UTC
Point of information: Heather indicated in today's call that the issue
was driver-independent. Problem was observed with 6.04.01 and the 6.05
drivers.

Comment 21 Adam "mantis" Manthei 2004-09-09 18:30:08 UTC
The recent patches in comment #152 bug #121434 might help this bug,
although they probably won't provide the complete fix to the problem.
 It would be interesting to see if this provided any performance increase.

Comment 22 Heather Conway 2006-01-04 14:57:23 UTC
I'm changing the status of this bugzilla.  At this point in time, the 
recommendation for customers with this issue is to migrate to RHEL 3.0 or RHEL 
4.0.
Thanks.
Heather

