Description of problem:
Sue Denham requested that I open a bugzilla entry for this performance issue.
Basically, the issue appears when performing a random mix of reads and writes
to the same LUN: the request service times against that device creep up. When
performing mostly reads or mostly writes to the same LUN, this behavior is not
seen. Based on the analyses so far, the cause may be a spinlock issue in the
kernel, a QLogic driver issue, or an OCFS issue.
Dell, QLogic, and Oracle are working on this as well.
The configuration used by Dell is as follows:
RHEL 2.1 AS v2.4.9-e.24
both v6.04.01 and v6.05.00 drivers have been used
2 x 6650s with 4GB memory
PowerPath v3.0.3 b065
Oracle9iRAC & OCFS
CX600 Release 2.02.1.x
When the v6.05.00 driver was used, the following parameters were modified:
execution throttle = 256
In the QLogic driver, the highmem_io parameter doesn't compile into the
v6.05.00/v2.4.9-e.25 driver/OS combo so this was left at the default.
sg_segments (128) and max_sectors (1024) were increased in the v6.05.00 driver.
Please note that this is with OCFS, not raw devices.
There are 4 paths to the devices and the devices are split between the 2 SPs.
There are also two switches.
PowerPath is set up so that the CLARiiON Optimize policy is being used so that
there is load balancing across the SPs. The load balancing is operating.
VisualSAN is also being used and shows the traffic across the switches.
On all reads or all writes, the performance has been okay, but when trying to
do both reads and writes, the performance suffers. The service wait times go
way up from the Linux point of view, according to iostat.
With the same system set up, they tried v2.4.9-e.25 and the v6.05.00 driver,
but it made no difference.
Dell has been using iostat, which is reporting a wait time of 80ms and a
service time of 25ms on both of the data LUNs with a mixed workload of reads
and writes. In monitoring /proc/scsi/qla2300/1, the total number of active and
queued commands is usually low (0 or 2).
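The /proc entry above can be sampled with a small wrapper while the workload runs. A rough sketch: the exact counter names vary between qla2x00 driver versions, so the grep pattern here is an assumption to be adjusted against your own /proc output.

```shell
# show_qla_counters: print queue-related counters from a QLogic proc file.
# Default path assumes HBA instance 1 of the qla2300 driver.
show_qla_counters() {
    f=${1:-/proc/scsi/qla2300/1}
    if [ -r "$f" ]; then
        # counter names differ across driver versions; widen as needed
        grep -iE 'queue|pending|outstanding|active' "$f"
    else
        echo "cannot read $f" >&2
        return 1
    fi
}

# sample every 5 seconds while the workload runs, e.g.:
#   while sleep 5; do show_qla_counters; done
show_qla_counters 2>/dev/null || true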
8/15/03 4:59:26 PM Heather Conway:
Feedback from QLogic:
The device_queue_depth is set to 64, but from the above data (initial entry in
the OPT) it looks like the devices are starving. The processes running are not
generating enough I/Os to hit the device queue_depth of 64.
The driver is just a pass-through: it takes the I/O, sends it out on the wire,
and then posts the completion to the OS when it's done. If it's not receiving
enough I/Os to keep all the devices busy, then the performance is going to
suffer. So my suggestion would be to take whatever application is being used
to generate the I/Os and increase the number of processes spawned to
something higher, like 100 or 128. In other words, most of the time the
"pending request" count in /proc should be somewhat close to the
max_queue_depth.
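QLogic's suggestion above (spawn on the order of 100 concurrent I/O processes so the per-device queue actually fills) can be sketched with plain dd workers. The target path and sizes here are illustrative assumptions; against a real LUN you would point TARGET at a test device, never a live one, and raise NPROCS toward 100-128.

```shell
# Spawn NPROCS concurrent writers, each at its own offset, so
# requests overlap at the device instead of arriving one at a time.
TARGET=${1:-/tmp/ioload.dat}
NPROCS=${2:-8}       # QLogic suggested 100-128 against a real LUN
BLOCKS=${3:-64}      # 4k blocks written per worker

i=0
while [ "$i" -lt "$NPROCS" ]; do
    # conv=notrunc keeps workers from truncating each other's data
    dd if=/dev/zero of="$TARGET" bs=4k count="$BLOCKS" \
       seek=$((i * BLOCKS)) conv=notrunc 2>/dev/null &
    i=$((i + 1))
done
wait
```

While this runs, the "pending request" count in the driver's /proc entry should climb toward the queue depth if the driver really is being fed enough work.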
The wait time is from the "iostat" perspective, so it's difficult to say where
in the stack the time is being spent. As a matter of fact, the locking in the
v2.4.x kernel's SCSI subsystem is not very granular: the io_request_lock is
shared by many layers of the stack. One experiment worth trying is to disable
the tasklet in the driver and see if that makes any difference. You can
disable it by turning off the following flag in the qla_settings.h file:
#define QLA2X_PERFORMANCE 0 /* By default this is enabled */
8/15/03 5:01:14 PM Heather Conway:
QLogic also made the recommendation that Dell modify their driver further:
Try disabling the "use_clustering" support in the qla2x00.h file and also
increasing max_sectors from 1024 to 8192.
I do not know yet if Dell has made these changes and, if they have, what the
results were.
8/15/03 5:19:50 PM Heather Conway:
Per Bob at EMC:
According to Dell, they changed their Oracle testing by putting the Oracle DB
files on SCSI disks directly attached to the server, as opposed to going
through QLogic to get at the CLARiiON LUNs. In the process of doing so, iostat
showed 3 times the activity, and yet a much lower average I/O service time.
From that, they concluded that in theory it is NOT the Linux generic SCSI I/O
layer that is the culprit (the app is still doing the same I/O, and the OS is
still going through the file system code, etc.). They are claiming that it has
to be in QLogic.
From what I gathered from QLogic's response, their claim is also that the
device driver layer is not being fed enough I/O to pass down. So, they are
also maintaining that the long wait and service delay is NOT in their layer,
but in some other layer above them.
Bob ran his own testing and I have enclosed the output from his testing.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
The issue appears when performing random I/O mix to same LUN. The request
service times against that device will creep up. When performing mostly reads
or mostly writes to the same LUN, this behavior is not seen.
Sue should know that we close bugs that involve binary only modules that can't
be reproduced without those modules being involved. So the question is: can you
reproduce this without binary only kernel modules being involved ?
The problem can still be replicated with or without PowerPath.
Removing PowerPath from the bug subject, then.
I assume EMC has tried using elvtune to counterbalance the AS2.1 default
tuning that favours writes at the cost of reads?
Has this been tried with an earlier RHEL 2.1 kernel with the original QLogic
driver we shipped?
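For reference, a minimal sketch of the elvtune experiment being suggested. The latency values below are illustrative assumptions, not recommendations, and elvtune only exists with the 2.4-era elevator (util-linux on RHEL 2.1); lowering the read latency bound relative to the write bound keeps queued writes from starving reads.

```shell
# Rebalance the 2.4 elevator so reads are not starved by writes.
# Values are illustrative; requires root and a 2.4 kernel.
tune_elevator() {
    dev=$1
    if command -v elvtune >/dev/null 2>&1; then
        elvtune "$dev"                  # show current read/write latencies
        elvtune -r 1024 -w 2048 "$dev"  # tighten both, favouring reads
    else
        echo "elvtune not available (2.4-era util-linux only)"
    fi
}

tune_elevator "${1:-/dev/sda}"
```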
Mo was going to explore the performance on e.16. Mo, any word on this?
This testing was performed at Dell and at EMC in California. It was done with
RHEL 2.1 v2.4.9-e.24 and v2.4.9-e.25 with the default v6.04.01 driver as well
as the EMC approved v6.05.00 driver. RHEL v2.4.9-e.12 was also used, but with
the v6.05.00 driver. The test was run with Emulex LP9802 HBAs and the v1.22e
driver as well.
The Qlogic drivers were used in their default configurations as well as
modified in an attempt to enhance performance.
I do not know if elvtune was used.
Both the CLARiiON array and the QLogic driver reported that I/O was trickling
down and there was no queue build-up whatsoever.
Apparently, the performance problem is more readily seen with a heavy mix of
random I/O against multiple LUNs, not with high random I/O against a single
LUN. With the random I/O mix, the request service time against the devices
builds up, and with the increased service time, the wait queue grows.
One very interesting datapoint would be the version 5 driver as RHEL ships it
(and used to ship as the default).
Also, is it possible to obtain a (relatively small) test program for this so
that we can try it in our own labs?
The v5.31 driver doesn't work properly in a fabric; the host hangs when
attached to a fabric while trying to load the v5.x series driver.
I will check to see if the test program may be distributed.
Created attachment 94177 [details]
test_io_response.csh script enclosed
From Dell Engineering (David Mar, Mohammed Kahn, et al)----------
So here are a few more details. Basically this is based upon discussions with
EMC and our tests.
We have a program that is writing to different offsets on an sd device in a
random fashion. We see that the backend storage does not see the queue, and it
seems that the backend SAN is servicing requests as fast as it receives them.
To eliminate the possibility of a bottleneck in the driver, we tried both
QLogic and Emulex. To verify this, we looked in /proc/scsi/scsi; the queue
doesn't seem to be building up there, and there is no queue.
Of interest, however, is that we see the number of merges per second is
similar to the number of requests, implying that a significant portion is
merged. Why? Is it because the blocks being read are smaller than 4k?
We are speculating that there is a problem with the I/O within the kernel such
that the sorting algorithm on the queue causes the average wait time to
increase. This in turn builds up the queue, and as the sort algorithm
attempts to sort a larger queue, it causes the wait time to escalate.
It may make sense to sort on an sd device with a single head so that the head
doesn't have to move back and forth on the disk. But on a SAN, where there are
a number of spindles and most requests are cache hits, sorting may actually
slow performance.
Since we don't see the problem with OCFS on direct-attached SCSI, yet we see
the problem when attaching to the SAN, it further implies that the sort
algorithm is not well suited to a SAN device.
Our question: does it seem plausible that the sort algorithm slows down I/O
against SAN devices?
Has this issue been resolved? Any workarounds?
I am currently having the same problem. Hardware is
3x Dell 2650, each with 2x qla2300. SAN is an EMC CLARiiON.
Software is 2.1AS, Oracle 9iRAC, and OCFS. I tried a bunch of kernels
and qla drivers, also with and without PowerPath.
I have a staging setup with the same software on 2x Dell 360 with qla2300 and
a cheap Infotrend Fibre ATA SCSI RAID system.
This had the same problems until I installed driver version
6.06.10 from QLogic. The same driver does not help the other setup.
I am wondering if we have run into the same problem. We are using HP DL580
G2 with an HP MSA1000 as the storage array. Using iostat we do not see much
I/O happening (it's not reporting anything close to 100%), while
utilities such as HP OV Performance Agents together with Glance Plus
report 100% busy. We are currently using a single path on two nodes with a
QLA2312 HBA, using the driver included in the e.27 kernel (6.04.01), and I
see that QLogic has a version for HP posted (6.06.50).
I am still trying to determine which utilities I can trust with regard to
ad-hoc stats (such as iostat/Glance Plus) and long-term statistics
(e.g., something based on rrdtool that collects data at regular intervals).
1. Could you please provide a clearer explanation of the impact
of this performance issue? Does the performance problem that you are
acknowledging affect only external FC storage through HBAs, as opposed
to internal drives on other types of I/O controllers?
2. Some customers may not be able to requalify and redeploy their
applications to a new OS environment. Is Red Hat working with other
vendors such as SAP, PeopleSoft, to have those vendors' applications
requalified/recertified on RHEL 3.0?
3. Does Red Hat have any type of "official" statement that is ready
for distribution to the general public on this topic? If so, could
you please share it?
I'm still looking into this issue, although I have been rather
distracted by other issues while helping get GFS pushed out. My
testing so far has not indicated a performance problem, so I will have
to rethink my approach to the problem.
The test script that is attached to this bug concerns me in that it is
looking at the numbers in /proc/partitions (i.e. iostat -x). It seems
to me that a more accurate approach to analyzing the performance
problem is to look at the time that it takes writes to go through the
filesystem using O_SYNC.
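A minimal sketch of that approach, assuming GNU dd is available (oflag=sync opens the output file O_SYNC, so each write completes at the device before the next is issued). The path and request count are illustrative; in practice FILE would sit on the OCFS filesystem under test, and elapsed time divided by COUNT approximates per-request service time without going through /proc/partitions.

```shell
# Time O_SYNC writes through the filesystem itself.
FILE=${1:-/tmp/osync_test.dat}
COUNT=${2:-100}

# dd's summary line (stderr) reports bytes copied, elapsed time,
# and throughput for the synchronous writes.
dd if=/dev/zero of="$FILE" bs=4k count="$COUNT" oflag=sync 2>&1 | tail -n 1
```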
(Note to Heather: You mentioned to me that you had a version of the
script that wasn't corrupted. Could you please repost that to this bug?)
Another concern that I have is the driver version that is being used.
Is the performance problem observed when using the qlogic driver
that ships with RHEL 3 or RHEL 2.1? Or is it observed only with
the EMC-approved qlogic driver?
Point of information: Heather indicated in today's call that the issue
was driver-independent. The problem was observed with both the 6.04.01 and
6.05 drivers.
The recent patches in comment #152 bug #121434 might help this bug,
although they probably won't provide the complete fix to the problem.
It would be interesting to see if this provided any performance increase.
I'm changing the status of this bugzilla. At this point in time, the
recommendation for customers with this issue is to migrate to RHEL 3.0 or RHEL