Bug 234297

| Summary: | RHEL4U4 cciss array performance much better than RHEL5 | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Peter Klotz <peter.klotz> |
| Component: | kernel | Assignee: | Tom Coughlan <coughlan> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Priority: | medium |
| Version: | 5.2 | CC: | aakpinar, coughlan, h.plankl, ito.kazuo, jarod, mike.miller, w.moser |
| Hardware: | x86_64 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2012-07-20 22:02:15 UTC |
Description (Peter Klotz, 2007-03-28 10:19:37 UTC)
The machine uses a Smart Array 6i RAID Controller. We can reproduce the I/O problems without VMware Server and its virtual machines. Simple tests with dd show that parallel read/write operations on the host (especially when performed on the RAID5 array) result in poor performance; a parallel reproduction sketch appears at the end of this report.

RAID5 performance (reading 3GB, writing 1GB):

```
[root@chip icorac_disk]# dd if=3GBtest of=/dev/null
6291456+0 records in
6291456+0 records out
3221225472 bytes (3.2 GB) copied, 188.196 seconds, 17.1 MB/s
[root@chip machines]# dd if=/dev/zero of=1GBtest bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 130.694 seconds, 8.2 MB/s
```

The write performance is bad even without a finalizing sync operation.

RAID1 performance (reading 3GB, writing 1GB):

```
[root@chip tmp]# dd if=3GBtest of=/dev/null
6291456+0 records in
6291456+0 records out
3221225472 bytes (3.2 GB) copied, 61.4583 seconds, 52.4 MB/s
[root@chip tmp]# dd if=/dev/zero of=1GBtest bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.47527 seconds, 240 MB/s
```

The drivers for RHEL4 and RHEL5 differ (according to modinfo):

```
RHEL4 ... cciss 2.6.10.RH1
RHEL5 ... cciss 3.6.14-RH1
```

Maybe a change made to this driver explains our performance issue.

---

The cciss driver has received significant updates for RHEL 5.1. If you would, please give the latest RHEL 5.1 beta kernel a try and let us know if the performance problems persist. http://people.redhat.com/dzickus/el5/

---

We had to reinstall RHEL4U4 since it is a production machine. The only machine I have left under RHEL5 has no RAID5 (only RAID1+0) and is therefore not a good test candidate. Nevertheless, I will try to perform a comparison between stock RHEL5 and the updated kernel you supplied.

To show the performance difference I repeated the measurements from Comment #1 under RHEL4U4 on the RAID5 array:

```
[root@chip machines]# time dd if=3GBtest of=/dev/null
6291456+0 records in
6291456+0 records out

real    0m48.374s
user    0m1.299s
sys     0m9.015s
[root@chip machines]# time dd if=/dev/zero of=1GBtest bs=1M count=1024
1024+0 records in
1024+0 records out

real    0m4.725s
user    0m0.002s
sys     0m3.164s
```

Comparison of RAID5 performance:

| | RHEL4U4 | RHEL5 |
| --- | --- | --- |
| Reading 3GB | 48s | 188s |
| Writing 1GB | 5s | 130s |

I am aware that write performance in particular is influenced by caching, but since I used the same hardware this should not have been an issue. It seems that RHEL5 does not use caching at all. (A direct-I/O variant of the dd test, which takes the page cache out of the picture, is sketched at the end of this report.) The disks are U320 SCSI 300GB 10K RPM HDDs, so writing should be much faster than the measured 8.2 MB/s (see Comment #1).

---

I have installed the latest kernel (2.6.18-47PAE) on our production machine, on which we are having the same problems, but this did not make any difference at all. We still have load averages around 1000. The machine is an HP DL380 G3 with 3x74GB disks in RAID5, and it has plenty of RAM for a webserver (8GB). I can provide more information if you want, but I will have to revert the machine back to RHEL4 (or some other distribution, I must say) soon, since this issue makes our website sluggish and unusable.

---

Finally I managed to add a RAID5 array (using the already mentioned 300GB HDDs) to the remaining RHEL5 machine. The results are odd, since even with the RHEL5 stock kernel (2.6.18-8) I obtained very good results.
```
[root@brain vmtest]# uname -a
Linux brain.tilak.ibk 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@brain vmtest]# dd if=3GBtest of=/dev/null
6291456+0 records in
6291456+0 records out
3221225472 bytes (3.2 GB) copied, 17.692 seconds, 182 MB/s
[root@brain vmtest]# dd if=/dev/zero of=1GBtest bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.16115 seconds, 208 MB/s
```

There are two differences between the two machines I used for testing:

* The slow one has 8GB RAM, the fast one only 3GB.
* They run different firmware versions.

The firmware changelog does not mention any HDD performance fixes. Could the difference in RAM cause such a phenomenon? Since the 8GB machine is a production server (and currently running RHEL4), it is not easy to either reduce the amount of RAM or upgrade the firmware.

2.6.18-47 performs more or less the same as 2.6.18-8:

```
[root@brain vmtest]# uname -a
Linux brain.tilak.ibk 2.6.18-47.el5 #1 SMP Tue Sep 11 17:46:21 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@brain vmtest]# dd if=3GBtest of=/dev/null
6291456+0 records in
6291456+0 records out
3221225472 bytes (3.2 GB) copied, 19.4392 seconds, 166 MB/s
[root@brain vmtest]# dd if=/dev/zero of=1GBtest bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.1683 seconds, 150 MB/s
```

Tomorrow we will shut down our production machine and test it with 3GB RAM under RHEL5. This should confirm or rule out the firmware as the origin of our performance issues.

---

We have now updated our production machine from RHEL4U4 to RHEL5.2. The parallel I/O performance remains really poor in comparison to RHEL4U4.

* machine: see Comment #1
* OS: RHEL5.2 x86_64
* VMware: VMware-server-1.0.6-91891
* kernel: 2.6.18-92.el5

---

What controller is being used?

---

It is an HP Smart Array 6i RAID Controller (see Comment #1).

---

This might be the same as, or somewhat related to, Bug 237605, closed because we haven't paid enough attention to it... I would like to suggest re-running the test after changing the value of /sys/block/<device>/queue/nr_requests to its old default, 8192, or something lower, yet still higher than the current default of 128 (a sketch of this change follows at the end of the report).

---

No reply since June '10. Assuming this is resolved. Closing.
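---

For reference, a minimal way to reproduce the parallel read/write case reported above. This is a sketch, not the reporter's exact procedure: the test directory, file names, and sizes are examples.

```bash
#!/bin/bash
# Run one sequential reader and one sequential writer concurrently
# against the same array, matching the "parallel read/write" scenario.
TESTDIR=/mnt/raid5test   # example mount point on the RAID5 array

# Prepare a 3GB file to read back (outside the timed run).
dd if=/dev/zero of=$TESTDIR/3GBtest bs=1M count=3072
sync

# Drop the page cache so the read actually hits the disks
# (available on 2.6.16+ kernels, i.e. RHEL5's 2.6.18, not RHEL4's 2.6.9).
echo 3 > /proc/sys/vm/drop_caches

# Start both transfers at once; dd's own summary reports the throughput.
dd if=$TESTDIR/3GBtest of=/dev/null bs=1M &
dd if=/dev/zero of=$TESTDIR/1GBtest bs=1M count=1024 &
wait
```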
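The caching question raised in the RAID5 comparison can be separated from array throughput by taking the page cache out of the write path. A sketch, assuming the installed coreutils dd supports oflag=direct (worth verifying with `dd --help` on RHEL4/RHEL5 before relying on it):

```bash
# Write 1GB with O_DIRECT, bypassing the page cache entirely.
dd if=/dev/zero of=1GBtest bs=1M count=1024 oflag=direct

# Alternatively, keep the cache but include the final flush in the timing,
# so cached and uncached writes become comparable.
time sh -c 'dd if=/dev/zero of=1GBtest bs=1M count=1024; sync'
```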
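The nr_requests change suggested in the last comment before closing can be applied through sysfs. A sketch: the device name c0d0 is an example (cciss devices appear under /sys/block with the "/" of /dev/cciss/c0d0 replaced by "!"), and the value 1024 is just one point in the suggested range between 128 and 8192.

```bash
# Show the current queue depth (RHEL5 default: 128).
cat '/sys/block/cciss!c0d0/queue/nr_requests'

# Raise it for the test, then re-run the dd measurements from Comment #1.
echo 1024 > '/sys/block/cciss!c0d0/queue/nr_requests'
```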