Description of problem:
An IBM x345 with dual Xeon 2.8 GHz processors is connected to a disk array (IBM EXP400) via a ServeRAID 4Lx Ultra SCSI controller. The disk array has a RAID 5 configuration with two partitions, which are used as physical volumes (PV) in a volume group (VG). This VG is then used for /home.

If I start a long copy (a big file), iowait goes up to 90-100% on all processors and stays there for the rest of the copy operation. The system becomes unresponsive. This behaviour mostly occurs when a Samba client starts a long copy, but I have seen it with NFS too.

If I make the copy to the internal disks, everything is OK and iowait never goes higher than 35%. I have another x345, but with internal disks and RAID 5 with one PV. If I make a copy on that, iowait never goes higher than 30%.

Version-Release number of selected component (if applicable):
kernel-2.4.21-37.0.1.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Start a long copy (a big file)
2. See iowait rise to 90-100% on all processors
3. Get an unresponsive system

Actual results:
Unresponsive system

Expected results:
Responsive system

Additional info:
Copying lots of smaller files is no problem.
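For reference, a rough sketch of the kind of copy and monitoring involved; the file path, size, and mount point below are illustrative, not taken from this report:

# Create a large test file (about 2 GB); bs/count are arbitrary.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=2048

# Copy it onto the ServeRAID-backed /home volume to trigger the long write.
cp /tmp/bigfile /home/bigfile &

# Watch iowait while the copy runs (RHEL 3 top reports an iowait CPU field).
top -b -d 5 | grep -i cpu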
Please provide a sysreport (or at least /var/log/messages). Make sure you have the latest firmware for the ServeRAID 4Lx. I believe this controller has a battery-backed cache, and the cache is not used if the battery is dead, so please check the ServeRAID BIOS utility for any errors. Also, let me know the firmware settings you are using for the RAID 5 (e.g. does it let you set the chunk size?).

I suspect you would not have this problem with a RAID 1. Is it feasible for you to test this theory?

When did this start happening? Was there a version of Linux where you did not have this problem?
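For reference, a rough sketch of how that information could be gathered on RHEL 3. The ipssend utility is IBM's ServeRAID command-line tool and may not be installed here, and the /proc path, adapter number, and ipssend syntax are assumptions:

# Generate the requested sysreport archive.
sysreport

# Look for recent ips/SCSI errors in the system log.
grep -i 'ips\|scsi' /var/log/messages | tail -50

# The ips driver exposes adapter information under /proc (host number may differ).
cat /proc/scsi/ips/0

# If IBM's ipssend utility is installed, dump the adapter configuration,
# including firmware/BIOS levels and battery-backup cache status (assumed syntax).
ipssend getconfig 1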
Created attachment 126077 [details] sysreport
Sysreport provided above.

I am going to install the latest firmware, 7.12.07 instead of 7.10.18. This means weekend work for me :-(

The controller has no battery-backed cache installed. The RAID 5 setting is a stripe unit size of 8 KB, which is supposed to be optimal for file/print servers. If the firmware update makes no difference, should I build a RAID set with a 32 or 64 KB stripe size instead?

It is a production system, so I needed extra disks to test the RAID 1 theory. I created a RAID 1 setup with two 146 GB disks and tried making long copies. The iowait never got over 45% and the system remained responsive. I tested both with and without LVM, but that made no difference.

I am not sure it has ever worked as it should. During the first period we had a lot of network errors, and a small number of them could have been caused by this problem.
Firmware updated and the problem still exists. I also updated to Update 7 and kernel 2.4.21-40.ELsmp, but the problem persists.
Based on your comment #4, the problem seems specific to RAID 5 on the ServeRAID adapter; RAID 1 performs well. I am surprised this board does not have a battery-backed cache, I thought they all did. You might ask IBM whether this would solve the problem. Maybe they can also advise you about the optimal stripe size for your workload.

Beyond that, you can try elvtune:
http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html

and you can try adjusting min/max-readahead (see "Tuning the VM"):
http://www.redhat.com/magazine/001nov04/features/vm/

I have a report that the following values helped in at least one situation:

echo 8192 > /proc/sys/vm/max-readahead
echo 2048 > /proc/sys/vm/min-readahead

I am going to close this, since it appears to be a ServeRAID RAID 5 performance problem. Re-open it if there is more to it than that.
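For completeness, a sketch of how those tunings could be applied on a RHEL 3 / 2.4 kernel. The device name and the elvtune latency values are placeholders rather than recommendations from this report, and the sysctl key names are assumed to mirror the /proc/sys/vm entries:

# Raise the VM readahead limits reported to have helped elsewhere.
echo 8192 > /proc/sys/vm/max-readahead
echo 2048 > /proc/sys/vm/min-readahead

# To persist them across reboots, add the equivalent keys to /etc/sysctl.conf:
#   vm.max-readahead = 8192
#   vm.min-readahead = 2048

# Adjust the 2.4 elevator latencies on the ServeRAID logical drive;
# /dev/sdb and the -r/-w values are illustrative only.
elvtune -r 1024 -w 2048 /dev/sdb

# Print the current elevator settings to confirm the change.
elvtune /dev/sdb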