Bug 183598

Summary: high iowaits with long transfers
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Status: CLOSED CANTFIX
Severity: medium
Priority: medium
Reporter: Johan Dahl <johan.dahl>
Assignee: Tom Coughlan <coughlan>
QA Contact: Brian Brock <bbrock>
CC: johan.dahl, lwoodman, petrides
Doc Type: Bug Fix
Last Closed: 2006-03-20 14:39:13 UTC

Attachments: sysreport

Description Johan Dahl 2006-03-02 07:40:54 UTC
Description of problem:

An IBM x345 with dual Xeon 2.8 GHz processors is connected to a disk array (IBM
EXP400) via a ServeRAID 4Lx Ultra SCSI controller. The disk array has a RAID 5
configuration with two partitions, which are used as physical volumes (PVs) in a
volume group (VG). This VG is then used for /home.
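
For reference, this layout corresponds roughly to the following LVM commands
(the device names and the LV size are illustrative assumptions, not taken from
this system):

      # Assume the two RAID 5 partitions appear as /dev/sdb1 and /dev/sdb2:
      pvcreate /dev/sdb1 /dev/sdb2          # both partitions become PVs
      vgcreate homevg /dev/sdb1 /dev/sdb2   # one VG spanning both PVs
      lvcreate -L 200G -n homelv homevg     # LV size is a placeholder
      mke2fs -j /dev/homevg/homelv          # ext3 filesystem
      mount /dev/homevg/homelv /home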

If I start a long copy (a big file), iowait goes up to 90-100% on all processors
and stays there for the rest of the copy operation. The system becomes
unresponsive. This behaviour mostly occurs when a Samba client starts a long
copy, but I have seen it with NFS too.

If I make the copy to the internal disks, everything is OK and iowait never goes
higher than 35%.

I have another x345, but with internal disks and RAID 5 with one PV. If I make a
copy on that, iowait never goes higher than 30%.


Version-Release number of selected component (if applicable):
kernel-2.4.21-37.0.1.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Start a long copy (a big file)
2. See iowait rise to 90-100% on all processors
3. The system becomes unresponsive
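
One way to reproduce that matches the steps above (the path and file size are
illustrative; any multi-gigabyte write to the RAID 5 volume shows the pattern):

      # Write a big file to the array-backed /home:
      dd if=/dev/zero of=/home/bigfile bs=1M count=4096
      # In another terminal, watch the iowait percentage climb:
      top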
  
Actual results:
Unresponsive system

Expected results:
Responsive system

Additional info:

Comment 1 Johan Dahl 2006-03-02 07:44:24 UTC
Copying lots of smaller files is no problem.

Comment 2 Tom Coughlan 2006-03-03 16:48:46 UTC
Please provide a sysreport (or at least /var/log/messages). 

Make sure you have the latest firmware for the ServeRAID 4Lx.

I believe this controller has a battery-backed cache, and the cache is not used
if the battery is dead. So please check the ServeRAID BIOS utility for any
errors. Also, let me know the firmware settings you are using for the RAID 5
(e.g. does it let you set the chunk size?).

I suspect you would not have this problem with a RAID 1. Is it feasible for you
to test this theory? 

When did this start happening? Was there a version of Linux where you did not
have this problem?

Comment 3 Johan Dahl 2006-03-13 23:05:47 UTC
Created attachment 126077
sysreport

Comment 4 Johan Dahl 2006-03-13 23:25:47 UTC
Sysreport provided above

I am going to install the latest firmware, 7.12.07 instead of 7.10.18. This
means weekend work for me :-(

The controller has no battery-backed cache installed. The RAID 5 stripe unit
size is set to 8 KB, which is optimal for file/print servers. If the firmware
update makes no difference, should I construct a RAID set with a 32 or 64 KB
stripe size?

It is a production system, so I needed extra disks to test the RAID 1 theory. I
created a RAID 1 setup with two 146 GB disks and tried making long copies. The
iowait never got over 45% and the system remained responsive. I tested both with
and without LVM, but that made no difference.

I am not sure it has ever worked as it should. During the first period we had a
lot of network errors, but some small number of them could have been this problem.

Comment 5 Johan Dahl 2006-03-20 08:03:34 UTC
Firmware updated and the problem still exists.
I also updated to Update 7 and kernel 2.4.21-40.ELsmp, but the problem still exists.

Comment 6 Tom Coughlan 2006-03-20 14:39:13 UTC
Based on your comment #4, the problem seems specific to RAID 5 on the ServeRAID
adapter. RAID 1 performs well.

I am surprised this board does not have a battery-backed cache. I thought they
all did. You might ask IBM whether this would solve the problem. Maybe they can
also advise you about the optimal stripe size for your workload.

Beyond that, you can try elvtune
http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html
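
For example, lowering the elevator latencies tends to favor small interactive
requests over a long sequential stream. A rough sketch (the device name and the
values are illustrative assumptions, not tested recommendations):

      # Show the current elevator settings for the array's block device:
      elvtune /dev/sda
      # Lower the read/write latency limits so interactive requests get
      # serviced sooner during the long sequential copy:
      elvtune -r 1024 -w 2048 /dev/sda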

You can also try adjusting min/max-readahead:

http://www.redhat.com/magazine/001nov04/features/vm/
See "Tuning the VM"

I have a report that the following values helped in at least one situation:

      echo 8192 > /proc/sys/vm/max-readahead
      echo 2048 > /proc/sys/vm/min-readahead
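
If those values help, they can be made persistent across reboots via
/etc/sysctl.conf (assuming the usual mapping of /proc/sys paths to sysctl
names):

      # /etc/sysctl.conf -- same effect as the echo commands above
      vm.max-readahead = 8192
      vm.min-readahead = 2048

      # Apply without rebooting:
      sysctl -p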

I am going to close this, since it appears to be a ServeRAID RAID 5 performance
problem. Re-open it if there is more to it than that.