Description of problem:
An IBM x345 with dual Xeon 2.8 GHz processors is connected to a disk array (IBM EXP400) via a ServeRAID 4Lx Ultra SCSI controller. The disk array has a RAID 5 configuration with two partitions, which are used as physical volumes (PV) in a volume group (VG). This VG is then used for /home.

If I start a long copy (a big file), iowait goes up to 90-100% on all processors and stays there for the rest of the copy operation. The system becomes unresponsive. This behaviour mostly occurs when a Samba client starts a long copy, but I have seen it with NFS too.

If I make the copy to the internal disks, everything is OK and iowait never goes higher than 35%. I have another x345, but with internal disks and RAID 5 with one PV. If I make a copy on that, iowait never goes higher than 30%.

Version-Release number of selected component (if applicable):
kernel-2.4.21-37.0.1.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Start a long copy (a big file)
2. See iowait rise to 90-100% on all processors
3. Get an unresponsive system

Actual results:
Unresponsive system

Expected results:
Responsive system

Additional info:
Copying lots of smaller files is no problem.
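For reference, a rough sketch of the kind of copy and monitoring involved; the file path, size, and mount point below are illustrative, not taken from this report:

# Create a large test file (about 2 GB); bs/count are arbitrary.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=2048

# Copy it onto the ServeRAID-backed /home volume to trigger the long write.
cp /tmp/bigfile /home/bigfile &

# Watch iowait while the copy runs (RHEL 3 top reports an iowait CPU field).
top -b -d 5 | grep -i cpu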
Please provide a sysreport (or at least /var/log/messages). Make sure you have the latest firmware for the ServeRAID 4Lx. I believe this controller has a battery-backed cache, and the cache is not used if the battery is dead, so please check the ServeRAID BIOS utility for any errors. Also, let me know the firmware settings you are using for the RAID 5 (e.g. does it let you set the chunk size?).

I suspect you would not have this problem with a RAID 1. Is it feasible for you to test this theory?

When did this start happening? Was there a version of Linux where you did not have this problem?
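For reference, a rough sketch of how that information could be gathered on RHEL 3. The ipssend utility is IBM's ServeRAID command-line tool and may not be installed here, and the /proc path, adapter number, and ipssend syntax are assumptions:

# Generate the requested sysreport archive.
sysreport

# Look for recent ips/SCSI errors in the system log.
grep -i 'ips\|scsi' /var/log/messages | tail -50

# The ips driver exposes adapter information under /proc (host number may differ).
cat /proc/scsi/ips/0

# If IBM's ipssend utility is installed, dump the adapter configuration,
# including firmware/BIOS levels and battery-backup cache status (assumed syntax).
ipssend getconfig 1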
Created attachment 126077 [details] sysreport
Sysreport provided above.

I am going to install the latest firmware, 7.12.07 instead of 7.10.18. This means weekend work for me :-(

The controller has no battery-backed cache installed. The RAID 5 setting is a stripe unit size of 8 KB, which is supposed to be optimal for file/print servers. If the firmware update makes no difference, should I build a RAID set with a 32 or 64 KB stripe size instead?

It is a production system, so I needed extra disks to test the RAID 1 theory. I created a RAID 1 setup with two 146 GB disks and tried making long copies. The iowait never got over 45% and the system remained responsive. I tested both with and without LVM, but that made no difference.

I am not sure it has ever worked as it should. During the first period we had a lot of network errors, and a small number of them could have been caused by this problem.
Firmware updated and the problem still exists. I also updated to Update 7 and kernel 2.4.21-40.ELsmp, but the problem persists.
Based on your comment #4, the problem seems specific to RAID 5 on the ServeRAID adapter; RAID 1 performs well. I am surprised this board does not have a battery-backed cache, I thought they all did. You might ask IBM whether this would solve the problem. Maybe they can also advise you about the optimal stripe size for your workload.

Beyond that, you can try elvtune:
http://www.redhat.com/support/wpapers/redhat/ext3/tuning.html

and you can try adjusting min/max-readahead (see "Tuning the VM"):
http://www.redhat.com/magazine/001nov04/features/vm/

I have a report that the following values helped in at least one situation:

echo 8192 > /proc/sys/vm/max-readahead
echo 2048 > /proc/sys/vm/min-readahead

I am going to close this, since it appears to be a ServeRAID RAID 5 performance problem. Re-open it if there is more to it than that.
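For completeness, a sketch of how those tunings could be applied on a RHEL 3 / 2.4 kernel. The device name and the elvtune latency values are placeholders rather than recommendations from this report, and the sysctl key names are assumed to mirror the /proc/sys/vm entries:

# Raise the VM readahead limits reported to have helped elsewhere.
echo 8192 > /proc/sys/vm/max-readahead
echo 2048 > /proc/sys/vm/min-readahead

# To persist them across reboots, add the equivalent keys to /etc/sysctl.conf:
#   vm.max-readahead = 8192
#   vm.min-readahead = 2048

# Adjust the 2.4 elevator latencies on the ServeRAID logical drive;
# /dev/sdb and the -r/-w values are illustrative only.
elvtune -r 1024 -w 2048 /dev/sdb

# Print the current elevator settings to confirm the change.
elvtune /dev/sdb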