From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a3) Gecko/20040718 Firefox/0.9.2

Description of problem:
Pretty similar to bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434, but I didn't want to pollute that bug any further, and this is happening with a different controller.

While copying files (either big or small) from a computer connected to a network share (Samba, NFS, Netatalk), or from one local partition to another, the iowait figures go through the roof (high 90s) and the whole system becomes unresponsive; the load goes up (as high as 6 or 7, depending on how long the copy takes).

My configuration:
Dell PowerEdge 700
Dell CERC, RAID 5 configuration (4 x 120 GB Seagate drives; 359.8 GB according to fdisk)
Intel P4 3.2 GHz (HyperThreading)
1 GB RAM
Red Hat Enterprise Linux 3.0 ES
kernel-2.4.21-15.0.3.EL

I also installed Fedora Core 2 and 3 to see if the problem occurred with the 2.6.x kernels; it did, and although it was less visible there, the figures for both iowait and system load were still far too high. The module in question is aacraid.

Version-Release number of selected component (if applicable):
aacraid-1.1.4-2302

How reproducible:
Always

Steps to Reproduce:
1. Copy a (large) file from one partition to another, or from a computer connected to the server (Samba, NFS, Netatalk).
2. Open a console and run "top".
3. Watch the iowait stats and the system load go up.

Actual Results:
After a while the system load goes up to extremes (6 to 7), the system becomes totally unresponsive, and you have to wait until the load drops or do a hard reboot.

Expected Results:
The system should not become unresponsive, and iowait shouldn't be in the high 90s.

Additional info:
I'm seeing the same figures as presented in the above-mentioned bug; if needed I can post some stats here. I also tried booting with a non-SMP kernel, which didn't help.
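(For reference, a minimal way to exercise the local-partition copy from the steps above; the mount points, file size and log path below are just placeholders, not taken from my actual setup:)

# create a large test file on one partition (8 GB of zeroes here)
dd if=/dev/zero of=/data1/testfile bs=1M count=8192

# log CPU/iowait once per second in the background while the copy runs
vmstat 1 > /tmp/vmstat-copy.log &
VMSTAT_PID=$!

# copy the file to a second partition and time it
time cp /data1/testfile /data2/testfile

# stop the logger and look at the iowait ("wa") column, if your procps version shows it
kill $VMSTAT_PID
less /tmp/vmstat-copy.log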
Just reconfigured the RAID array to a RAID 1 config (giving me a storage capacity of 240 GB, ext3). In this case the load remains very low (< ~0.5) and the system stays responsive; this is with the latest kernel for ES 3.0 (2.4.21-18.ELsmp). I copied a 4 GB folder from my workstation to the Samba-shared folder (/dev/sda9) on the server and, as stated above, the load remained low. I'm going to reconfigure it back to a RAID 5 config and see if I get the same results as with the RAID 1 config.

iostat results:

[root@heinekenserver root]# iostat -k
Linux 2.4.21-18.ELsmp (heinekenserver.localdomain)    08/19/2004

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.67    0.03    2.50    8.67   84.14

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              37.65       139.00      2549.57     251329    4609821
sda1              0.02         0.05         0.01         97         17
sda2              4.12        19.84        23.32      35881      42164
sda3              5.85        45.15        55.66      81637     100644
sda4              0.00         0.00         0.00          2          0
sda5              0.02         0.08         0.02        141         36
sda6              0.01         0.05         0.00         84          0
sda7              3.63        27.82        48.83      50305      88284
sda8              0.11         0.12         1.11        221       2000
sda9             22.18         0.48      2420.62        865    4376676
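(While a copy is in progress, per-device extended statistics can also be sampled at an interval instead of the single cumulative snapshot above; this is just the stock sysstat iostat, nothing specific to this box:)

# extended per-device stats in kB, a fresh sample every 5 seconds;
# the await and %util columns show how saturated sda gets during the copy
iostat -x -k 5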
Hi Martijn, we've noticed the same problem on a CERC RAID-1 config with RHEL 3 Update 2. We've filed a support ticket with Red Hat; it's ticket 354372. At least one other person on Dell's forums is experiencing a very similar problem - see http://forums.us.dell.com/supportforums/board/message?board.id=pes_hardrive&message.id=1588ssage.id=15850 We've also noticed bug 92129 on bugzilla.redhat.com - a different controller (PERC rather than CERC), but we're wondering whether the excessive spinlock hold times mentioned by one poster in that thread could be related to this problem. Jeff, any thoughts? - Paul
Created attachment 102908
vmstat during problem occurrence

This 'vmstat 1 600' output shows two instances where the bug seems to manifest for us.
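(For anyone else who wants to attach comparable data, the capture is nothing more than the stock procps vmstat run for ten minutes; the output path is only an example:)

# 600 one-second samples (ten minutes) of memory/swap/io/cpu counters
vmstat 1 600 > /tmp/vmstat-$(date +%Y%m%d-%H%M).log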
Hi Paul, the link you provided seems broken :) About the RAID 1 configuration: that actually "solved" the problem for me; I was getting those insanely high iowait and system load figures with a RAID 5 config. I still haven't gotten around to reverting back to RAID 5, which I kind of need. About bug 92129: I did look into that one, but I'm not getting time-outs.
sorry about that, Martijn, I'll try again: http://forums.us.dell.com/supportforums/board/message?board.id=pes_hardrive&message.id=15850 I can definitely confirm that a similar problem happens here on our RAID-1 setup - perhaps it is worse on RAID-5, or perhaps we're seeing two separate problems with similar symptoms?
We are also seeing high iowait figures on a system running Fedora Core 3 with RAID 5 on a 3ware 9500-S. Was there any resolution to this issue? Does it still exist in RHEL 4?
We are seeing it with a CLARiiON RAID 10 array connected to a DL585 (AMD), using a QLogic adapter. It continues to bring down the site, either via a reboot or through system degradation. We're running the latest Linux kernel. Anybody have anything?
I am seeing this problem with a Dell CERC 6Ch RAID controller on RHEL 4. My driver version is 1.1-5[2412].

I have:
Dell PowerEdge 1800
3.0 GHz dual Pentium Xeon, running in 64-bit mode
4 GB RAM
3 x 80 GB SATA drives connected to a Dell CERC 6Ch RAID controller in a RAID-5 configuration
LVM is being used to manage the RAID device.
Up to date, with a non-tainted kernel.

Exactly the same trigger: copying files to the disk (in my case from a DVD) results in extremely high iowait. The system becomes almost completely unresponsive until the disk activity stops. The unresponsiveness lasts 5 minutes or so, which is long enough to cause network timeouts, and so is a reliability problem (not just a performance problem).

I am more than happy to look into this, provide debugging information, try new kernels, etc.
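(Not an official request, just a sketch of the sort of information that has been useful in the related aacraid reports; the paths below are only examples:)

# kernel and aacraid driver details
uname -r
modinfo aacraid | grep -i version
dmesg | grep -i aacraid

# capture vmstat once per second in the background while copying from the DVD
vmstat 1 > /tmp/vmstat-dvdcopy.log &
VMSTAT_PID=$!
cp -r /media/cdrom/some_dir /data/some_dir
kill $VMSTAT_PID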
This bug is filed against RHEL 3, which is in its maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.