From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8a3) Gecko/20040718 Firefox/0.9.2

Description of problem:
Pretty similar to bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434, but I didn't want to pollute that bug any further, and this is happening with a different controller.

While copying files (either big or small) from a computer connected to a network share (Samba, NFS, Netatalk), or from one local partition to another, the iowait figures go through the roof (high 90s) and the whole system becomes unresponsive; the load goes up (as high as 6 or 7, depending on how long the copy takes).

My configuration:
Dell PowerEdge 700
Dell CERC, RAID 5 configuration (4 x 120 GB Seagate drives; 359.8 GB according to fdisk)
Intel P4 3.2 GHz (HyperThreading)
1 GB RAM
Red Hat Enterprise Linux 3.0 ES
kernel-2.4.21-15.0.3.EL

I also installed Fedora Core 2 and 3 to see if the problem occurred with the 2.6.x kernels; it did, and although it was less visible there, the figures for both iowait and system load were still far too high. The module in question is aacraid.

Version-Release number of selected component (if applicable):
aacraid-1.1.4-2302

How reproducible:
Always

Steps to Reproduce:
1. Copy a (large) file from one partition to another, or from a computer connected to the server (Samba, NFS, Netatalk).
2. Open a console and run "top".
3. Watch the iowait stats and the system load go up.

Actual Results:
After a while the system load goes up to extremes (6 to 7), the system becomes totally unresponsive, and you have to wait until the load drops or do a hard reboot.

Expected Results:
The system should not become unresponsive, and iowait shouldn't be in the high 90s.

Additional info:
I'm seeing the same figures as presented in the above-mentioned bug; if needed I can post some stats here. I also tried booting with a non-SMP kernel, which didn't help.
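(For reference, a minimal way to exercise the local-partition copy from the steps above; the mount points, file size and log path below are just placeholders, not taken from my actual setup:)

# create a large test file on one partition (8 GB of zeroes here)
dd if=/dev/zero of=/data1/testfile bs=1M count=8192

# log CPU/iowait once per second in the background while the copy runs
vmstat 1 > /tmp/vmstat-copy.log &
VMSTAT_PID=$!

# copy the file to a second partition and time it
time cp /data1/testfile /data2/testfile

# stop the logger and look at the iowait ("wa") column, if your procps version shows it
kill $VMSTAT_PID
less /tmp/vmstat-copy.log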
Just reconfigured the RAID array to a RAID 1 config (giving me a storage capacity of 240 GB, ext3). In this case the load remains very low (< ~0.5) and the system stays responsive; this is with the latest kernel for ES 3.0 (2.4.21-18.ELsmp). I copied a 4 GB folder from my workstation to the Samba-shared folder (/dev/sda9) on the server and, as stated above, the load remained low. I'm going to reconfigure it back to a RAID 5 config and see if I get the same results as with the RAID 1 config.

iostat results:

[root@heinekenserver root]# iostat -k
Linux 2.4.21-18.ELsmp (heinekenserver.localdomain)    08/19/2004

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.67    0.03    2.50    8.67   84.14

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              37.65       139.00      2549.57     251329    4609821
sda1              0.02         0.05         0.01         97         17
sda2              4.12        19.84        23.32      35881      42164
sda3              5.85        45.15        55.66      81637     100644
sda4              0.00         0.00         0.00          2          0
sda5              0.02         0.08         0.02        141         36
sda6              0.01         0.05         0.00         84          0
sda7              3.63        27.82        48.83      50305      88284
sda8              0.11         0.12         1.11        221       2000
sda9             22.18         0.48      2420.62        865    4376676
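(While a copy is in progress, per-device extended statistics can also be sampled at an interval instead of the single cumulative snapshot above; this is just the stock sysstat iostat, nothing specific to this box:)

# extended per-device stats in kB, a fresh sample every 5 seconds;
# the await and %util columns show how saturated sda gets during the copy
iostat -x -k 5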
Hi Martijn, we've noticed the same problem on a CERC RAID-1 config with RHEL 3 Update 2. We've filed a support ticket with Red Hat; it's ticket 354372. At least one other person on Dell's forums is experiencing a very similar problem - see http://forums.us.dell.com/supportforums/board/message?board.id=pes_hardrive&message.id=1588ssage.id=15850 We've also noticed bug 92129 on bugzilla.redhat.com - a different controller (PERC rather than CERC), but we're wondering whether the excessive spinlock hold times mentioned by one poster in that thread could be related to this problem. Jeff, any thoughts? - Paul
Created attachment 102908
vmstat during problem occurrence

This 'vmstat 1 600' output shows two instances where the bug seems to manifest for us.
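(For anyone else who wants to attach comparable data, the capture is nothing more than the stock procps vmstat run for ten minutes; the output path is only an example:)

# 600 one-second samples (ten minutes) of memory/swap/io/cpu counters
vmstat 1 600 > /tmp/vmstat-$(date +%Y%m%d-%H%M).log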
Hi Paul, the link you provided seems broken :) About the RAID 1 configuration: that actually "solved" the problem for me; I was getting those insanely high iowait and system load figures with a RAID 5 config. I still haven't gotten around to reverting back to RAID 5, which I kind of need. About bug 92129: I did look into that one, but I'm not getting time-outs.
sorry about that, Martijn, I'll try again: http://forums.us.dell.com/supportforums/board/message?board.id=pes_hardrive&message.id=15850 I can definitely confirm that a similar problem happens here on our RAID-1 setup - perhaps it is worse on RAID-5, or perhaps we're seeing two separate problems with similar symptoms?
We are also seeing high iowait figures on a system running Fedora Core 3 with RAID 5 on a 3ware 9500-S. Was there any resolution to this issue? Does it still exist in RHEL 4?
We are seeing it with a CLARiiON RAID 10 array connected to a DL585 (AMD), using a QLogic adapter. It continues to bring down the site, either via a reboot or through system degradation. We're running the latest Linux kernel. Anybody have anything?
I am seeing this problem with a Dell CERC 6Ch RAID controller on RHEL 4. My driver version is 1.1-5[2412].

I have:
Dell PowerEdge 1800
3.0 GHz dual Pentium Xeon, running in 64-bit mode
4 GB RAM
3 x 80 GB SATA drives connected to a Dell CERC 6Ch RAID controller in a RAID-5 configuration
LVM is being used to manage the RAID device.
Up to date, with a non-tainted kernel.

Exactly the same trigger: copying files to the disk (in my case from a DVD) results in extremely high iowait. The system becomes almost completely unresponsive until the disk activity stops. The unresponsiveness lasts 5 minutes or so, which is long enough to cause network timeouts, and so is a reliability problem (not just a performance problem).

I am more than happy to look into this, provide debugging information, try new kernels, etc.
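(Not an official request, just a sketch of the sort of information that has been useful in the related aacraid reports; the paths below are only examples:)

# kernel and aacraid driver details
uname -r
modinfo aacraid | grep -i version
dmesg | grep -i aacraid

# capture vmstat once per second in the background while copying from the DVD
vmstat 1 > /tmp/vmstat-dvdcopy.log &
VMSTAT_PID=$!
cp -r /media/cdrom/some_dir /data/some_dir
kill $VMSTAT_PID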
This bug is filed against RHEL 3, which is in its maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.