Red Hat Bugzilla – Bug 82231
SCSI I/O hangs system with NFS writes, nfsd status = DW
Last modified: 2007-04-18 12:50:09 EDT
I have a Compaq DL380 G3 with a Compaq Ultra 3 SCSI controller connected to a
RAID array with multiple LUNs.
When writing to the RAID array via NFS, RH8.0 hangs. The nfsd processes show
state DW and the load average is very high.
The same hardware is usable under RH7.1, although with slower I/O.
OK, I've had more time to look at this problem. It is not an NFS problem; NFS
just uses the disks via the aic7xxx driver. It appears that kupdated and bdflush
deadlock. I can recreate the SCSI hang at any time by causing heavy I/O through
this SCSI controller. For example, if I start more than one fsck on the disks on
that controller, they hang one by one.
If I start 8 fsck processes on the 8 disks (514 GB each), the system hangs in a
matter of minutes. The fsck processes hang one by one as kupdated and bdflush go
to DW and then SW. After a while all the fsck processes, both kupdated and
bdflush, and sometimes one or more kjournald threads are hung.
I have captured the output of ps oxw pid,command,wchan, but it is on the system,
which is hung, and I'm away from work right now. The system responds to pings
but does not allow ssh into it. Usually you can log into the system and do
almost everything that does not need this SCSI adapter.
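For anyone wanting to collect the same information as the ps capture mentioned above, here is a minimal sketch in modern Python 3. It is not the command used in this report. It assumes a Linux /proc that exposes a per-process wchan file, which newer kernels provide and a 2.4 kernel may not; the function names are made up for illustration.

```python
import os

def parse_stat_line(line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')'.
    left, right = line.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid), comm, state

def read_text(path):
    """Read a small /proc file, returning '' if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""

def blocked_processes(proc_root="/proc"):
    """List (pid, command, wchan) for processes in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat:
            continue  # process exited while we were scanning
        pid, comm, state = parse_stat_line(stat)
        if state == "D":
            # Assumption: a wchan file exists; fall back to "-" if it does not.
            wchan = read_text(os.path.join(proc_root, entry, "wchan")) or "-"
            found.append((pid, comm, wchan))
    return found

# Example use on a live system:
#   for pid, comm, wchan in blocked_processes():
#       print(pid, comm, wchan)
```

Run in a loop with the output written to a disk that is not behind the affected controller, this would leave a record of which processes entered D state first.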
I have noted that reducing the amount of memory from 1.5 GB to 512 MB appears to
allow the fsck to run longer before hanging.
I can fsck all 8 disks serially but that takes over 8 hours. Doing 2 at a time
usually causes the hang in the first 30 to 40 minutes.
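To compare the serial and parallel cases in a repeatable way, something like the following sketch could drive the checks with a chosen level of parallelism and record how long each one takes. It is only an illustration in modern Python 3, not what was run here; the device names in the usage comment are hypothetical, and a command stuck in uninterruptible sleep will simply never return.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def run_timed(cmd):
    """Run one command and return (cmd, exit code, elapsed seconds)."""
    start = time.monotonic()
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return cmd, result.returncode, time.monotonic() - start

def run_batch(commands, parallel):
    """Run the commands with at most `parallel` of them active at once."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # map() keeps the results in the same order as the input commands.
        return list(pool.map(run_timed, commands))

# Example use (device names are placeholders, -n asks fsck not to modify anything):
#   devices = ["/dev/sdb1", "/dev/sdc1"]
#   for cmd, rc, secs in run_batch([["fsck", "-n", d] for d in devices], parallel=2):
#       print(cmd[-1], rc, round(secs))
```

Setting parallel=1 gives the serial case, and raising it step by step shows at which level the hang starts to appear.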
More info tomorrow.
After replacing the external RAID controller I did not see any change in status.
The system still hung.
I tried getting the 6.2.28 aic7xxx driver, but it does not load with 2.4.18-19.8.0*.
Thinking the 6.2.8 driver is bad, I dropped back to the 2.4.18-14 kernel, and the
6.2.28 aic7xxx driver loaded.
So the 2.4.18-14* kernel and the 6.2.28 aic7xxx driver allowed all 8 fsck
processes to run in parallel without a hang.
Problem solved? Well yes. BUT...
But write rates to the external RAID are very strange. Eight arrays are defined:
SCSI ID 2, LUN 0 through LUN 7.
When writing a 1.1 GB file, most writes complete in 30 seconds to a minute,
but for 2 of the "disks" the times are 7 and 13 minutes respectively. And it
appears to change over time.
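A per-LUN write timing like the one described above could be measured with a sketch along these lines. It is written in modern Python 3 and is not the method used for the numbers in this report; the mount points in the usage comment are hypothetical. The fsync call is there so the time includes flushing to the device rather than only filling the page cache.

```python
import os
import time

def timed_write(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes of zeros to path, fsync, and return elapsed seconds."""
    chunk = b"\0" * chunk_bytes
    remaining = total_bytes
    start = time.monotonic()
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # wait until the data has reached the device
    return time.monotonic() - start

# Example use (mount points are placeholders for the eight LUNs):
#   size = 1100 * 1024 * 1024
#   for lun in range(8):
#       secs = timed_write("/mnt/lun%d/testfile" % lun, size)
#       print(lun, round(secs), "s")
```

Repeating the loop several times would show whether the slow LUNs stay the same or move around over time.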
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.
The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases,
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/