Bug 82231 - SCSI I/O hangs system with NFS writes, nfsd status = DW
Summary: SCSI I/O hangs system with NFS writes, nfsd status = DW
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 8.0
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2003-01-20 05:56 UTC by Ethan Vanmatre
Modified: 2007-04-18 16:50 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:40:25 UTC
Embargoed:



Description Ethan Vanmatre 2003-01-20 05:56:44 UTC
I have a Compaq DL380 G3 with a Compaq Ultra3 SCSI controller connected to a
RAID array with multiple LUNs.

When writing to the RAID array via NFS, RH 8.0 hangs. The nfsd processes show
status DW, and the load average is very high.

The same hardware is usable under RH 7.1, although with slower I/O.

Comment 1 Ethan Vanmatre 2003-01-23 06:38:37 UTC
OK, I've had more time to look at this problem. It is not an NFS problem; NFS
just uses the disks via the aic7xxx driver. It appears that kupdated and bdflush
deadlock. I can recreate the SCSI hang at any time by causing heavy I/O through
this SCSI controller. For example, if I start more than one fsck on the disks on
that controller, they hang one by one.

For example, I can start 8 fscks on the 8 disks (514 GB each) and see the
system hang in a matter of minutes. The fsck processes hang one by one as
kupdated and bdflush show DW, then SW. After a while all the fscks, both
kupdated and bdflush, and sometimes one or more kjournald threads are hung.

I have captured the output of ps xwo pid,command,wchan, but it is on the system
that is hung, and I'm away from work right now. The system responds to pings but
does not allow ssh into it. Usually you can log into the system and do almost
everything that does not need this SCSI adapter.
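
For anyone collecting the same information, here is a minimal sketch of one way to list processes in uninterruptible sleep (a state starting with D, as with the DW processes above) together with their wait channel. It assumes a procps-style ps that accepts -eo pid,stat,wchan,args; the helper names parse_blocked and list_blocked are made up for this example and are not part of any tool mentioned in this report.

```python
import subprocess


def parse_blocked(ps_output):
    """Return (pid, stat, wchan, command) tuples for rows whose STAT starts with 'D'.

    Expects text in the layout produced by `ps -eo pid,stat,wchan,args`:
    one header line, then one process per line.
    """
    blocked = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        pid, stat, wchan = parts[0], parts[1], parts[2]
        command = parts[3].strip() if len(parts) > 3 else ""
        if stat.startswith("D"):  # D = uninterruptible sleep, usually I/O
            blocked.append((int(pid), stat, wchan, command))
    return blocked


def list_blocked():
    """Run ps (assumed procps-style) and return the currently blocked processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,wchan,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_blocked(result.stdout)


if __name__ == "__main__":
    for pid, stat, wchan, command in list_blocked():
        print(pid, stat, wchan, command)
```

Running this from a loop that appends to a file on a disk not attached to the affected controller would keep the capture readable after the hang.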

I have noted that reducing the amount of memory from 1.5 GB to 512 MB appears to
allow the fscks to run longer before hanging.

I can fsck all 8 disks serially but that takes over 8 hours. Doing 2 at a time
usually causes the hang in the first 30 to 40 minutes.

More info tomorrow.


Comment 2 Ethan Vanmatre 2003-01-24 18:30:46 UTC
After replacing the external RAID controller I did not see any change in status.
The system still hung.

I tried getting the 6.2.28 aic7xxx driver, but it does not load with 2.4.18-19.8.0*.

Thinking the 6.2.8 driver is bad, I dropped back to the 2.4.18-14 kernel, and the
6.2.28 aic7xxx driver loaded.

So the 2.4.18-14* kernel and the 6.2.28 aic7xxx driver allowed all 8 fscks to
operate in parallel without a hang.

Problem solved? Well yes. BUT...

But write rates to the external RAID are very strange. Eight arrays are defined:

SCSI ID 2, LUN 0 through LUN 7

When writing a 1.1 GB file, most writes complete in 30 seconds to a minute,
but for 2 of the "disks" the times are 7 and 13 minutes respectively. And this
appears to change over time.
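
To make the per-LUN comparison repeatable, a minimal sketch like the following could time the same write against each mounted array. It is only an illustration: the function names time_write and time_all are invented here, the mount points are whatever paths the LUNs happen to be mounted on, and the fsync is included on the assumption that the time of interest is the time until the data actually reaches the array rather than the page cache.

```python
import os
import time


def time_write(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes of zeros to path, fsync, and return elapsed seconds."""
    chunk = b"\0" * chunk_bytes
    remaining = total_bytes
    start = time.monotonic()
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # wait until the data has been pushed to the device
    return time.monotonic() - start


def time_all(mount_points, total_bytes, filename="write_test.bin"):
    """Time the same write on each mount point, one after another."""
    results = {}
    for mount_point in mount_points:
        target = os.path.join(mount_point, filename)
        results[mount_point] = time_write(target, total_bytes)
        os.remove(target)  # clean up the test file
    return results
```

Calling time_all with the eight mount points and a size of about 1.1 GB, several times over a day, would show whether the slow "disks" are always the same two.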

Any thoughts?

Thanks, Ethan

Comment 3 Bugzilla owner 2004-09-30 15:40:25 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


