I have a Compaq DL380 G3 with a Compaq Ultra3 SCSI controller connected to a RAID array with multiple LUNs. When writing to the RAID array via NFS, RH8.0 hangs. The nfsd processes show state DW and the load average is very high. The same hardware is usable under RH7.1, although with slower I/O.
OK, I've had more time to look at this problem. It is not an NFS problem; NFS just uses the disks via the aic7xxx driver. It appears that kupdated and bdflush deadlock. I can recreate the SCSI hang at any time by causing heavy I/O through this SCSI controller. For example, if I start more than one fsck on the disks attached to that controller, they hang one by one. If I start 8 fscks on the 8 disks (514 GB each), the system hangs in a matter of minutes. The fsck processes hang one by one as kupdated and bdflush go to state DW and then SW. After a while all the fscks, both kupdated and bdflush, and sometimes one or more kjournald threads are hung. I have the output of ps (pid, command, wchan) captured, but it is on the system that is hung, and I'm away from work right now. The system responds to pings but does not allow ssh into it. Usually, though, you can log into the system and do almost anything that does not need this SCSI adapter. I have noted that reducing the amount of memory from 1.5 GB to 512 MB appears to let the fscks run longer before hanging. I can fsck all 8 disks serially, but that takes over 8 hours. Doing 2 at a time usually causes the hang within the first 30 to 40 minutes. More info tomorrow.
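For anyone trying to capture the same information before the box becomes unreachable, here is a minimal sketch that lists processes in uninterruptible sleep (state D) by reading /proc/<pid>/stat. It is not what was actually run on this system; the function names `parse_stat_line` and `blocked_processes` are made up for illustration, and it assumes a Linux /proc layout and a Python interpreter on the machine.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so use the first "(" and the last ")".
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start])
    command = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """Return sorted (pid, command, state) tuples for tasks in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                info = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable
        if info[2] == "D":
            blocked.append(info)
    return sorted(blocked)


if __name__ == "__main__":
    for pid, command, state in blocked_processes():
        print(pid, command, state)
```

Running this in a loop and appending the output to a file on a disk that is not behind the affected controller should leave a record of which tasks (fsck, kupdated, bdflush, kjournald) blocked first.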
After replacing the external RAID controller I did not see any change in status; the system still hung. I tried the 6.2.28 aic7xxx driver, but it does not load with 2.4.18-19.8.0*. Thinking the 6.2.8 driver was bad, I dropped back to the 2.4.18-14 kernel, and there the 6.2.28 aic7xxx driver loaded. With the 2.4.18-14* kernel and the 6.2.28 aic7xxx driver, all 8 fscks ran in parallel without a hang. Problem solved? Well, yes, but the write rates to the external RAID are very strange. Eight arrays are defined: SCSI ID 2, LUN 0 through LUN 7. When writing a 1.1 GB file, most writes complete in 30 seconds to a minute, but for 2 of the "disks" the times are 7 and 13 minutes respectively. And it appears to change over time. Any thoughts? Thanks, Ethan
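To make the per-LUN write times comparable between runs, a small timing script may help. The following is only a sketch and not the method used for the numbers above: `timed_write` is a hypothetical helper, the test file name `write_test.bin` is arbitrary, and the directories to test are passed on the command line because the mount points of the eight LUNs are not given in this report.

```python
import os
import sys
import time


def timed_write(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes of zeros to path, fsync, and return elapsed seconds."""
    chunk = b"\0" * chunk_bytes
    remaining = total_bytes
    start = time.monotonic()
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        # Flush and fsync so the time covers data reaching the device,
        # not only the page cache.
        f.flush()
        os.fsync(f.fileno())
    return time.monotonic() - start


if __name__ == "__main__":
    size = 1100 * 1024 * 1024  # roughly the 1.1 GB file mentioned above
    for directory in sys.argv[1:]:
        target = os.path.join(directory, "write_test.bin")
        seconds = timed_write(target, size)
        print("%s: %.1f s" % (directory, seconds))
        os.remove(target)
```

Invoked with one directory per LUN, it prints one line per directory; repeating the run at intervals would show whether the slow "disks" really move around over time.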
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/