Bug 30350

Summary: Slow SCSI access/hangs with upgraded kernel and megaraid module
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.1
Hardware: i386
OS: Linux
Status: CLOSED RAWHIDE
Severity: low
Priority: medium
Reporter: kevin_myer
Assignee: Michael K. Johnson <johnsonm>
QA Contact: Brock Organ <borgan>
Doc Type: Bug Fix
Last Closed: 2001-03-02 19:07:30 UTC

Description kevin_myer 2001-03-02 19:07:27 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.1-0.1.9 i686)


I upgraded my workstation from Red Hat 7.0 to Wolverine on Wednesday.  My
workstation is a Dell PowerEdge 1300, 500 MHz Pentium III with 256 MB of RAM
and 3 U160 9 GB SCSI drives hanging off a Dell PERC 2/SC controller.  When
running Red Hat 7.0, I had compiled the kernel using the 2.4.0-0.99.23 SRPM
and was running flawlessly with that.  After upgrading, and now running
2.4.1-0.9.1, nasty things have happened, namely that the first of my three
drives has failed twice in two days.  This machine originally ran Windows NT
and that drive was marked as failed once while NT was running, so I can't
eliminate hardware failure, but it seems to be more than a statistical
anomaly that upgrading to Wolverine coincided with a failed hard drive.

The two crashes went like this.  The first was on the first reboot after I
installed Wolverine.  I ejected the floppy, the installer exited, and the
machine rebooted.  Everything came up OK, and I logged in.  About two
minutes later, the machine stopped responding to commands and screen after
screen of SCSI errors scrolled by.  I rebooted and the PERC BIOS detected a
failed logical drive, which was the first drive of the array.  I
reconfigured the array the same way as before and rebooted; it fscked my
partitions, fixed several problems, and I was up and running.  That was
yesterday.  I came in this morning and the machine was hung, although still
responding to pings over the network.  I rebooted, the PERC BIOS flagged
the array as degraded, and I had to reconfigure it again and reboot.  Fsck,
fix errors again (although I had to drop to a root prompt this time), and
I'm typing this from the workstation, with seemingly no problems (yet).  I
can think of several things that could be causing the failure, individually
or in combination:

1) Buggy driver
2) Correct driver which accentuates a hardware problem (in much the same
way that some hardware works fine under Windows but burns up when it's
actually used in Linux)
3) Hardware failure

Regarding 1), I note that you're using an updated megaraid driver (1.14g
for 2.4.1-0.9.1 vs. 1.14b for 2.4.0-0.99.23 vs. 1.0.7b for stock 2.4.2), so
maybe bugs were introduced between 1.14b and 1.14g.  Regarding 2), I doubt
it, since these are fairly decent quality drives.  Regarding 3), I would
lean heavily toward this, except for the fact that I didn't start having
problems with Linux until after the upgrade, and now it's died twice in two
days (I was running for about a month with Red Hat 7.0 prior).
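In case it helps with triage, a rough way to confirm which megaraid driver
version the running kernel actually loaded is below; the exact wording of
the kernel log banner and the module path vary between driver releases, so
the grep patterns and the .o path are assumptions rather than exact values.

  # Driver version banner printed by the megaraid module at load time;
  # the message format differs between 1.14b and 1.14g, so grep loosely.
  dmesg | grep -i megaraid
  grep -i megaraid /var/log/messages | tail

  # Confirm the module is loaded and inspect the module file itself
  # (path assumed for a 2.4 modutils setup).
  lsmod | grep megaraid
  modinfo /lib/modules/`uname -r`/kernel/drivers/scsi/megaraid.o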

Perhaps related is bug #18949, and also perhaps related is the fact that
when I untar a big file (like a stock 2.4.2 kernel tree), my machine crawls
and a bunch of processes sit in disk-wait state for a long time.  The
untarring is very choppy and only runs in spurts (probably buffering
something), but because of the disk-wait processes, my load average jumps
to 6 or 7 on an otherwise lightly loaded machine and machine access is
noticeably slow.
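For reference, this is roughly how the stall can be observed next time it
happens; the tarball name is just an example, and the exact ps/vmstat
output columns may differ between procps versions.

  # Kick off a large untar on the affected array (example filename).
  tar xzf linux-2.4.2.tar.gz &

  # Processes stuck in uninterruptible disk wait ("D" state); during a
  # stall several of these pile up and the load average climbs.
  ps axo stat,pid,comm | grep '^D'

  # One-second samples of blocked processes and block I/O throughput.
  vmstat 1 10

  # Load average hits 6-7 on an otherwise idle machine.
  uptime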

I'm not sure if there is anything that can be done at this point by you --
I'm filing this more in the hope that if anyone else has similar problems,
I'll know it's probably a driver issue.  If no one files anything similar,
then it almost has to be a hardware issue.

Firmware on the PERC 2/SC is the latest available from Dell (3.13),
although it's a few years old.  If you can think of anything else for me to
try or test, please let me know.

Kevin

Reproducible: Sometimes
Steps to Reproduce:
1. Power machine on
2. Do some work
3. Seemingly at random, although more common at higher loads, disk access
   will crawl or stop altogether

Expected Results:  Don't know - just want to get something into the bug
database in case others are experiencing weirdness with the megaraid
drivers.

Kernel: 2.4.1-0.9.1
Megaraid driver: 1.14g
Processor: 500 MHz Pentium III
RAM: 256 MB
Hard Drives: 3x 9 GB Quantum Atlas IV, model #QM309100KN-LW, configured in
RAID 0
RAID card: Dell PERC 2/SC

Comment 1 Michael K. Johnson 2001-03-05 17:16:54 UTC
You made the drive work extra hard during the upgrade/install, so it's not
surprising that a failing drive would fail under that extra work.

The slow access problem was caused by a bug in Jens Axboe's patch that
fixed some drivers, like aacraid and i2o, by delivering them small
requests.  We have that fixed in our current sources, and a fixed kernel
should show up in rawhide -- anything 2.4.2-0.1.20 or later will have the
fix.
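To check whether a given machine already has the fixed kernel, comparing
the running and installed versions against 2.4.2-0.1.20 should be enough
(package names assumed to be the usual kernel/kernel-smp split):

  # Kernel currently running.
  uname -r

  # Installed kernel packages; anything 2.4.2-0.1.20 or later from
  # rawhide contains the fix for the small-request regression.
  rpm -q kernel kernel-smp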