From Bugzilla Helper: User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.1-0.1.9 i686)

I upgraded my workstation from Red Hat 7.0 to Wolverine on Wednesday. My workstation is a Dell PowerEdge 1300: a 500 MHz Pentium III with 256 MB of RAM and three U160 9 GB SCSI drives hanging off a Dell PERC 2/SC controller. When running Red Hat 7.0, I had compiled the kernel from the 2.4.0-0.99.23 SRPM and was running flawlessly with it. After upgrading, now running 2.4.1-0.9.1, nasty things have happened: the first of my three drives has failed twice in two days. This machine originally ran Windows NT, and that drive was marked as failed once while NT was running, so I can't eliminate hardware failure, but it seems to be more than a statistical anomaly that upgrading to Wolverine -> failed hard drive.

The two crashes have gone like this. Number one was on the first reboot after I installed Wolverine. I ejected the floppy, the installer exited, and the machine rebooted. Everything came up OK and I logged in. About two minutes later, the machine stopped responding to commands, and screen after screen of SCSI errors scrolled by. I rebooted, and the PERC BIOS detected a failed logical drive: the first drive of the array. I reconfigured the array the same way as before and rebooted; it fscked my partitions and fixed several problems, and I was up and running. That was yesterday. I came in this morning and the machine was hung, although still responding to pings over the network. I rebooted, the PERC BIOS flagged the array as degraded, and I had to reconfigure it again and reboot. Fsck fixed errors again (although I had to drop to a root prompt this time), and I'm typing this from the workstation, with seemingly no problems (yet).
I can think of several things that could be the cause individually, or that could be combining to cause the failure:

1) a buggy driver
2) a correct driver that accentuates a hardware problem (in much the same way that some hardware works fine under Windows but burns up when it's actually used in Linux)
3) hardware failure

Regarding 1), I note that you're using an updated megaraid driver (1.14g for 2.4.1-0.9.1 vs. 1.14b for 2.4.0-0.99.23 vs. 1.0.7b for stock 2.4.2), so maybe bugs were introduced between 1.14b and 1.14g. Regarding 2), I doubt it, since these are fairly decent quality drives. Regarding 3), I would lean heavily towards this, except for the fact that I didn't start having problems with Linux until after the upgrade, and now it's died twice in two days (I had been running for about a month with Red Hat 7.0 prior).

Perhaps related is bug #18949, and also perhaps related is the fact that when I untar a big file (like a stock 2.4.2 kernel), my machine crawls and a bunch of processes sit in disk-wait mode for a long time. The untarring is very choppy and only runs in spurts (probably buffering something), but because of the disk-wait processes, my load average jumps to 6 or 7 on an otherwise lightly loaded machine, and machine access is very noticeably slow.

I'm not sure there's anything that can be done at this point by you; I'm filing this more in the hope that if anyone else has similar problems, I'll know it's probably a driver issue. If no one files anything similar, then it almost has to be a hardware issue. Firmware on the PERC2/SC is the latest available from Dell (3.13), although it's a few years old. If you can think of anything else for me to try or test, please let me know.
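For anyone else trying to confirm the same symptom: the processes stuck in disk wait show up as state D (uninterruptible sleep) in ps, and it's those D-state processes that inflate the load average. A generic check like the following (just a standard diagnostic sketch, nothing specific to the megaraid driver) will show whether you're hitting the same choppy-I/O behavior during a big untar:

```shell
#!/bin/sh
# List processes in uninterruptible disk wait (state D) -- these are
# what push the load average up even though the CPU is mostly idle.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print }'

# Show the current load averages for comparison.
uptime
```

If the untar is running in spurts, repeating the ps line every few seconds should show tar (and anything else touching the array) flipping in and out of state D.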
Kevin

Reproducible: Sometimes

Steps to Reproduce:
1. Power machine on
2. Do some work
3. Seemingly at random, although more commonly at higher loads, disk access will crawl or stop altogether

Expected Results: Don't know; I just want to get something into the bug database in case others are experiencing weirdness with the megaraid drivers.

Kernel: 2.4.1-0.9.1
Megaraid driver: 1.14g
Processor: 500 MHz Pentium III
RAM: 256 MB
Hard Drives: 3x 9 GB Quantum Atlas IV, model #QM309100KN-LW, configured in RAID 0
RAID card: Dell PERC2/SC
You made the drive work extra hard during the upgrade/install, so it's not surprising that a failing drive would fail under that extra work. The slow access problem was caused by a bug in Jens Axboe's patch that fixed some drivers, like aacraid and i2o, by delivering them small requests. We have that fixed in our current sources, and a fixed kernel should show up in rawhide -- anything 2.4.2-0.1.20 or later will have the fix.