From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040803

Description of problem:
The Adaptec 2400A controller running the I2O kernel driver becomes unresponsive and doesn't recover when moving data around on the RAID1 array. Booting and installation went smoothly, but normal operation does not.

Version-Release number of selected component (if applicable):
kernel-2.6.10-1.770_FC3

How reproducible:
Sometimes

Steps to Reproduce:
1. Start 'top' on tty1.
2. Let the system stay running for about an hour.
3. Start copying about 100 MB or so of data around on the RAID drive.
4. Do an 'ls' on the data we just copied.

Actual Results:
Load average shot up immediately to 4.xx and eventually topped out at about 300.x. The system becomes unresponsive and no commands can be run. I notice no process in top that is taking any large amount of system resources.

Expected Results:
Load average should have stayed about the same, perhaps a point higher at most, the system should have remained stable, and the files from 'ls' should have printed out.

Additional info:
The hard drives are fine. If I plug a hard drive into the onboard controller, the system works just fine for days and days. It's really as if the controller is just "lost" somehow after a while... like all hard disk access is cut off after I move around a bunch of files. I don't see anything in any of the logs... which I suppose I wouldn't, because those are on the disk which is "lost". I'll attach my /var/log/dmesg.
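For anyone trying to reproduce this, steps 3 and 4 above could be scripted roughly as follows. This is only a sketch, not what the reporter ran: the function names, the chunk file names, and the use of random data are all assumptions made for illustration, and the one-hour wait from step 2 is left to the operator.

```python
import os
import shutil
import tempfile
import time


def load_average():
    """Return the 1-minute load average, or None where it is unavailable."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


def make_test_data(directory, total_bytes, file_bytes=1024 * 1024):
    """Fill `directory` with random-data files totalling `total_bytes`.

    Returns the number of files written. File names are hypothetical.
    """
    os.makedirs(directory, exist_ok=True)
    written = 0
    index = 0
    while written < total_bytes:
        size = min(file_bytes, total_bytes - written)
        path = os.path.join(directory, f"chunk{index:04d}.bin")
        with open(path, "wb") as handle:
            handle.write(os.urandom(size))
        written += size
        index += 1
    return index


def copy_and_list(src, dst):
    """Copy `src` to `dst` (step 3), then list the copy (step 4).

    `dst` must not exist yet. Returns the listing, the elapsed time,
    and the load average sampled before and after.
    """
    before = load_average()
    start = time.monotonic()
    shutil.copytree(src, dst)
    names = sorted(os.listdir(dst))
    elapsed = time.monotonic() - start
    return {
        "files": names,
        "seconds": elapsed,
        "load_before": before,
        "load_after": load_average(),
    }
```

On the affected machine one would presumably call `make_test_data` with `100 * 1024 * 1024` bytes in a directory on the array, then `copy_and_list` to a second directory on the same array, while 'top' runs on tty1. Whether this triggers the hang is unknown; the report says it is only reproducible sometimes.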
Created attachment 112724
/var/log/dmesg after reboot of machine in question
I did find some I/O errors in the log files after all... the errors started on March 31st. This FC3 system was started on March 16th. Here is the first occurrence of those errors. This happened right after I FTP'd some data from another server and did the 'ls' on the newly copied data.

Mar 31 14:42:38 tellurian kernel: /dev/i2o/hda error: Failure communicating to device<3>.
Mar 31 14:42:38 tellurian kernel: end_request: I/O error, dev i2o/hda, sector 307591262
Mar 31 14:42:38 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216578
Mar 31 14:42:38 tellurian kernel: lost page write due to I/O error on i2o/hda8
Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216579
Mar 31 14:42:41 tellurian kernel: lost page write due to I/O error on i2o/hda8
Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216580
.. repeats with different blocks....

So it made it through that night; it recovered and didn't hit a very high load average. But later that night, or early on April 1, the load average went through the roof again, and after a long string of these errors nothing was written to the logs past 1:42 am. Eventually, at 10 am the next day, we reset the server.

Here is the output of 'df -h':

Filesystem      Size  Used Avail Use% Mounted on
/dev/i2o/hda6   2.9G  208M  2.6G   8% /
/dev/i2o/hda1    99M   11M   83M  12% /boot
none           1014M     0 1014M   0% /dev/shm
/dev/i2o/hda8    70G  272M   67G   1% /home
/dev/i2o/hda7   981M   21M  911M   3% /tmp
/dev/i2o/hda2    49G  1.3G   45G   3% /usr
/dev/i2o/hda5    24G  296M   22G   2% /var
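To see at a glance which partitions are affected in a long run of these messages, the log lines can be tallied per device. The sketch below assumes the two message formats quoted above ("end_request: I/O error, dev ..., sector ..." and "Buffer I/O error on device ..., logical block ..."); the function name `summarize_io_errors` is made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns taken from the log excerpt in this report; assumed, not exhaustive.
BUFFER_ERR = re.compile(r"Buffer I/O error on device ([^\s,]+), logical block (\d+)")
REQUEST_ERR = re.compile(r"end_request: I/O error, dev ([^\s,]+), sector (\d+)")


def summarize_io_errors(lines):
    """Count I/O error lines per device and collect failed logical blocks."""
    counts = Counter()
    blocks = {}
    for line in lines:
        match = BUFFER_ERR.search(line)
        if match:
            device = match.group(1)
            counts[device] += 1
            blocks.setdefault(device, []).append(int(match.group(2)))
            continue
        match = REQUEST_ERR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts, blocks


sample = [
    "Mar 31 14:42:38 tellurian kernel: end_request: I/O error, dev i2o/hda, sector 307591262",
    "Mar 31 14:42:38 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216578",
    "Mar 31 14:42:38 tellurian kernel: lost page write due to I/O error on i2o/hda8",
    "Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216579",
    "Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216580",
]
counts, blocks = summarize_io_errors(sample)
```

Run against the excerpt above, this attributes three buffer errors to i2o/hda8 (the /home partition in the 'df -h' output) and one request error to the whole device i2o/hda. To process a full log, pass an open file object instead of the `sample` list.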
Since I was unable to resolve this, I have created a Linux software RAID from the two drives and removed the 2400A controller. All is well now. I realize this bug could really have been a hardware incompatibility or a faulty controller. We still have the controller and may try it in another machine to see what happens.
Would you consider donating the card to a kernel developer? If you could get it to Markus Lidel in Germany, he might be able to figure out how to make the driver work with that card from now on. Although it probably won't benefit you directly, you would help everyone else in the future. That is assuming your card is not defective, though...