153760 – Disk access to Adaptec 2400A (I2O driver) stops working

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	3
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-04-05 18:06 UTC by Nick Mossie
Modified:	2015-01-04 22:18 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-05-20 09:20:40 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/var/log/dmesg after reboot of machine in question (10.04 KB, text/plain) 2005-04-05 18:08 UTC, Nick Mossie

Description Nick Mossie 2005-04-05 18:06:13 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040803

Description of problem:
The Adaptec 2400A controller, running under the I2O kernel driver, becomes unresponsive and doesn't recover when moving data around on the RAID1 array. Booting and installation went smoothly, but day-to-day operation has not.

Version-Release number of selected component (if applicable):
kernel-2.6.10-1.770_FC3

How reproducible:
Sometimes

Steps to Reproduce:
1. Start 'top' on tty1
2. Let the system stay running for about an hour.
3. Start copying about 100 MB or so of data around on the RAID drive.
4. Run 'ls' on the data we just copied.

Actual Results: Load average shot up immediately to 4.xx and eventually topped out at about 300.x. System becomes unresponsive. No commands can be run. I see no process in top that is taking any large amount of system resources.

Expected Results: Load average should have stayed about the same, perhaps a max of a point higher, system should have remained stable, and the files from 'ls' should have printed out.
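
For anyone trying to reproduce this, the copy-then-list steps above could be scripted roughly as follows. This is only a sketch, not what the reporter ran: the helper names, the chunked file layout, and the temporary directory are all assumptions, and the base directory would need to point at a filesystem on the affected array.

```python
import os
import shutil
import tempfile
import time


def make_test_data(directory, total_bytes=100 * 1024 * 1024, file_size=1024 * 1024):
    """Write files of random data totalling roughly total_bytes into directory."""
    os.makedirs(directory, exist_ok=True)
    count = max(1, total_bytes // file_size)
    for i in range(count):
        with open(os.path.join(directory, f"chunk_{i:04d}.bin"), "wb") as fh:
            fh.write(os.urandom(file_size))
    return count


def copy_and_list(src, dst):
    """Copy src to dst, then list dst (the 'ls' step). Returns (names, seconds)."""
    start = time.monotonic()
    shutil.copytree(src, dst)
    names = sorted(os.listdir(dst))
    return names, time.monotonic() - start


def load_average():
    """Return the 1-minute load average, or None if the platform lacks it."""
    try:
        return os.getloadavg()[0]
    except (OSError, AttributeError):
        return None


if __name__ == "__main__":
    # Hypothetical location: replace with a directory on the RAID array.
    base = tempfile.mkdtemp(prefix="i2o_repro_")
    src = os.path.join(base, "src")
    dst = os.path.join(base, "dst")
    make_test_data(src)
    before = load_average()
    names, elapsed = copy_and_list(src, dst)
    print(f"copied and listed {len(names)} files in {elapsed:.1f}s")
    print(f"load average before: {before}, after: {load_average()}")
```

On a healthy system the copy and listing should finish quickly with little change in load average; a hang or a sharp rise would match the behaviour described here.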

Additional info:

The hard drives are fine. If I plug a hard drive in to the onboard controller, the system works just fine for days and days.

It's really as if the controller is just "lost" somehow after a while... like all hard disk access is cut off after I move around a bunch of files.

I don't see anything in any of the logs... which I suppose I wouldn't because those are on the disk which is "lost".

I'll attach my /var/log/dmesg

Comment 1 Nick Mossie 2005-04-05 18:08:00 UTC

Created attachment 112724
/var/log/dmesg after reboot of machine in question

Comment 2 Nick Mossie 2005-04-06 15:11:41 UTC

I did find some I/O errors in the log files after all... the errors started on March 31st. This FC3 system was started on March 16th.

Here is the first occurrence of those errors. This happened right after I FTP'd some data from another server and did the 'ls' on the newly copied data.

Mar 31 14:42:38 tellurian kernel: /dev/i2o/hda error: Failure communicating to device<3>.
Mar 31 14:42:38 tellurian kernel: end_request: I/O error, dev i2o/hda, sector 307591262
Mar 31 14:42:38 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216578
Mar 31 14:42:38 tellurian kernel: lost page write due to I/O error on i2o/hda8
Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216579
Mar 31 14:42:41 tellurian kernel: lost page write due to I/O error on i2o/hda8
Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216580

.. repeats with different blocks....
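
Since the errors repeat many times with different block numbers, a small script can tally them per device instead of reading the log by eye. The following is a sketch under assumptions: the function name and regular expressions are illustrative, and it expects each kernel message on a single line, as it would appear in the syslog file itself.

```python
import re
from collections import Counter

# Patterns modelled on the two message shapes quoted in this report.
BUFFER_ERR = re.compile(r"Buffer I/O error on device ([^,\s]+), logical block (\d+)")
REQUEST_ERR = re.compile(r"end_request: I/O error, dev ([^,\s]+), sector (\d+)")


def summarize_io_errors(lines):
    """Count buffer-level and request-level I/O errors per device name."""
    buffer_errors = Counter()
    request_errors = Counter()
    for line in lines:
        match = BUFFER_ERR.search(line)
        if match:
            buffer_errors[match.group(1)] += 1
            continue
        match = REQUEST_ERR.search(line)
        if match:
            request_errors[match.group(1)] += 1
    return buffer_errors, request_errors


# Sample lines copied from the excerpt in this comment.
SAMPLE = [
    "Mar 31 14:42:38 tellurian kernel: end_request: I/O error, dev i2o/hda, sector 307591262",
    "Mar 31 14:42:38 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216578",
    "Mar 31 14:42:38 tellurian kernel: lost page write due to I/O error on i2o/hda8",
    "Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216579",
    "Mar 31 14:42:41 tellurian kernel: Buffer I/O error on device i2o/hda8, logical block 8216580",
]

if __name__ == "__main__":
    buffer_errors, request_errors = summarize_io_errors(SAMPLE)
    print("buffer errors:", dict(buffer_errors))
    print("request errors:", dict(request_errors))
```

To run it against a real log, pass an open file object (which iterates line by line) in place of SAMPLE.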

So it made it through that night; it recovered and didn't hit a very high load average. But later that night, or early April 1, the load average went through the roof again, and after a long string of these errors nothing was written to the logs after 1:42am. Eventually, at 10am the next day, we reset the server.

Here is the output of df -h:

Filesystem            Size  Used Avail Use% Mounted on
/dev/i2o/hda6         2.9G  208M  2.6G   8% /
/dev/i2o/hda1          99M   11M   83M  12% /boot
none                 1014M     0 1014M   0% /dev/shm
/dev/i2o/hda8          70G  272M   67G   1% /home
/dev/i2o/hda7         981M   21M  911M   3% /tmp
/dev/i2o/hda2          49G  1.3G   45G   3% /usr
/dev/i2o/hda5          24G  296M   22G   2% /var

Comment 3 Nick Mossie 2005-04-12 14:03:31 UTC

Since I was unable to resolve this, I have created a Linux software RAID of the two drives and removed the 2400A controller. All is well now. I realize this bug could really have been a hardware incompatibility or a faulty controller. We still have the controller and may try it in another machine to see what happens.

Comment 4 Warren Togami 2005-04-13 00:43:38 UTC

Would you consider donating the card to a kernel developer? If you could get it to Markus Lidel in Germany, he might be able to figure out how to make the driver work with that card from now on. Although it probably won't benefit you directly, you would help everyone else in the future.

That is assuming that your card is not defective though...
