Bug 511911

Summary: System hangs with "rejecting I/O to offline device"
Product: Red Hat Enterprise Linux 5 Reporter: Scott Marlowe <scott.marlowe>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: low    
Version: 5.3 CC: mikki
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2014-06-02 13:01:55 UTC Type: ---

Description Scott Marlowe 2009-07-15 15:46:22 UTC
Description of problem:

The system randomly enters a state where the RAID arrays on the Areca 1680 controller go offline. This occurs about once a month. These are hard-working servers (typical load factors of 4 to 12), but the failure can happen at any time under any load.

Version-Release number of selected component (if applicable):
Was running CentOS 5.1 with kernel 2.6.18-92.el5 (SMP) with no problems.
Updated online to 5.3 (kernel 2.6.18-128.1.14.el5) in May, and the problems started. I have had two identical hangs since then. The other server I have (the primary) is still on the older kernel and has had no failures.

How reproducible:
Very, just takes a long time to reproduce.

Steps to Reproduce:
1. Install an Areca 1680 RAID card.  
2. Set up RAID, install to it, and use the above-mentioned kernel.
3. Wait about a month.
  
Actual results:
Error as seen from console:
Ext3-fs error (device SDA1) ext3_get_inode_bc: unable to read inode block inode=3909144 block=3932162
SD 0:0:0:0: rejecting I/O to offline device

Note that these two machines, running the old kernel, have been 100% rock solid stable for about 9 months while being worked pretty hard.  

I checked the bug database, and the closest bug I found was 460789, but that one seemed to produce the problem almost immediately. I have about one more month in which I can run my backup server on a different kernel to see if this is fixed. After that I'll go back to running the old kernel, because it works and I can't afford downtime during a school year.
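
A possible way to notice this condition as soon as it happens, rather than from the ext3 errors on the console: the "rejecting I/O to offline device" message is logged by the SCSI layer once it has marked a device offline, and that state is normally readable from sysfs. The sketch below is not part of the original report. It assumes Python 3 is available, that the disks show up as /sys/block/sd*/device/state, and the function names scsi_disk_states and offline_disks are made up for illustration.

```python
import glob
import os


def scsi_disk_states(sys_block="/sys/block"):
    """Return {disk_name: state} for every sd* disk that exposes a state file.

    sys_block is a parameter only so the function can be pointed at a
    fake directory tree; on a real system the default should apply.
    """
    states = {}
    pattern = os.path.join(sys_block, "sd*", "device", "state")
    for state_path in sorted(glob.glob(pattern)):
        # Path layout is <sys_block>/<disk>/device/state
        disk = state_path.split(os.sep)[-3]
        try:
            with open(state_path) as fh:
                states[disk] = fh.read().strip()
        except OSError:
            # The file exists but could not be read; report that explicitly
            states[disk] = "unknown"
    return states


def offline_disks(states):
    """Return the sorted names of disks whose state is exactly 'offline'."""
    return sorted(disk for disk, state in states.items() if state == "offline")


if __name__ == "__main__":
    current = scsi_disk_states()
    for disk, state in sorted(current.items()):
        print(disk, state)
    bad = offline_disks(current)
    if bad:
        print("offline:", ", ".join(bad))
```

Since the OS volume here sits on the same controller as the database volume, a check like this would probably only be useful from a process that is already running and resident in memory when the array drops out; starting a new script after the fact may not be possible.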

Comment 1 Scott Marlowe 2009-07-15 15:55:54 UTC
P.S. More hardware info, taken from lshw:

TYAN Computer Corporation S2932 mobo
Dual Quad-Core AMD Opteron(tm) Processor 2352
32G 667MHz DDR2 memory
PCI bridge NVidia MCP55 PCI Express bridge
ARC-1680 8 port PCIe/PCI-X to SAS/SATA II RAID Controller (actually shows up as two of these, as I have 16 drives)
Two logical volumes, one for OS / logs one for pgsql database

Comment 2 mikki 2009-12-30 18:41:03 UTC
Please update ARCMSR to a newer driver. The 15RH1 driver is unstable and has long since been replaced by Areca.

I updated to ftp://ftp.areca.com.tw/RaidCards/AP_Drivers/Linux/DRIVER/SourceCode/arcmsr.1.20.0X.15-81103.zip and I have had no hangs.

My hangs occurred once or twice per week, but I have had none yet with the newer driver.

The newer driver should be committed to the kernel tree, and I have also posted this on kernel.org and on centos.org.
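
Before and after swapping drivers, it may help to confirm which arcmsr version is actually loaded. The sketch below is an illustration only, not something from this thread. It assumes Python 3, that the module is loaded, and that the driver exports its version under /sys/module/arcmsr/version; the helper name loaded_module_version is hypothetical, and the exact format of the version string is not assumed.

```python
import os


def loaded_module_version(module="arcmsr", sys_module="/sys/module"):
    """Return the version string of a loaded kernel module, or None.

    None means the module is not loaded, does not export a version,
    or the file could not be read. sys_module is a parameter only so
    the function can be pointed at a fake directory tree.
    """
    path = os.path.join(sys_module, module, "version")
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    version = loaded_module_version()
    if version is None:
        print("arcmsr version not available")
    else:
        print("arcmsr version:", version)
```

Comparing the printed string against the version named in the driver archive is left to the reader, since the string format may differ between driver releases.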

Comment 3 Scott Marlowe 2010-01-08 03:47:13 UTC
So, when will the next kernel with this driver in it be out? Or is it already out? Or is this one of those things that will never be backported to old versions of RHEL?

Should I just download and compile my own driver, or what?

Comment 4 RHEL Program Management 2014-03-07 12:13:24 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.

Comment 5 RHEL Program Management 2014-06-02 13:01:55 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).