Bug 158169

Summary: megaraid driver for x86_64 causes data corruption
Product: Red Hat Enterprise Linux 4 Reporter: Need Real Name <janderdepeich>
Component: kernel Assignee: Tom Coughlan <coughlan>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0 CC: davej
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-01-02 14:01:19 UTC Type: ---
Bug Depends On:    
Bug Blocks: 176344    

Description Need Real Name 2005-05-19 11:12:44 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Fedora/1.0.4-1.3.1 Firefox/1.0.4

Description of problem:
The megaraid driver seems to cause random data corruption. Sometimes the filesystem cannot be used safely for more than a few seconds; sometimes it stays usable for hours.

Since this happens neither in RHEL 3 nor in RHEL 4 on i386, it is most likely a problem in the x86_64 driver.

We use an Intel SRCS16 raid controller, configured as a raid 5 volume (3 physical disks, serial ata). Data corruption tends to manifest sooner when write back policy is enabled on the controller, but it also happens with write through.

Version-Release number of selected component (if applicable):
kernel-2.6.9-5.0.5

How reproducible:
Sometimes

Steps to Reproduce:
1. Start any kind of heavy I/O on the SRCS16 controller for some time (usually 10 minutes are enough)
2. Check the filesystem with fsck
3. There are severe errors on the filesystem
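
For anyone trying to reproduce this, fsck only reports metadata damage. A rough way to also catch silent corruption of file contents is to write files with known checksums and re-read them later. The following is only a sketch under assumptions: the function names, file names, and sizes are made up for illustration, and a re-read shortly after writing may be served from the page cache rather than the disk, so the volume should be remounted (or caches dropped) before the verify pass.

```python
import hashlib
import os
import tempfile


def write_test_files(directory, count=4, size=1 << 20):
    """Write `count` files of `size` random bytes; return {path: sha256 hex}."""
    digests = {}
    for i in range(count):
        path = os.path.join(directory, f"stress_{i:04d}.bin")
        data = os.urandom(size)
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the kernel to push the data to the device
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_test_files(digests):
    """Re-read every file and return the paths whose content no longer matches."""
    bad = []
    for path, expected in digests.items():
        try:
            with open(path, "rb") as fh:
                actual = hashlib.sha256(fh.read()).hexdigest()
        except OSError:
            actual = None  # a vanished or unreadable file also counts as a failure
        if actual != expected:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Point this at a directory on the affected volume; a temporary directory
    # is used here only so that the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as workdir:
        recorded = write_test_files(workdir)
        print("mismatched files:", verify_test_files(recorded))
```

Larger `count` and `size` values, run in a loop, would approximate the heavy I/O described in step 1.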
  

Actual Results:  Sometimes files become multi-terabyte in size, lose their names, or suddenly disappear; sometimes the journal aborts; and at other times dmesg shows that the driver had to reset the controller as a result of repeated failures.

Expected Results:  Data should not be corrupted.

Additional info:

This is a dual-Xeon system with EM64T technology. Tests were done with the SMP kernel. It is not in production _right now_, so I should be able to help with testing for at least a few days.

Comment 2 Need Real Name 2005-07-06 08:51:51 UTC

*** This bug has been marked as a duplicate of 141360 ***

Comment 3 Ernie Petrides 2005-07-21 21:39:04 UTC
Reopening -- please don't dup bugs across different product versions.

Comment 7 Tom Coughlan 2006-06-27 16:33:08 UTC
This problem may be a manifestation of bug 194533. Please test the kernel, or
driver patch, that is posted there if possible. 

Comment 9 Tom Coughlan 2006-06-29 15:54:57 UTC
I have updated the patch, and the test kernel, posted in BZ 194533. Please test. 

Comment 10 Tom Coughlan 2006-06-29 19:53:29 UTC
As you may have seen from the patch, one problem with the current driver is that
it enables 64-bit DMA on some adapter models that do not support it. I would
like to find out if your adapter is one of them. This will indicate whether the
patch may be the right fix. Please provide the output of

lspci -xxx
lspci -n

on a system that exhibits the failure. Also please send /var/log/messages, or
dmesg, that shows the messages when the megaraid driver loads. That will give me
the fw rev, and any other relevant messages. 
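
If it helps to compare several machines, the vendor:device pairs can be pulled out of saved `lspci -n` output with a few lines of Python. This is only a sketch: `parse_lspci_n` is a made-up helper name, the regular expression assumes the usual `slot class: vendor:device` layout (with or without the literal word `Class`), and the sample lines use placeholder device IDs, not values from any real adapter.

```python
import re

# Matches e.g. "02:0e.0 0104: 1000:bbbb (rev 02)" and the older
# "02:0e.0 Class 0104: 1000:bbbb (rev 02)" layout.
LSPCI_N_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?:Class\s+)?(?P<cls>[0-9a-fA-F]{4}):\s+"
    r"(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})"
)


def parse_lspci_n(text):
    """Return one dict (slot, cls, vendor, device) per recognised line."""
    devices = []
    for line in text.splitlines():
        match = LSPCI_N_LINE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


# Placeholder sample; the device IDs "aaaa" and "bbbb" are invented.
SAMPLE = """\
00:00.0 0600: 8086:aaaa (rev 01)
02:0e.0 Class 0104: 1000:bbbb (rev 02)
"""

if __name__ == "__main__":
    for dev in parse_lspci_n(SAMPLE):
        print(dev["slot"], dev["vendor"] + ":" + dev["device"])
```

On a real system the text would come from a file holding the `lspci -n` output; the resulting IDs can then be checked against the adapter models named in the patch.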

Thanks. 


Comment 12 Daniel Riek 2006-11-21 16:49:11 UTC
Raising as an Exception as we need to find out if we are going to address this
or not. I doubt it will ever get addressed as the underlying IT was closed so my
recommendation is to close it.



Comment 13 Daniel Riek 2007-01-02 13:51:23 UTC
PM NAK based on comment 12 and the lack of activity.

Comment 14 RHEL Program Management 2007-01-02 14:01:19 UTC
Product Management has reviewed and declined this request.  You may appeal this
decision by reopening this request.