Bug 141360 - megaraid2 driver corrupts files during heavy I/O
Summary: megaraid2 driver corrupts files during heavy I/O
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
medium
high
Target Milestone: ---
Assignee: Tom Coughlan
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-11-30 17:54 UTC by Rick Mohr
Modified: 2007-11-30 22:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 19:12:48 UTC
Target Upstream Version:
Embargoed:


Attachments
sysreport output (501.93 KB, application/x-bzip2)
2004-12-03 18:37 UTC, Rick Mohr

Description Rick Mohr 2004-11-30 17:54:23 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20040922

Description of problem:
During some large file transfers w/ rsync, I noticed that some files
appeared to have been corrupted.  After some testing, I have narrowed
the problem down to the megaraid2 driver.

I am running 2.4.21-20.ELsmp kernel on dual Xeon box w/ a MegaRaid
150-6 SATA controller (3x 250GB drives set up as one RAID 5 device). 
The filesystem is ext3.  For my test, I used one of the user home dirs
which contained about 12GB of data.  When I rsync'd the dir to another
location on disk, some files would be corrupted.  I determined this by
running a diff on the two directories.

For some of the files that reported differences, I computed the md5
sum of the original file and of the copy.  The two values differed.
Then I ran debugfs, manually traversed the filesystem, and dumped the
"copy" to a third location.  The md5 sum of this third version matched
the original.  This seemed to indicate that the on-disk copy was
accurate, but that reading the file for the diff (and the ensuing
md5) returned bad data.  I unmounted and then remounted the
filesystem.  The original file and the copy now matched.

However, for some files, the copy did genuinely differ from the
original.  Presumably this was because of bad reads during the initial
rsync.

Every time I ran this test, I saw the same behavior.  When I tried the
rsync with smaller dirs (like linux source code), it would sometimes
work and sometimes it would corrupt files.  I tried the megaraid2,
megaraid_2101, and megaraid_2009 modules all with the same result. 
However, using just the megaraid module (v1.18 I believe) the rsync
test succeeded without a single file corruption.

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. rsync user home dir (~12GB) to another location on RAID partition.
2. Run diff to compare original and copy.

Actual Results:  Some files are corrupted during the transfer.

Expected Results:  All files should be identical.
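
The diff/md5 comparison described above can be automated. The following is only a minimal sketch, not the script used for this report: the function names are made up for illustration, and it assumes both trees are mounted and readable. Given the behavior described earlier (bad data returned on read until the filesystem was remounted), a reported mismatch may reflect a bad read rather than bad on-disk data, so it is worth rechecking after an unmount/remount.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(original_root, copy_root):
    """Yield relative paths whose copy is missing or has a different MD5.

    Only regular files under original_root are checked; anything else
    (broken symlinks, sockets, ...) is skipped.
    """
    for dirpath, _dirnames, filenames in os.walk(original_root):
        for name in filenames:
            original = os.path.join(dirpath, name)
            if not os.path.isfile(original):
                continue
            relative = os.path.relpath(original, original_root)
            copy = os.path.join(copy_root, relative)
            if not os.path.isfile(copy):
                yield relative
            elif md5_of_file(original) != md5_of_file(copy):
                yield relative
```

Calling `list(find_mismatches(original_dir, copy_dir))` with the rsync source and destination directories returns the suspect files; an empty list means every regular file matched.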

Additional info:

Comment 1 Tom Coughlan 2004-12-03 14:10:04 UTC
This is the first report of a problem like this.  I will try to
reproduce it. Please post a sysreport so I can match your config as
closely as possible.  I'd also like to check your firmware rev. Run the
sysreport after you run the rsync test and detect corruption, so I can
check for any errors in the log.

Comment 2 Rick Mohr 2004-12-03 18:37:02 UTC
Created attachment 107857
sysreport output

Comment 3 Rick Mohr 2004-12-03 18:38:12 UTC
For the LSI MegaRaid 150-6 card:

Firmware = 713G
BIOS = G117

Let me know if you need other info.


Comment 4 Tom Coughlan 2004-12-09 14:12:58 UTC
These messages from /var/log/messages appear to be relevant:

Dec  3 09:48:13 nfs2 kernel: attempt to access beyond end of device
Dec  3 09:48:13 nfs2 kernel: 08:07: rw=0, want=1028465668, limit=454792086
Dec  3 09:48:13 nfs2 kernel: attempt to access beyond end of device
Dec  3 09:48:13 nfs2 kernel: 08:07: rw=0, want=1424023736, limit=454792086
Dec  3 09:48:13 nfs2 kernel: attempt to access beyond end of device
Dec  3 09:48:13 nfs2 kernel: 08:07: rw=0, want=1028465668, limit=454792086

Can you confirm that 

1. these occur during the timeframe of the corruption
2. there are messages like this each time you do the test and see
corruption
3. you are copying to /dev/sda7 (from root on /dev/sda3?)

Thanks,

Tom
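
To make it easier to correlate these messages with test runs, the log lines can be parsed mechanically. This is a rough sketch under assumptions: the regular expression is written against the exact lines quoted above (hex major:minor device, then rw/want/limit), and the function and field names are invented for illustration. It makes no claim about the units of want and limit; it only reports how far past the limit each request was.

```python
import re

# Matches e.g. "08:07: rw=0, want=1028465668, limit=454792086"
BEYOND_END = re.compile(
    r"(?P<major>[0-9a-f]{2}):(?P<minor>[0-9a-f]{2}): "
    r"rw=(?P<rw>\d+), want=(?P<want>\d+), limit=(?P<limit>\d+)"
)


def parse_beyond_end(lines):
    """Extract device, want, limit and overshoot from syslog lines."""
    records = []
    for line in lines:
        match = BEYOND_END.search(line)
        if match is None:
            continue
        want = int(match.group("want"))
        limit = int(match.group("limit"))
        records.append({
            # Device numbers are printed in hex, so 08:07 is major 8, minor 7.
            "device": (int(match.group("major"), 16),
                       int(match.group("minor"), 16)),
            "rw": int(match.group("rw")),
            "want": want,
            "limit": limit,
            "overshoot": want - limit,
        })
    return records
```

For the lines quoted above, the requested positions are two to three times the device limit, not just slightly past the end.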




Comment 5 Rick Mohr 2004-12-09 21:52:40 UTC
I reran the rsync test again today, and noted the start and stop times
so I could compare them with the system logs.  The rsync test failed
of course, and in /var/log/messages there was one entry (like those
above) during the timeframe of the test.  After the rsync, I ran a
diff to compare the files, and during that another entry appeared in
the logs.

So the above errors occurred exactly twice: once during rsync and once
during diff.  I cannot verify that the messages appear every time
since I had not been looking at the log after every test run.

As far as the devices go, I am copying from /dev/sda7 to /dev/sda7. 
The /dev/sda3 device is not part of the test.

I also checked the status of the raid device just to make sure the
errors weren't the result of a degraded array.  The array appears to
be perfectly fine and its state is shown as "optimal".

Comment 6 Tom Coughlan 2004-12-20 19:08:44 UTC
We have run two megaraid2 systems for several days without being able to
reproduce the problem. So, we would like to ask your help running a debug kernel
in order to hopefully get more information about the failure. 

The kernel at:

http://people.redhat.com/coughlan/RHEL3-io-debug/kernel-smp-2.4.21-27.3.EL_io_debug.i686.rpm

is a snapshot of our latest development kernel. It has all the changes in the
Update 4 kernel (being released today), plus some additional debug aids turned
on. Of particular note are the following:

- There is an updated megaraid2 driver.
- The system will BUG() if it detects "attempt to access beyond end of device"
- Slab debug and spinlock debug aids are turned on.

Please set up a serial console line so that you capture the full stack trace
etc. if the system crashes.  If this does not turn out to be enough info, we
may ask you to try to capture a disk or net dump. 

Thanks for your help.

Tom



Comment 7 Rick Mohr 2005-01-06 16:11:40 UTC
Well, we sent one system back to the vendor for a fix to this problem.
 Their solution was to replace the LSI card with one from 3Ware.  So
now I no longer have a test system to run the debug kernel.

If I get a chance, I might try to set up the remaining machine to run
the debug kernel before I need to ship it back to the vendor. 
Otherwise, this bug may end up going unresolved.

Comment 8 Rick Mohr 2005-02-15 15:24:49 UTC
I was not able to run any additional testing before the last server
was sent back for repairs.  So I am afraid that I cannot provide any
further information to help out.

Comment 10 Thomas 2005-03-03 16:33:46 UTC
Hello, we have the same problem here with the megaraid2 driver version
2.10.8.2-RH1 provided with RHEL 3.0 Update 4 (2.4.21-27.0.2 x86_64
SMP).
Our platform is an Intel SE7525GP2 (BIOS P07) motherboard with 2 dual
Nocona coupled with an Intel SRCS16 RAID card (rebranded LSI 150-6
board, BIOS G401, firmware 713N).

The problem:
We have configured two RAID 1 volumes, each consisting of 250GB SATA
disks.
When we did an installation with a big "/" partition (230GB), it
completed correctly, but just after the first reboot some random files
were corrupted. fsck complains that the system wasn't shut down
correctly and runs endlessly because it finds errors. It is always
reproducible.
Using a small partition (8GB in our test) seems to "solve" the problem.

If you need more information let me know.



Comment 11 Thomas 2005-03-08 14:14:03 UTC
Update:

I have some new findings to share:
- The hardware has been changed and works perfectly fine. 
- The corruption happened only on the 1st volume (/dev/sda), whatever
it is (a single disk (JBOD) or a RAID 1 volume).
- The megaraid or megaraid2 drivers produce the same results.
- The RH3 x86 works fine.

Should I open a separate bug, as it is a little different?

Comment 12 Tom Coughlan 2005-03-08 14:37:35 UTC
Please provide more details:

1. Exactly what hardware was changed? For example, did you replace the
HBA with the same make/model/revision and now the problem is gone?

2. You implied that the size of the volume may be a factor. When you
switched from RAID to JBOD, did the size remain the same?

3. The failure is always on the first volume. Is that always the
volume that has the o.s. installed in it?

4. You switched from the x86_64 SMP kernel to the x86 (SMP?) kernel on
the same system, and there is no failure, is that right?

Are there any messages in the log? In particular, please search for
"attempt to access beyond end of device" messages, as seen in the
earlier report. It would be best to post a sysreport for the failing
system.

Tom

Comment 13 Thomas 2005-03-08 15:57:57 UTC
1. We replaced the HBA with another of the same model, and the hard
drives with identical hard drives. The point was to be sure that the
hardware is not faulty. The problem was still present.
We will try with another storage controller as soon as we have one.

2. Yes, I used a "/" of 230GB and 2GB of swap in both cases.

3. No, I installed the OS on another volume (sdb in this case), and
the installation went fine. But as soon as I transferred data to sda
(around 40GB in my tests), it was corrupted.

4. Yes x86 SMP. No problems anymore.

5. No message in syslog


Update: The installation went fine with 2GB of memory (2 working
modules of 1GB) on x86_64. When we plugged in the full 4GB (2 more
working modules), the corruption came back.

To sum up: the system has 4GB of physical memory. If we activate the
BIOS option "remap memory" (which remaps 1GB of memory above the 4GB
boundary because of the overlap with the PCI register addresses), the
corruption occurs.
If we don't use that option, the BIOS and Linux see 3GB of
memory and no corruption occurs.
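
One way to confirm whether the remap option actually places RAM above the 4GB boundary is to look at the "System RAM" entries in /proc/iomem. The sketch below is an illustration only: the helper name is made up, and it assumes the usual "start-end : label" line format with hexadecimal addresses. On newer kernels the addresses may read as zero unless the file is read as root.

```python
FOUR_GIB = 1 << 32


def ram_above_4gib(iomem_text):
    """Return (start, end) address ranges of System RAM above 4 GiB.

    iomem_text is the content of /proc/iomem. Ranges that straddle the
    boundary are clipped so that only the part above 4 GiB is returned.
    """
    ranges = []
    for line in iomem_text.splitlines():
        span, _, label = line.partition(" : ")
        if label.strip() != "System RAM":
            continue
        start_text, _, end_text = span.strip().partition("-")
        start, end = int(start_text, 16), int(end_text, 16)
        if end >= FOUR_GIB:
            ranges.append((max(start, FOUR_GIB), end))
    return ranges
```

Usage would be something like `ram_above_4gib(open("/proc/iomem").read())`. An empty list corresponds to the "3GB visible, no corruption" configuration described above; a non-empty list means some RAM sits above the boundary.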

I'll do more investigation and let you know.
All suggestions are welcomed.

Thank you for your time.

Thomas

Comment 14 Thomas 2005-03-09 11:24:53 UTC
Update:

We tried a few other storage controllers (3ware and LSI SCSI) that use
different kernel drivers (3w-9xxx for example), and no corruption occurs.
So IMHO the problem is related to megaraid2 working in x86_64 mode
(megaraid as well).

What we need to clarify:
- why it does not happen when we use the x86 SMP version
- why the corruption occurs at the first reboot and not during the
installation process (different usage of memory?)

I have to deliver the system to the client today, but I'll build
another in 1-2 days to investigate and solve this problem, because it
is not a good solution for the client to lose 1 GB of memory :) 

Thomas.

Comment 15 Need Real Name 2005-07-06 08:50:20 UTC
I can confirm this happens with the megaraid/megaraid2/megaraid_mbox drivers if (and
only if, according to my tests) the system has 4 GB of memory (or possibly more).
The size of the partition doesn't seem to matter, and neither does 32 vs. 64 bits
or RHEL3 vs. RHEL4. It is just a matter of time before the corruption occurs, as
long as there are 4 GB of RAM.

From my experience, the best way to observe this bug is to create some big
filesystems (5-10 of 20 GB each, or more) in parallel, and then fsck them. At
least one of them _will_ be corrupt.

No corruption occurs with 2 Gbytes of RAM.
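
The parallel mkfs/fsck test could be scripted roughly as follows. This is a sketch under several assumptions, not the procedure actually used: the filesystem type (ext3 via `mke2fs -j`) is a guess, the device paths in the usage note are placeholders, and the helper names are invented. Creating a filesystem DESTROYS all data on the device, so the sketch defaults to a dry run that only returns the commands as strings.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(devices):
    """Return (mkfs_commands, fsck_commands) for the given block devices."""
    # mke2fs -q: quiet, -j: add an ext3 journal (assumed filesystem type).
    mkfs = [["mke2fs", "-q", "-j", device] for device in devices]
    # e2fsck -f: force a full check, -n: open read-only, answer "no".
    fsck = [["e2fsck", "-f", "-n", device] for device in devices]
    return mkfs, fsck


def run_parallel(commands, dry_run=True):
    """Run all commands at the same time and return their exit codes.

    With dry_run=True nothing is executed; the command lines are
    returned as strings instead.
    """
    if dry_run:
        return [" ".join(command) for command in commands]
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(subprocess.call, commands))
```

For example, `mkfs, fsck = build_commands(["/dev/sdb1", "/dev/sdb2"])` followed by `run_parallel(mkfs, dry_run=False)` and then `run_parallel(fsck, dry_run=False)` would create the filesystems in parallel and check them afterwards; a non-zero exit code from e2fsck indicates that it found problems.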

Comment 16 Need Real Name 2005-07-06 08:52:00 UTC
*** Bug 158169 has been marked as a duplicate of this bug. ***

Comment 17 Ernie Petrides 2005-07-21 21:40:59 UTC
Bug 158169 is against RHEL4, so I've undone the dup.

Also removing from U5 blocker list while I'm at it.

Comment 18 David Kostal 2005-08-31 10:38:08 UTC
I had a similar problem on a Dell PE 1850 running x86_64 CentOS 3. It was failing
on only one machine out of about 8: when rsyncing an installation from another
machine to this one over the network, it attempted to write beyond the end of the
device. It was (partially) solved by a BIOS and RAID controller (PERC 4i/DC)
firmware upgrade: it does not crash, but RAID init (before OS load) takes about
10x longer than on other machines.

Comment 19 Srdanov Djerdj 2005-09-19 23:30:14 UTC
I recently had a similar problem on a Supermicro X6DAL-G with 4GB RAM and 2x Nocona 2.8GHz.
My solution was to append the noapic boot parameter. Any experience with that?

Comment 20 jason andrade 2005-12-12 23:56:49 UTC
we also see this problem using an Acer R510 w/ a LSI sata megaraid controller and 4G of ram.  we are 
testing this by removing 2G of ram.  obviously it's not a good long term solution as we bought a number
of servers with 4G of ram for a reason :-/

this is with RHEL3/WS and now RHEL4/WS.

-jason


Comment 21 jason andrade 2006-03-24 01:41:29 UTC
we've been following this up with our support through our hardware vendor (acer) and i'm putting some 
of the information we supplied them here so other people in their support area can refer to it.

SCSI subsystem initialized
megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.6 (Release Date: Mon Mar 07 12:27:22 EST 2005)
megaraid: probe new device 0x1000:0x1960:0x1000:0x0523: bus 3:slot 3:func 0
ACPI: PCI interrupt 0000:03:03.0[A] -> GSI 24 (level, low) -> IRQ 209
megaraid: fw version:[713N] bios version:[G119]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID  Model: LD 0 RAID5  476G  Rev: 713N
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 976773120 512-byte hdwr sectors (500108 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
Attached scsi disk sda at scsi0, channel 1, id 0, lun 0

our current fix is to turn on 'memory mirroring' which is the equivalent of removing 2G of ram from our 
4G systems and the problem has not recurred since.

-jason


Comment 22 illtud 2006-05-20 15:20:11 UTC
Is this still an issue? We're bringing four x86_64 systems (Sun v40z's) to
production on RHEL 3 with an LSI RAID 5 array. Each system has 20GB of memory.
We haven't seen data corruption yet, but I'd like to know what the status of
this high priority bug is. Data corruption is surely a showstopper.

Comment 23 RHEL Program Management 2007-10-19 19:12:48 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.

