From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040922

Description of problem:
During some large file transfers with rsync, I noticed that some files appeared to have been corrupted. After some testing, I have narrowed the problem down to the megaraid2 driver. I am running the 2.4.21-20.ELsmp kernel on a dual Xeon box with a MegaRAID 150-6 SATA controller (3x 250GB drives set up as one RAID 5 device). The filesystem is ext3.

For my test, I used one of the user home dirs, which contained about 12GB of data. When I rsync'd the dir to another location on disk, some files would be corrupted. I determined this by running a diff on the two directories. For some of the files that reported differences, I computed the md5 sum of the original file and of the copy; the two values differed. Then I ran debugfs, manually traversed the filesystem, and dumped the "copy" to a third location. The md5 sum of this third version matched the original. This seemed to indicate that the on-disk copy was accurate, but that reading the file back for the diff (and the ensuing md5) returned bad data.

I unmounted and then remounted the filesystem. The original file and the copy now matched. However, some copies did genuinely differ from their originals, presumably because of bad reads during the initial rsync.

Every time I ran this test, I saw the same behavior. When I tried the rsync with smaller dirs (like the Linux source tree), it would sometimes work and sometimes corrupt files. I tried the megaraid2, megaraid_2101, and megaraid_2009 modules, all with the same result. However, using just the megaraid module (v1.18, I believe), the rsync test succeeded without a single file corruption.

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. rsync a user home dir (~12GB) to another location on the RAID partition.
2. Run diff to compare original and copy.

Actual Results: Some files are corrupted during the transfer.

Expected Results: All files should be identical.

Additional info:
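The copy-and-compare procedure above can be sketched as a small script. This is only a hedged illustration: it uses synthetic data in temp dirs and `cp -a` as a stand-in for `rsync -a`, whereas the original test copied a ~12GB home directory within the RAID volume.

```shell
# Sketch of the reproduction/verification steps: copy a tree, then
# compare sorted per-file md5 sums of original and copy.
# Paths and sample files are placeholders, not from the bug report.
set -e
src=$(mktemp -d); dst=$(mktemp -d)
printf 'hello\n' > "$src/a"
printf 'world\n' > "$src/b"
cp -a "$src/." "$dst/"            # stand-in for: rsync -a "$src/" "$dst/"
# md5sum prints "<sum>  <path>"; sorting makes the two lists comparable.
orig=$( cd "$src" && find . -type f -exec md5sum {} + | sort )
copy=$( cd "$dst" && find . -type f -exec md5sum {} + | sort )
[ "$orig" = "$copy" ] && echo "no corruption detected"
rm -rf "$src" "$dst"
```

On the affected driver, the two checksum lists would differ for some files even though (per the debugfs check above) the on-disk data could still be intact.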
This is the first report of a problem like this. I will try to reproduce it. Please post a sysreport so I can match your config as closely as possible. I'd also like to check your firmware rev. Run the sysreport after you run the rsync test and detect corruption, so I can check for any errors in the log.
Created attachment 107857 [details] sysreport output
For the LSI MegaRAID 150-6 card:
Firmware = 713G
BIOS = G117
Let me know if you need other info.
These messages from /var/log/messages appear to be relevant:

Dec 3 09:48:13 nfs2 kernel: attempt to access beyond end of device
Dec 3 09:48:13 nfs2 kernel: 08:07: rw=0, want=1028465668, limit=454792086
Dec 3 09:48:13 nfs2 kernel: attempt to access beyond end of device
Dec 3 09:48:13 nfs2 kernel: 08:07: rw=0, want=1424023736, limit=454792086
Dec 3 09:48:13 nfs2 kernel: attempt to access beyond end of device
Dec 3 09:48:13 nfs2 kernel: 08:07: rw=0, want=1028465668, limit=454792086

Can you confirm that:
1. these occur during the timeframe of the corruption
2. there are messages like this each time you do the test and see corruption
3. you are copying to /dev/sda7 (from root on /dev/sda3)?

Thanks, Tom
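As an aside (my reading, not stated in the report): in 2.4 kernels the want/limit values in this message are block counts, apparently 1 KB blocks here, so a quick calculation shows how far past the end of the ~434GB partition these reads fell:

```shell
# Hedged aside: treating want/limit as 1 KB block numbers (2.4 kernel
# convention), compute how far past the device end each request was.
limit=454792086
for want in 1028465668 1424023736; do
    overshoot=$(( (want - limit) / 1024 / 1024 ))   # KB blocks -> GB
    echo "want=$want is ~${overshoot} GB past the end of the device"
done
```

Requests hundreds of GB past the end of a partition suggest garbage block numbers rather than an off-by-a-little accounting error.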
I ran the rsync test again today, noting the start and stop times so I could compare them with the system logs. The rsync test failed, of course, and in /var/log/messages there was one entry (like those above) during the timeframe of the test. After the rsync, I ran a diff to compare the files, and during that another entry appeared in the logs. So the above errors occurred exactly twice: once during rsync and once during diff. I cannot verify that the messages appear every time, since I had not been looking at the log after every test run. As far as the devices go, I am copying from /dev/sda7 to /dev/sda7. The /dev/sda3 device is not part of the test. I also checked the status of the RAID device just to make sure the errors weren't the result of a degraded array. The array appears to be perfectly fine and its state is shown as "optimal".
We have run two megaraid2 systems for several days without being able to reproduce the problem. So, we would like to ask your help running a debug kernel, in the hope of getting more information about the failure. The kernel at:

http://people.redhat.com/coughlan/RHEL3-io-debug/kernel-smp-2.4.21-27.3.EL_io_debug.i686.rpm

is a snapshot of our latest development kernel. It has all the changes in the Update 4 kernel (being released today), plus some additional debug aids turned on. Of particular note:
- There is an updated megaraid2 driver.
- The system will BUG() if it detects "attempt to access beyond end of device".
- Slab debug and spinlock debug aids are turned on.

Please set up a serial console line so that you capture the full stack trace etc. if the system crashes. If this does not turn out to be enough info, we may ask you to try to capture a disk or net dump. Thanks for your help. Tom
Well, we sent one system back to the vendor for a fix to this problem. Their solution was to replace the LSI card with one from 3Ware. So now I no longer have a test system to run the debug kernel. If I get a chance, I might try to set up the remaining machine to run the debug kernel before I need to ship it back to the vendor. Otherwise, this bug may end up going unresolved.
I was not able to run any additional testing before the last server was sent back for repairs. So I am afraid that I cannot provide any further information to help out.
Hello, we have the same problem here with the megaraid2 driver version 2.10.8.2-RH1 provided with RHEL 3.0 Update 4 (2.4.21-27.0.2 x86_64 SMP). Our platform is an Intel SE7525GP2 motherboard (BIOS P07) with dual Nocona CPUs, coupled with an Intel SRCS16 RAID card (rebranded LSI 150-6 board, BIOS G401, firmware 713N).

The problem: we have configured two RAID 1 volumes, each consisting of 250GB SATA disks. When we did an installation with a big "/" partition (230GB), it completed correctly, but just after the first reboot some random files had been corrupted. fsck complains that the system wasn't shut down correctly and runs endlessly because it keeps finding errors. It is always reproducible. Using a small partition (8GB in our test) seems to "solve" the problem. If you need more information, let me know.
Update: I have some new discoveries to share:
- The hardware has been swapped and works perfectly fine.
- The corruption happens only on the first volume (/dev/sda), whatever it is (a single disk (JBOD) or a RAID 1 volume).
- The megaraid and megaraid2 drivers produce the same results.
- RHEL 3 x86 works fine.
Should I open a different bug, as it is a little different?
Please provide more details:
1. Exactly what hardware was changed? For example, did you replace the HBA with the same make/model/revision and now the problem is gone?
2. You implied that the size of the volume may be a factor. When you switched from RAID to JBOD, did the size remain the same?
3. The failure is always on the first volume. Is that always the volume that has the OS installed on it?
4. You switched from the x86_64 SMP kernel to the x86 (SMP?) kernel on the same system, and there is no failure, is that right?

Are there any messages in the log? In particular, please search for "attempt to access beyond end of device" messages, as seen in the earlier report. It would be best to post a sysreport for the failing system. Tom
1. We tried replacing the HBA with the same model, and the hard drives with identical hard drives. The point was to be sure that the hardware was not faulty. The problem was still present. We will try another storage controller as soon as we have one.
2. Yes, I used a "/" of 230GB and 2GB of swap in both cases.
3. No, I installed the OS on another volume (sdb in this case), and the installation went fine. But as soon as I transferred data onto sda, it was corrupted (around 40GB in my tests).
4. Yes, x86 SMP. No problems anymore.
5. No messages in syslog.

Update: the installation went fine with 2GB of memory (2 working modules of 1GB) on x86_64; when we plugged in the full 4GB (2 more working modules), corruption again...

To sum up: the system has 4GB of physical memory. If we activate the BIOS option "remap memory" (which remaps 1GB of memory beyond 4GB because of the overlap with the PCI register addresses), the corruption occurs. If we don't use that option, the BIOS and Linux see 3GB of memory and no corruption occurs. I'll do more investigation and let you know. All suggestions are welcome. Thank you for your time. Thomas
Update: we tried a few other storage controllers (3ware and LSI SCSI) that use different kernel drivers (3w-9xxx, for example), and no corruption occurs. So, IMHO, the problem is related to megaraid2 running in x86_64 mode (megaraid as well). What we need to clarify:
- why it's not happening when we use the x86 SMP version
- why the corruption occurs at the first reboot and not during the installation process (different usage of memory?)
I'll have to deliver the system to the client today, but I'll build another in 1-2 days to investigate and solve this problem, because losing 1GB of memory is not a good solution for the client :) Thomas.
I can confirm this happens with the megaraid/megaraid2/megaraid_mbox drivers if (and only if, according to my tests) the system has 4 GB of memory (or possibly more). The size of the partition doesn't seem to matter, nor does 32-bit vs. 64-bit, nor RHEL3 vs. RHEL4. As long as there are 4 GB of RAM, it is just a matter of time before the corruption occurs. From my experience, the best way to observe this bug is to create some big filesystems in parallel (5-10 of 20GB each, or more), and then fsck them. At least one of them _will_ be corrupt. No corruption occurs with 2 GB of RAM.
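The parallel mkfs-then-fsck stress test described above can be approximated without spare disks or root using small loopback image files. This is only a sketch under stated assumptions: image sizes, counts, and the ext2 filesystem type are placeholders, whereas the real test used 20GB volumes on the RAID array.

```shell
# Hedged sketch of the stress test: build several filesystems in
# parallel, then force a read-only fsck of each. On the buggy driver
# with 4 GB of RAM, at least one check would report corruption.
set -e
dir=$(mktemp -d)
for i in 0 1 2; do
    truncate -s 16M "$dir/fs$i.img"     # small sparse backing file
    mke2fs -F -q "$dir/fs$i.img" &      # create filesystems in parallel
done
wait
for i in 0 1 2; do
    e2fsck -fn "$dir/fs$i.img" > /dev/null   # -f force, -n read-only
done
echo "all clean"
rm -rf "$dir"
```

A clean `e2fsck -fn` pass exits 0 for every image; any nonzero exit here would be the corruption signature the comment describes.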
*** Bug 158169 has been marked as a duplicate of this bug. ***
Bug 158169 is against RHEL4, so I've undone the dup. Also removing from U5 blocker list while I'm at it.
I had a similar problem on a Dell PE 1850 running x86_64 CentOS 3. It was failing on only one machine out of about eight: when rsyncing an installation from another machine to this one over the network, it attempted to write beyond the end of the device. It was (partially) solved after a BIOS and RAID controller (PERC 4i/DC) firmware upgrade: it no longer crashes, but RAID init (before OS load) takes about 10x longer than on the other machines.
I recently had a similar problem on a Supermicro X6DAL-G with 4GB RAM and 2x Nocona 2.8GHz. My solution was to append the noapic boot parameter. Any experience with that?
We also see this problem using an Acer R510 with an LSI SATA MegaRAID controller and 4GB of RAM. We are testing by removing 2GB of RAM; obviously that's not a good long-term solution, as we bought a number of servers with 4GB of RAM for a reason :-/ This is with RHEL3/WS and now RHEL4/WS. -jason
We've been following this up through our hardware vendor's (Acer's) support, and I'm putting some of the information we supplied them here so other people in their support area can refer to it.

SCSI subsystem initialized
megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03 EST 2005)
megaraid: 2.20.4.6 (Release Date: Mon Mar 07 12:27:22 EST 2005)
megaraid: probe new device 0x1000:0x1960:0x1000:0x0523: bus 3:slot 3:func 0
ACPI: PCI interrupt 0000:03:03.0[A] -> GSI 24 (level, low) -> IRQ 209
megaraid: fw version:[713N] bios version:[G119]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID  Model: LD 0 RAID5  476G  Rev: 713N
  Type: Direct-Access  ANSI SCSI revision: 02
SCSI device sda: 976773120 512-byte hdwr sectors (500108 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
Attached scsi disk sda at scsi0, channel 1, id 0, lun 0

Our current fix is to turn on 'memory mirroring', which is the equivalent of removing 2GB of RAM from our 4GB systems, and the problem has not recurred since. -jason
Is this still an issue? We're bringing four x86_64 systems (Sun v40z's) to production on RHEL 3 with an LSI RAID 5 array. Each system has 20GB of memory. We haven't seen data corruption yet, but I'd like to know what the status of this high priority bug is. Data corruption is surely a showstopper.
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission-critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.