Red Hat Bugzilla – Bug 211752
Data corruption in copying large files
Last modified: 2007-11-30 17:11:46 EST
Description of problem:
I've spent the last two weeks tracking down the cause of data corruption in
large files (between 500MB and 4GB) on a pair of SATA hard drives. The symptom
is that after copying a file, an MD5 checksum of the copy against the original
would show a mismatch. Detailed comparison of the files using 'cmp -l' shows
that the differences are clustered into one or more 4K chunks, but almost every
byte within these chunks differs from the original.
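The chunk-level clustering can be measured directly from the 'cmp -l' output. A sketch (file paths are placeholders):

```shell
# cmp -l prints the 1-based byte offset of every differing byte.
# Group offsets by 4K chunk and count differing bytes per chunk.
cmp -l /diska/file.DAT /diskb/file.DAT \
  | awk '{ print int(($1 - 1) / 4096) }' \
  | uniq -c
# One output line per corrupted 4K chunk: "<differing-byte-count> <chunk-index>"
```

If almost every byte in a chunk differs, the count on each line will be close to 4096.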
I tested the hardware using Maxtor's PowerMax diagnostics (both a full read and
burn-in tests) and memtest86+ (multiple passes running overnight). The hardware
tests all passed.
I tried updating the hard disk firmware and the motherboard BIOS. That did not
solve the problem.
I tried running a similar copy test under Windows XP (32-bit, on separate
partitions) using Cygwin. All files copied correctly.
I had both FC5 and FC6 RC2 installed, so I tried copying files under different
kernels. In all versions -- 2.6.15-1.2056_FC5, 2.6.17-1.2630.fc6, and
2.6.18-1.2798.fc6 -- I was able to reproduce the problem.
I found a linux-kernel message on the web (http://lkml.org/lkml/2006/9/8/289)
from someone who has a similar hardware and software setup to mine and was
having data corruption problems as well. I wrote to him for a follow-up, and he
reports that after upgrading the kernel from 2.6.17-1.2157_FC5 to 2.6.18-rc6, he
has not seen any further data corruption, although he admits that they have not
done much follow-up testing.
I collected MD5 checksums for the individual 4K blocks in the files that had
been corrupted (both source and destination) and discovered something very
interesting: some of the corrupted blocks in the destination files were found to
be an exact MD5 match for *different* blocks in previously copied files! Since
the hard drive's cache is only 16MB and the distance between one of the
source blocks and the destination block to which it got incorrectly written was
over 751MB, I think that's a clear indication that the kernel is writing the
wrong disk buffer out to disk. Most of my system's 4GB of memory is being used
for cache (3.4GB).
At this point I suspect a race condition may be overwriting buffer pointers.
But given the number of times I've copied the same files over and over in
testing, it's also possible I'm looking at old sector data in which buffered
writes never got flushed out. I also would not rule out the possibility of a
hardware problem with DMA transfers, though I have no idea how to test that.
Version-Release number of selected component (if applicable):
Tested with the following kernels: 2.6.15-1.2056_FC5, 2.6.17-1.2630.fc6, and
2.6.18-1.2798.fc6.
How reproducible:
I'd estimate 1 bad 4K block in every 200,000 for unbroken streaming writes.
I've been able to reproduce the problem fairly consistently, except that this
morning, after starting the computer up, I was able to copy over 23GB of data
without any checksum errors. I don't know if that's because the computer is
cooler or if there is some other condition of the test that's different.
Steps to Reproduce:
1. Create (or download/copy from an external source) one or more files roughly
1GB in size.
2. Copy the file(s) from one hard disk to another.
3. Run md5sum over the original and the copy and compare.
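The steps above can be scripted as follows (a sketch; the mount points and test file name are placeholders):

```shell
# Step 1: create a ~1GB test file on disk A.
dd if=/dev/urandom of=/diska/test.DAT bs=1M count=1024
# Step 2: copy it to disk B.
cp -av /diska/test.DAT /diskb/test.DAT
# Step 3: checksum original and copy; the two hashes should match.
md5sum /diska/test.DAT /diskb/test.DAT
```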
Actual results:
MD5 checksums are different in 50-75% of the files. A bytewise comparison using
'cmp -l' shows that differences are localized in one or a few 4K chunks.
Expected results:
MD5 checksums should be equal.
Additional info:
Motherboard: Tyan Thunder K8WE (S2895A), BIOS upgraded to 1.04
CPU: (2) AMD Opteron 270 HE, 2.0GHz
I/O Controller: on-board nVidia nForce4
Hard disks: (2) Maxtor DiamondMax 10 300GB SATA, firmware upgraded to BANC1G20
Here are the results of my most recent test. I started with a group of files
that I had copied onto both hard drives, and repeatedly re-synced until the
checksums matched. I then ran a shell command which went through each file and
made one copy from disk A to disk B and another copy from disk B to disk A:
$ for file in *.DAT ; do
>   cp -av /diska/$file /diskb/$file.2
>   cp -av /diskb/$file /diska/$file.2
> done
Next, I gathered MD5 checksums for each 4K block in the source and destination
files:
$ for file in *.DAT *.DAT.2 ; do
>   size=`du $file | cut -f 1` ; size=$((size/4)) ; block=0
>   cat /dev/null > $file.MD5s
>   while [ $block -lt $size ] ; do
>     echo -ne "$file: $block\r"
>     dd if=$file bs=4096 skip=$block count=1 2>/dev/null | md5sum >> $file.MD5s
>     block=$((block+1))
>   done
>   echo "$file: finished"
> done
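The loop above runs one dd invocation per 4K block, which is slow for multi-GB files. An equivalent sketch (assuming GNU coreutils, with $file set to the file being checksummed) splits the file once and produces the same "hash  -" line format:

```shell
# Split the file into 4096-byte chunks in one pass, then checksum each chunk.
# Suffix length 8 keeps the chunk file names in block order.
tmp=$(mktemp -d)
split -b 4096 -a 8 -d "$file" "$tmp/blk."
ls "$tmp" | sort | while read chunk ; do
  md5sum < "$tmp/$chunk"
done > "$file.MD5s"
rm -r "$tmp"
```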
Using the .MD5s files I could easily compare chunks in one file to chunks in
other files no matter which block they occurred in. I will attach an extract
from these files showing the blocks that differed between the original and
copy A or B. In all but one case, the corrupted block matches a block that came
from an earlier file or an earlier block of the same file. The sole exception
(copy A, file 3, block 159829) can be explained by the fact that I copied the
same file from A to B first, so that whole file would have been cached when
copying back from B to A.
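Since each .MD5s file holds one hash per 4K block, a corrupted block's hash can be traced back to its origin with grep. A sketch (the FILE3 name is a placeholder):

```shell
# The hash of destination block N is on line N+1 of its .MD5s file.
block=159829
bad=$(sed -n "$((block + 1))p" FILE3.DAT.2.MD5s | cut -c 1-32)
# Search every block list for the same hash; grep -n reports line = block + 1.
grep -n "^$bad" *.MD5s
```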
This still doesn't rule out the possibility that buffers simply aren't being
written out. So for my next test, I'm going to erase all of these copies, fill
the hard disk (as much as I can) with a simple fixed pattern, and try creating
the copies again.
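One way to run that stale-sector test (a sketch; the mount point is a placeholder, and the dd write simply stops when the disk fills):

```shell
# Fill the free space with a recognizable repeating pattern, sync, then
# delete the fill file. If a later "corrupted" block contains this pattern,
# it is stale sector data rather than a misdirected buffer.
yes 'STALE-SECTOR-FILL-PATTERN' | dd of=/diskb/fill.tmp bs=1M
sync
rm /diskb/fill.tmp
```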
Created attachment 139072 [details]
List of MD5 checksums for corrupted blocks
I'm sorry, but after numerous tests in trying to analyze the problem in a more
controlled environment, I have been unable to reproduce the file corruption. At
the moment my best guess is that I had not power-cycled the computer between
upgrading the hard drive firmware and performing the test copies last Friday.
All test files have copied without corruption since I switched the computer on
yesterday. So I think that the firmware upgrade required a cold
restart in order to take effect.
I don't understand why that would be the case, since everything I can find on
this firmware upgrade seems to indicate it is meant to fix a drive detection
problem, not data corruption, and that doesn't explain why the errors are in
Linux page sizes. But then Maxtor doesn't say exactly what the difference is
between BANC1G10 and BANC1G20.