Red Hat Bugzilla – Bug 229239
Data corruption with multiple SATA devices -- believe it to be x86_64 specific! driver problem
Last modified: 2008-05-06 23:58:09 EDT
Description of problem:
System has: 1 ATA hard drive (hosting FC6 with latest updates)
2 internal SATA hard drives connected to sata_nv on motherboard
1 internal SATA hard drive && 1 external SATA drive (for backups) connected to
sata_sil in pci slot
3 internal SATA's function as raid0 (/dev/md0). Devices connected to the
sata_sil experience data corruption! Observe this session:
I have a large file that I created by cat'ing a bunch of .iso's together (from
/dev/hda -> /dev/md0). The file is called "knownquantity.dat".
Here is the size:
[09:58:39 davidb@marvin raid]$ ls -l knownquantity.dat
-rw-rw-r-- 1 davidb davidb 13569177600 Feb 19 09:54 knownquantity.dat
Here is the initial checksum from on the raid:
[09:56:55 davidb@marvin raid]$ md5sum knownquantity.dat
Now I copy it to the external SATA backup device (/dev/sdc1) via the sata_sil driver.
[09:59:57 davidb@marvin raid]$ sudo cp -a knownquantity.dat /mnt/backup/
Now, I'm done so I check the size and it looks OK:
[10:03:25 davidb@marvin raid]$ ls -l /mnt/backup/knownquantity.dat
-rw-rw-r-- 1 davidb davidb 13569177600 Feb 19 09:54 /mnt/backup/knownquantity.dat
Now I checksum the copied file to see if it copied correctly:
[10:04:44 davidb@marvin raid]$ md5sum /mnt/backup/knownquantity.dat
Uh oh!! 5a5e6ab2b9be30262768fa8b69cf9c0c != 4f7ff463f0593902e843e53534e258ab
How bad is the problem? Let's try to checksum the file a few more times just on
the raid:
[10:17:45 davidb@marvin raid]$ md5sum knownquantity.dat
[10:19:03 davidb@marvin raid]$ md5sum knownquantity.dat
[10:21:11 davidb@marvin raid]$ md5sum knownquantity.dat
[10:22:32 davidb@marvin raid]$ md5sum knownquantity.dat
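The repeated md5sum runs above can be scripted so that a single number says whether the reads are stable. This is only a minimal sketch: repeat_md5 is a made-up helper name, and it assumes GNU md5sum and awk are installed.

```shell
# repeat_md5 FILE [RUNS]: checksum FILE several times and print how many
# distinct MD5 values came back. A stable read path always prints 1.
# The body runs in a subshell so its variables do not leak to the caller.
repeat_md5() (
    file=$1
    runs=${2:-4}
    i=0
    while [ "$i" -lt "$runs" ]; do
        md5sum < "$file" | awk '{print $1}'
        i=$((i + 1))
    done | sort -u | wc -l | tr -d '[:space:]'
)
```

For example, "repeat_md5 knownquantity.dat 4" would print 1 if all four reads agree. A file much smaller than RAM may be served from the page cache on later runs, so this is most telling on files larger than memory, like the one used here.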
Version-Release number of selected component (if applicable):
Linux marvin 2.6.19-1.2911.fc6 #1 SMP Sat Feb 10 15:16:31 EST 2007 x86_64 x86_64
Steps to Reproduce:
1. Have a big file on an x86_64 machine with the sata_sil device.
2. Copy it to the drive using sata_sil.
3. Check the checksums.
Expected results:
A matching checksum.
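The reproduction steps can be wrapped in a small function along these lines. It is a sketch rather than the exact commands used in this report: copy_and_verify is a hypothetical name, and it assumes cp, sync, md5sum and awk are available.

```shell
# copy_and_verify SRC DEST: copy SRC to DEST, then compare MD5 sums.
# Prints both sums. Returns 0 on a match, 1 on a mismatch, 2 if cp fails.
# The body runs in a subshell so "exit" only leaves the function.
copy_and_verify() (
    src=$1
    dest=$2
    cp -- "$src" "$dest" || exit 2
    sync    # flush dirty pages to the device before reading back
    want=$(md5sum < "$src" | awk '{print $1}')
    got=$(md5sum < "$dest" | awk '{print $1}')
    echo "source: $want"
    echo "copy:   $got"
    [ "$want" = "$got" ]
)
```

It would be called as "copy_and_verify knownquantity.dat /mnt/backup/knownquantity.dat". The read-back may be partly served from the page cache, so for a stricter test either unmount and remount the target first, or as root run "echo 3 > /proc/sys/vm/drop_caches" after the sync.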
The funny thing though, even though these checksums are off, the problem is
probably just a few bytes. I have VMware virtual machines on the raid and they
do function. No, I'm not going to leave it this way -- but that's why this
problem has been hard to find until now.
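The guess that only a few bytes differ can be checked directly rather than inferred. A minimal sketch, assuming both copies have the same length (as the ls output shows) and using a hypothetical helper name:

```shell
# count_diff_bytes A B: print the number of byte positions at which the
# two files differ. "cmp -l" prints one line per differing byte, so
# counting its output lines gives the number of corrupted bytes.
count_diff_bytes() (
    cmp -l "$1" "$2" | wc -l | tr -d '[:space:]'
)
```

Running "cmp -l knownquantity.dat /mnt/backup/knownquantity.dat | head" by hand also shows where the differences are: the first column is the byte offset (decimal, starting at 1) and the next two are the differing byte values in octal. Whether the offsets cluster on page or block boundaries could help narrow down the cause.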
The RAID0 used to be just two sata drives on the sata_nv and I don't recall
having any trouble. Then I added the sata_sil and the third drive and that's
when weird things started happening. Recognizing the data corruption and
pinpointing it is a relatively recent development.
You should know that I have used an identical sata_sil PCI card and the exact
same external hard drive to back up a 32 bit machine and have experienced no
data corruption. This is verified by the fact that I do my backups using tar
-czf /mnt/backup/file.tgz and then following the backup I have started doing a
tar -tvzf /mnt/backup/file.tgz to validate the file. I successfully backed up
and tested a 161G file, so gzip vouches for the data integrity. There is,
however, only a single SATA controller in that 32 bit machine. So it could be a
64 bit driver issue (likely) or a SATA stack kernel issue (to me, less-likely).
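The backup-then-list check described above looks roughly like this as a script. This is a sketch under assumptions: backup_and_verify is a hypothetical name and the paths are whatever the caller passes in.

```shell
# backup_and_verify SRC_DIR ARCHIVE: write a gzip-compressed tarball of
# SRC_DIR, then read the whole archive back. Listing it makes gzip
# decompress every byte and check the CRC-32 stored at the end.
# Returns 0 if the archive reads back cleanly, non-zero otherwise.
backup_and_verify() (
    src_dir=$1
    archive=$2
    tar -czf "$archive" -C "$src_dir" . || exit 2
    tar -tzf "$archive" > /dev/null
)
```

Note what this does and does not prove: a clean listing shows that the bytes read back from the backup drive are the bytes gzip wrote. It would not catch data that was already corrupted when it was read from the source disks.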
Output of lspci on the machine
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51G [GeForce 6100] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
03:06.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)
I wanted to do some additional tests on this problem.
First, I went and created a copy of knownquantity.dat onto /dev/hda (a known
good, reliable basic ata drive). Then I checksummed knownquantity.dat on that
drive. Then I copied it from /dev/hda -> /dev/sdc1 (/mnt/backup in the above
examples; the external SATA drive on sata_sil). Then checksummed it from the
external drive. Much to my surprise, it checked out. So I did it two more
times (copying over top of the same file). It checked out the other two times.
Then I said: let's sanity check sata_nv. So I went and dropped /dev/md0 and
re-created it with only the 2 sata drives (/dev/sda, /dev/sdb). Then I copied
knownquantity.dat from /dev/hda to the new /dev/md0. Then I checksummed it, and it
checked out. Then I checksummed it again a couple of times. Again, it checked out.
Then I re-created /dev/md0 again with the third internal drive running on
sata_sil, re-copied knownquantity.dat onto it. Immediately after the copy, the
md5sum doesn't match.
I am willing to consider 3 device raid0 as part of the culprit. I will try the
test again with only one drive from sata_nv and one drive from sata_sil. I am
betting I will experience the same problem, however.
It looks at this point like it is related to multiple sata drivers or how the
sata_sil drivers handle being in a raid0.
How much memory is installed on this machine? If it's >4GB or you
have a memory hole there can be problems. You can use the boot
option iommu=soft to work around this (edit /etc/grub.conf and add it
after the "root=" entry for your kernel).
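After editing /etc/grub.conf and rebooting, it is worth confirming that the option actually reached the running kernel. A minimal sketch, using iommu=soft (the option the reporter says he is running with later in this thread) as the example; cmdline_has is a hypothetical helper name.

```shell
# cmdline_has OPTION [FILE]: succeed if OPTION appears as a complete,
# space-separated word on the kernel command line.
# FILE defaults to /proc/cmdline; the override exists for testing.
cmdline_has() (
    option=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | grep -qxF -- "$option"
)
```

Usage would be something like: cmdline_has iommu=soft && echo "option is active". Matching whole words avoids a false positive from a longer option that merely contains the same text.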
Also, corruption like this can be caused by not enough
power getting to the drives. Your power supply may not be
big enough to drive them, especially when they are all
doing I/O at once.
Those are some insightful questions.
cat /proc/meminfo gives this number:
MemTotal: 3972512 kB
I did have to enable "HardWare Memory Hole" in order to see the full 4GB. So I
will go try your kernel option.
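For comparing against the 4GB threshold, the MemTotal line can be converted from kB to GiB. A small sketch with a hypothetical helper name; it only reads a meminfo-style file and assumes awk is available.

```shell
# mem_total_gib [FILE]: print the MemTotal value of a meminfo-style
# file in GiB with two decimals. The kernel reports the value in kB.
# FILE defaults to /proc/meminfo.
mem_total_gib() (
    file=${1:-/proc/meminfo}
    awk '/^MemTotal:/ { printf "%.2f\n", $2 / 1048576 }' "$file"
)
```

For the 3972512 kB reported above this prints 3.79, which fits a machine with 4GB installed and part of it remapped above the memory hole.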
As far as a power supply, I have a 550W. I would have thought that would be
enough. 4 internal drives, the external has an external power source.
Let me see my mileage on the memory hole option and I will get back with you.
iommu=soft makes my md5sums work! :)
Thank you very much for your help! You are a star!
Probably a duplicate of bug 223238. We have a patch as of yesterday that fixes
the problem, but the root cause appears (at the moment) to be a silicon erratum.
This problem seems to be associated with the Nvidia chipsets. What was the
exact platform that was having the problem (Vendor/model, etc).
Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer
test releases. We're cleaning up the bug database and making sure important bug
reports filed against these test releases don't get lost. It would be helpful if
you could test this issue with a released version of Fedora or with the latest
development / test release. Thanks for your help and for your patience.
[This is a bulk message for all open FC5/FC6 test release bugs. I'm adding
myself to the CC list for each bug, so I'll see any comments you make after this
and do my best to make sure every issue gets proper attention.]
The motherboard is an MSI K8NGM2-L. I believe the chipset is the nforce 410.
I have another confession, however. I didn't realize it but I think I had an
overclocked northbridge to southbridge link. Running with iommu=soft made
everything run smooth, with the exception that I would have the
occasional "lockup" on IO and would have to hard-reset the system. Then when I
went into the BIOS and did "Load Optimized Defaults", it changed my northbridge
to southbridge link from 1000MHz to 800MHz.
Since I did that, I have not had a single "lockup" and everything appears to be
much more stable. I still have iommu=soft.
I am willing to take iommu=soft off to test, if there have been some fixes.
(In reply to comment #8)
> This problem seems to be associated with the Nvidia chipsets. What was the
> exact platform that was having the problem (Vendor/model, etc).
Bug 223238 is Top Sekret, so I can't mark this as a dupe of that.
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.
If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)
Thanks for your help, and we apologize again that we haven't handled
these issues to this point.
The process we're following is outlined here:
We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.
This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.
If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.
The process we're following is outlined here:
This bug may be closed. The issue was resolved with a fixed driver that is in
the mainline kernel, I believe it went in in 18.104.22.168 or something like that.
It has been a long time.
The problem was caused by an out-of-spec driver for the nforce chip. You can
read about it in the kernel changelogs if you care, but this issue really is closed.