Description of problem:

System has:
- 1 ATA hard drive (hosting FC6 with latest updates)
- 2 internal SATA hard drives connected to sata_nv on the motherboard
- 1 internal SATA hard drive and 1 external SATA drive (for backups) connected to sata_sil in a PCI slot

The 3 internal SATA drives function as RAID0 (/dev/md0). Devices connected to the sata_sil experience data corruption!

Observe this session. I have a large file I created by cat'ing a bunch of .iso's together (from /dev/hda -> /dev/md0). The file is called "knownquantity.dat". Here is the size:

[09:58:39 davidb@marvin raid]$ ls -l knownquantity.dat
-rw-rw-r-- 1 davidb davidb 13569177600 Feb 19 09:54 knownquantity.dat

Here is the initial checksum on the raid:

[09:56:55 davidb@marvin raid]$ md5sum knownquantity.dat
4f7ff463f0593902e843e53534e258ab  knownquantity.dat

Now I copy it to the external SATA backup device (/dev/sdc1) via the sata_sil driver:

[09:59:57 davidb@marvin raid]$ sudo cp -a knownquantity.dat /mnt/backup/

Now I'm done, so I check the size and it looks OK:

[10:03:25 davidb@marvin raid]$ ls -l /mnt/backup/knownquantity.dat
-rw-rw-r-- 1 davidb davidb 13569177600 Feb 19 09:54 /mnt/backup/knownquantity.dat

Now I checksum the copied file to see if it copied correctly:

[10:04:44 davidb@marvin raid]$ md5sum /mnt/backup/knownquantity.dat
5a5e6ab2b9be30262768fa8b69cf9c0c  /mnt/backup/knownquantity.dat

Uh oh!! 5a5e6ab2b9be30262768fa8b69cf9c0c != 4f7ff463f0593902e843e53534e258ab

How bad is the problem? Let's checksum the file a few more times, just on the raid.
[10:17:45 davidb@marvin raid]$ md5sum knownquantity.dat
0ce17ed09df818197a8e707d1833177d  knownquantity.dat
[10:19:03 davidb@marvin raid]$ md5sum knownquantity.dat
5357648fc4600c7e771057170a1fa407  knownquantity.dat
[10:21:11 davidb@marvin raid]$ md5sum knownquantity.dat
f16ddd9a507d3543afc4b995adf5348d  knownquantity.dat
[10:22:32 davidb@marvin raid]$ md5sum knownquantity.dat
87ddff0ef0a695c46f718a9463014352  knownquantity.dat

Version-Release number of selected component (if applicable):
Linux marvin 2.6.19-1.2911.fc6 #1 SMP Sat Feb 10 15:16:31 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Very.

Steps to Reproduce:
1. Have a big file on an x86_64 machine with the sata_sil device.
2. Copy it to the drive using sata_sil.
3. Check the checksums.

Actual results:
Non-matching checksums.

Expected results:
Matching checksums.

Additional info:
The funny thing is that even though these checksums are off, the corruption probably amounts to just a few bytes. I have VMware virtual machines on the raid and they do function. No, I'm not going to leave it this way -- but that's why this problem has been hard to find until now.

The RAID0 used to be just two SATA drives on the sata_nv, and I don't recall having any trouble. Then I added the sata_sil and the third drive, and that's when weird things started happening. Recognizing the data corruption and pinpointing it is a relatively recent development.

You should know that I have used an identical sata_sil PCI card and the exact same external hard drive to back up a 32-bit machine and have experienced no data corruption. This is verified by the fact that I do my backups using tar -czf /mnt/backup/file.tgz, and following the backup I have started doing a tar -tvzf /mnt/backup/file.tgz to validate the file. I successfully backed up and tested a 161G file, so gzip vouches for the data integrity. There is, however, only a single SATA controller in that 32-bit machine.
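The copy-and-verify procedure from the session above can be sketched as a small shell function. This is a minimal sketch, not the reporter's actual script; the paths passed to it are up to the caller:

```shell
# verify_copy SRC DST
# Copy SRC to DST, then compare md5 checksums of both.
# On healthy hardware the sums always match; on the affected
# sata_sil setup the destination sum differed from the source.
verify_copy() {
    src=$1
    dst=$2
    sum_src=$(md5sum "$src" | awk '{print $1}')
    cp -a "$src" "$dst"
    sum_dst=$(md5sum "$dst" | awk '{print $1}')
    if [ "$sum_src" = "$sum_dst" ]; then
        echo "OK: checksums match"
    else
        echo "CORRUPTION: $sum_src != $sum_dst"
    fi
}
```

With the RAID mounted and the backup drive at /mnt/backup, the reproduction would be something like `verify_copy knownquantity.dat /mnt/backup/knownquantity.dat`.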
So it could be a 64-bit driver issue (likely) or a SATA stack kernel issue (to me, less likely).
Output of lspci on the machine:

00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51G [GeForce 6100] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
03:06.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)
I wanted to do some additional tests on this problem. First, I created a copy of knownquantity.dat on /dev/hda (a known-good, reliable basic ATA drive) and checksummed it there. Then I copied it from /dev/hda -> /dev/sdc1 (/mnt/backup in the above examples; the external SATA drive on sata_sil) and checksummed it from the external drive. Much to my surprise, it checked out. So I did it two more times (copying over top of the same file), and it checked out both times.

Then I said: let's sanity-check sata_nv. I dropped /dev/md0 and re-created it with only the 2 SATA drives on sata_nv (/dev/sda, /dev/sdb). Then I copied knownquantity.dat from /dev/hda to the new /dev/md0 and checksummed it; it checked out. I checksummed it again a couple of times, and again it checked out.

Then I re-created /dev/md0 once more with the third internal drive running on sata_sil and re-copied knownquantity.dat onto it. Immediately after the copy, the md5sum doesn't match.

I am willing to consider the 3-device RAID0 as part of the culprit. I will try the test again with only one drive from sata_nv and one drive from sata_sil; I am betting I will experience the same problem, however. At this point it looks like the problem is related to running multiple SATA drivers, or to how sata_sil handles being part of a RAID0.
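The other half of the testing above is re-reading the same file several times to see whether the checksum is even stable. A minimal sketch of that check (the function name and paths are illustrative, not from the report):

```shell
# check_stability FILE [RUNS]
# md5sum the same file RUNS times (default 5) and report whether
# every read produced the same sum. On healthy hardware all runs
# match; the reporter got four different sums from four reads.
check_stability() {
    file=$1
    runs=${2:-5}
    i=0
    first=
    while [ "$i" -lt "$runs" ]; do
        sum=$(md5sum "$file" | awk '{print $1}')
        if [ -z "$first" ]; then
            first=$sum
        fi
        if [ "$sum" != "$first" ]; then
            echo "UNSTABLE: run $i gave $sum (first was $first)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "STABLE: $runs identical reads"
}
```

An unstable result here points at the read path (controller/DMA) rather than at a one-time bad copy, which is consistent with the raid checksums changing on every run above.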
How much memory is installed on this machine? If it's >4GB, or you have a memory hole, there can be problems. You can use the boot option iommu=soft to work around this (edit /etc/grub.conf and add it to the kernel line, after the "root=" entry for your kernel). Also, corruption like this can be caused by not enough power getting to the drives. Your power supply may not be big enough to drive them all, especially when they are all doing I/O at once.
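For reference, the workaround goes on the kernel line of the relevant stanza in /etc/grub.conf. The stanza below is an illustrative sketch for an FC6 install using the kernel version from this report; the root device and initrd paths are assumptions, not copied from this machine:

```
title Fedora Core (2.6.19-1.2911.fc6)
        root (hd0,0)
        kernel /vmlinuz-2.6.19-1.2911.fc6 ro root=/dev/VolGroup00/LogVol00 rhgb quiet iommu=soft
        initrd /initrd-2.6.19-1.2911.fc6.img
```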
Those are some insightful questions. cat /proc/meminfo gives this number:

MemTotal: 3972512 kB

I did have to enable "Hardware Memory Hole" in the BIOS in order to see the full 4GB, so I will go try your kernel option. As for the power supply, I have a 550W unit; I would have thought that would be enough for 4 internal drives, and the external drive has its own power source. Let me see my mileage with the memory-hole option and I will get back to you.
iommu=soft makes my md5sums work! :)
Thank you very much for your help! You are a star!
Probably a duplicate of bug 223238. We have a patch as of yesterday that fixes the problem, but the root cause appears (at the moment) to be a silicon erratum.

Chip
This problem seems to be associated with the Nvidia chipsets. What was the exact platform that was having the problem (vendor/model, etc.)?

Chip
Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer test releases. We're cleaning up the bug database and making sure important bug reports filed against these test releases don't get lost. It would be helpful if you could test this issue with a released version of Fedora or with the latest development / test release. Thanks for your help and for your patience. [This is a bulk message for all open FC5/FC6 test release bugs. I'm adding myself to the CC list for each bug, so I'll see any comments you make after this and do my best to make sure every issue gets proper attention.]
(In reply to comment #8)
> This problem seems to be associated with the Nvidia chipsets. What was the
> exact platform that was having the problem (Vendor/model, etc).
> Chip

The motherboard is an MSI K8NGM2-L; I believe the chipset is the nForce 410.

I have another confession, however. I didn't realize it at the time, but I think I had an overclocked northbridge-to-southbridge link. Running with iommu=soft made everything run smoothly, except that I would have the occasional "lockup" on I/O and would have to hard-reset the system. Then when I went into the BIOS and did "Load Optimized Defaults", it changed my northbridge-to-southbridge link from 1000MHz to 800MHz. Since I did that, I have not had a single "lockup" and everything appears to be much more stable. I still have iommu=soft. I am willing to take iommu=soft off to test, if there have been some fixes.
Bug 223238 is Top Sekret, so I can't mark this as a dupe of that.
Based on the date this bug was created, it appears to have been reported against rawhide during the development of a Fedora release that is no longer maintained. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained. If this bug remains in NEEDINFO thirty (30) days from now, we will automatically close it. If you can reproduce this bug in a maintained Fedora version (7, 8, or rawhide), please change this bug to the respective version and change the status to ASSIGNED. (If you're unable to change the bug's version or status, add a comment to the bug and someone will change it for you.) Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.
This bug has been in NEEDINFO for more than 30 days since feedback was first requested. As a result we are closing it. If you can reproduce this bug in the future against a maintained Fedora version please feel free to reopen it against that version. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp
This bug may be closed. The issue was resolved with a fixed driver that is in the mainline kernel; I believe it went in in 2.6.3.21 or something like that. It has been a long time. The problem was caused by an out-of-spec driver for the nForce chip. You can read about it in the kernel changelogs if you care, but this issue really is closed.