Bug 229239
Summary: | Data corruption with multiple SATA devices -- believe it to be x86_64 specific! driver problem | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | David Bennion <davidbennion> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | mattdm, triage, wtogami |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | bzcl34nup | ||
Fixed In Version: | FC7 | Doc Type: | Bug Fix |
Last Closed: | 2008-05-07 01:12:20 UTC | Type: | --- |
Description
David Bennion
2007-02-19 18:16:27 UTC
Output of lspci on the machine:

00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51G [GeForce 6100] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
03:06.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)

I wanted to do some additional tests on this problem. First, I created a copy of knownquantity.dat on /dev/hda (a known good, reliable basic ATA drive) and checksummed it on that drive. Then I copied it from /dev/hda to /dev/sdc1 (/mnt/backup in the above examples; the external SATA drive on sata_sil) and checksummed it from the external drive. Much to my surprise, it checked out. So I did it two more times (copying over the top of the same file). It checked out both times.

Then I said: let's sanity check sata_nv. So I dropped /dev/md0 and re-created it with only the 2 SATA drives (/dev/sda, /dev/sdb). Then I copied knownquantity.dat from /dev/hda to the new /dev/md0 and checksummed it; it checked out. Then I checksummed it again a couple of times. Again, it checked out.

Then I re-created /dev/md0 again with the third internal drive running on sata_sil and re-copied knownquantity.dat onto it. Immediately after the copy, the md5sum doesn't match.

I am willing to consider 3 device raid0 as part of the culprit. I will try the test again with only one drive from sata_nv and one drive from sata_sil. I am betting I will experience the same problem, however. It looks at this point like it is related to multiple SATA drivers or how the sata_sil driver handles being in a raid0.

How much memory is installed on this machine? If it's more than 4 GB, or you have a memory hole, there can be problems. You can use the boot option iommu=soft to work around this (edit /etc/grub.conf and add it after the "root=" entry for your kernel).

Also, corruption like this can be caused by not enough power getting to the drives. Your power supply may not be big enough to drive them, especially when they are all doing I/O at once.

Those are some insightful questions. cat /proc/meminfo gives this number:

MemTotal: 3972512 kB

I did have to enable "HardWare Memory Hole" in order to see the full 4 GB.
So I will go try your kernel option. As far as the power supply goes, I have a 550W unit. I would have thought that would be enough: there are 4 internal drives, and the external drive has its own external power source. Let me see my mileage on the memory hole option and I will get back to you.

iommu=soft makes my md5sums work! :) Thank you very much for your help! You are a star!

Probably a duplicate of bug 223238. We have a patch as of yesterday that fixes the problem, but the root cause appears (at the moment) to be a silicon erratum.

Chip

This problem seems to be associated with the Nvidia chipsets. What was the exact platform that was having the problem (vendor/model, etc.)?

Chip

Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer test releases. We're cleaning up the bug database and making sure important bug reports filed against these test releases don't get lost. It would be helpful if you could test this issue with a released version of Fedora or with the latest development / test release. Thanks for your help and for your patience.

[This is a bulk message for all open FC5/FC6 test release bugs. I'm adding myself to the CC list for each bug, so I'll see any comments you make after this and do my best to make sure every issue gets proper attention.]

The motherboard is an MSI K8NGM2-L. I believe the chipset is the nForce 410.

I have another confession, however. I didn't realize it, but I think I had an overclocked northbridge to southbridge link. Running with iommu=soft made everything run smoothly, with the exception that I would have the occasional "lockup" on I/O and would have to hard-reset the system. Then when I went into the BIOS and did "Load Optimized Defaults", it changed my northbridge to southbridge link from 1000 MHz to 800 MHz. Since I did that, I have not had a single "lockup" and everything appears to be much more stable.

I still have iommu=soft. I am willing to take iommu off to test, if there have been some fixes.
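Before re-testing with or without the workaround, it helps to confirm which mode the running kernel actually booted in. The following is a minimal sketch, not something taken from this report: the helper names are hypothetical, and it assumes the `iommu=` boot parameter may carry several comma-separated options, so it looks for `soft` among them.

```python
def kernel_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def uses_soft_iommu(cmdline):
    """Return True if 'soft' is among the (assumed comma-separated) iommu= options."""
    return "soft" in kernel_params(cmdline).get("iommu", "").split(",")


def running_kernel_uses_soft_iommu(path="/proc/cmdline"):
    """Check the command line of the currently running kernel (Linux only)."""
    with open(path) as f:
        return uses_soft_iommu(f.read())
```

Calling `running_kernel_uses_soft_iommu()` on the test machine before each checksum run makes it clear whether a passing result was obtained with the workaround still in place.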
(In reply to comment #8)
> This problem seems to be associated with the Nvidia chipsets. What was the
> exact platform that was having the problem (Vendor/model, etc).
> Chip

Bug 223238 is Top Sekret, so I can't mark this as a dupe of that.

Based on the date this bug was created, it appears to have been reported against rawhide during the development of a Fedora release that is no longer maintained. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained. If this bug remains in NEEDINFO thirty (30) days from now, we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or rawhide), please change this bug to the respective version and change the status to ASSIGNED. (If you're unable to change the bug's version or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled these issues to this point.

The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.

This bug has been in NEEDINFO for more than 30 days since feedback was first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora version please feel free to reopen it against that version.

The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp

This bug may be closed. The issue was resolved with a fixed driver that is in the mainline kernel; I believe it went into 2.6.3.21 or something like that. It has been a long time. The problem was caused by an out-of-spec driver for the nForce chip. You can read about it in the kernel changelogs if you care, but this issue really is closed.
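For anyone who wants to repeat the copy-and-checksum test from the earlier comments on a newer kernel, here is a minimal sketch of that procedure. The function names are hypothetical and the paths in the usage note are only the ones mentioned in this thread; the original test was done by hand with md5sum.

```python
import hashlib
import shutil


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination, rereads=3):
    """Copy source to destination, then checksum the copy `rereads` times.

    Returns one boolean per re-read: True where the copy matched the source.
    """
    expected = md5_of(source)
    shutil.copyfile(source, destination)
    return [md5_of(destination) == expected for _ in range(rereads)]
```

A call such as `copy_and_verify("/path/on/hda/knownquantity.dat", "/mnt/backup/knownquantity.dat")` would mirror the test above, with the first path standing in for wherever the known-good copy lives. Note that re-reads shortly after the copy may be served from the page cache rather than the disk, so a remount or cache drop between copy and checksum gives a stricter test of the read path.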