Bug 135017
Summary: | Data corruption in memory mapped file on SATA drive | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Eric J Korpela <korpela> |
Component: | kernel | Assignee: | Jeff Garzik <jgarzik> |
Status: | CLOSED NOTABUG | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 3.0 | CC: | adolfo, jgarzik, korpela, nfaerber, peterm, petrides, ppokorny, riel |
Hardware: | x86_64 | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2005-03-02 20:06:01 UTC | Type: | --- |
Description
Eric J Korpela
2004-10-08 01:06:05 UTC
Adding jgarzik to cc: list. Jeff - does this tickle any memories or suggest any obvious places to go look? Whole pages getting corrupted I can understand, but cache lines are just a bit odd to me...

Has this been reproduced on more than one machine? I ask because it smells like bad RAM or bad cache RAM to me.

memtest86 reports no problems after several runs. Standalone disk tests run overnight with no errors. The problem cannot be reproduced on SCSI disks on the same machine. It definitely appears to be related to a page being written to disk. The only thing I can think of would be something related to L1 or L2 cache not being fully flushed to main RAM before a page is written to disk. I don't know enough about Linux device drivers and the SMP kernel to know if this is possible. I only have one machine to test on at present. Any suggestions as to where I could get access to a similar machine?

Do you get the same results if you limit the memory to 4G (i.e. boot with "mem=4G")? This would suggest whether it may be an IOMMU issue...

Actually mem=1G would probably be a better test (but in general, I agree w/ Jim's comment #5).

I'm out of the office today, but I will try to get back in to test it ASAP.

Using mem=1G did in fact prevent the problem from occurring. One thing I do now notice is that even though IOMMU is enabled in the BIOS, I get messages like the following in the boot log.

Oct 10 14:57:49 zork kernel: Checking aperture...
Oct 10 14:57:49 zork kernel: CPU 0: aperture @ 0 size 32768 KB
Oct 10 14:57:49 zork kernel: Your BIOS doesn't leave a aperture memory hole
Oct 10 14:57:49 zork kernel: Please enable the IOMMU option in the BIOS setup
Oct 10 14:57:49 zork kernel: Mapping aperture over 65536 KB of RAM @ 8000000 and elsewhere
Oct 10 14:57:50 zork kernel: PCI-DMA: aperture base @ 8000000 size 65536 KB
Oct 10 14:57:50 zork kernel: PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

Since the last report the drive had gotten corrupted enough that I needed to reformat and reinstall. Additional potential hints...

mem=4G causes a kernel panic for the SMP kernel, but is OK with non-SMP.
mem=4G-64M is OK.
iommu=off causes a kernel panic.
iommu=merge causes no change in errors.
iommu=fullflush also causes no change.

I have seen an extreme case of this that I believe is related. With Western Digital drives and more than 4GB of RAM, the system will suffer extreme file corruption. In most cases, the RHEL 3 Update 3 install will appear to succeed, but upon reboot a lot of file corruption occurs and eventually renders the system useless.

I have seen the problem on three different configs that had only two things in common: a Tyan S2885 w/ onboard SiI3114 SATA and a Western Digital drive. I have checked with Eric and his drive is also Western Digital.

All my RAM configurations passed memtest86+:
8x 1GB ATP
8x 1GB Corsair
8x 2GB ATP (and 4x of the same 2GB ATP)

We have used three different S2885 motherboards, each with a different video card. One system had an add-in sound card. One had an add-in 3ware SATA RAID card. One had no add-in PCI cards.

Other drives (Seagate non-blacklist and Maxtor) do not suffer from the extreme (can't reboot after install) case of this problem. We are working to determine whether these drives suffer from the more subtle case exhibited by Eric's C program.

The same Western Digital drive will not show the extreme case when attached to a 3ware RAID card.
The add-in SIIG 3114 does not suffer from the extreme (can't reboot after install) case of this issue. We are checking to see if it passes Eric's program.

The Western Digital drives that suffer extreme failure:
WD360GD-00FNA0 (WD360 Raptor)
WD2500JD-55HBB0
WD2500JD-00HBB0
WD1600JD-00HBB0

The on-board Silicon Image 3114 controller is on the 32-bit/33MHz PCI bus from the AMD-8111 south bridge. Add-in 3Ware and SIIG controllers were probably plugged into a 64-bit slot on the AMD-8131 PCI-X bridge. Could that difference be important?

I concur with the comment that it is restricted to Western Digital drives. Replacement of the WD drive with a Seagate of equivalent capacity has solved the problem on the server where it was initially reported.

This is the "SATA 4GB boundary corruption" problem, which was recently fixed.

Can you provide more information on this "4GB boundary corruption"? Is there another bugzilla tracking that problem?

Western Digital, Tyan and Silicon Image were able to reproduce the problem, and WD reported that Silicon Image said there was an issue with the 3114 chip and memory accesses. Silicon Image and Tyan released a new BIOS for the motherboard (with new 3114 BIOS code) that solved the problem in the test system.

On x86-64 (EM64T only) and >= 4GB of memory, memory corruption would occur. However, looking at the bug report again, I see that it's AMD64, not EM64T. Nonetheless, you say a new BIOS fixed things, so I'll leave it closed.

Since this seems to have been a BIOS/firmware issue, I'm closing it as NOTABUG (not a kernel bug, that is).

For closure, the specific version of Silicon Image Option ROM BIOS code needed is:
Silicon Image Oprom v5.0.48
Tyan released a new BIOS (Feb, 2005) for the S2885 and S4882 with that version of the Option ROM. The S2882 motherboard has a BIOS with 5.0.44, which may also be OK?
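
For readers who want to see what a check of this kind might look like: the C test program mentioned in the thread is not included in this report, so the following is only a rough Python sketch of the general idea, not a reconstruction of it. It fills a file through a shared memory mapping with a pattern in which every 8-byte word holds its own file offset, flushes the mapping, then re-reads the file with ordinary read() calls and reports mismatching runs at cache-line granularity. The names write_pattern, find_bad_runs and expected_bytes are made up for this illustration, and the 64-byte cache-line size is an assumption. Note that re-reading right away is normally served from the page cache, so to exercise the disk path the file would have to be larger than RAM or the caches dropped first.

```python
import mmap
import os
import struct

WORD = 8          # bytes per pattern word
CACHE_LINE = 64   # ASSUMED cache-line size; not stated in the thread
CHUNK = 1 << 20   # work in 1 MiB pieces (a multiple of WORD and CACHE_LINE)


def expected_bytes(start, length):
    """Pattern: each 8-byte word stores its own file offset (little-endian)."""
    first = start // WORD
    count = length // WORD
    return struct.pack(f"<{count}Q",
                       *(w * WORD for w in range(first, first + count)))


def write_pattern(path, size):
    """Fill the file via a shared mapping, then flush it.

    size must be a non-zero multiple of WORD.
    """
    with open(path, "w+b") as f:
        f.truncate(size)
        with mmap.mmap(f.fileno(), size) as mm:
            for base in range(0, size, CHUNK):
                n = min(CHUNK, size - base)
                mm[base:base + n] = expected_bytes(base, n)
            mm.flush()           # msync(): ask the kernel to write dirty pages
        os.fsync(f.fileno())


def find_bad_runs(path, size):
    """Re-read with read() and return [(offset, length), ...] of bad runs."""
    runs = []
    with open(path, "rb") as f:
        for base in range(0, size, CHUNK):
            n = min(CHUNK, size - base)
            data = f.read(n)
            want = expected_bytes(base, n)
            if data == want:
                continue
            # Narrow the mismatch down to cache-line-sized blocks.
            for off in range(0, n, CACHE_LINE):
                if data[off:off + CACHE_LINE] != want[off:off + CACHE_LINE]:
                    pos = base + off
                    if runs and runs[-1][0] + runs[-1][1] == pos:
                        # Extend the previous run if this block is adjacent.
                        runs[-1] = (runs[-1][0], runs[-1][1] + CACHE_LINE)
                    else:
                        runs.append((pos, CACHE_LINE))
    return runs
```

Run lengths that are small multiples of CACHE_LINE rather than of mmap.PAGESIZE would match the "cache lines, not whole pages" symptom discussed at the top of the thread. Because each word encodes its own offset, the contents of a bad run can also hint at whether data from elsewhere in the file was written to the wrong place.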