Red Hat Bugzilla – Bug 158685
VIA KT133A 686B quirk_vialatency() no longer enough to avoid data corruption
Last modified: 2007-11-30 17:11:06 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.8) Gecko/20050512 Fedora/1.0.4-2 Firefox/1.0.4
Description of problem:
There's a disk server at the uni that started displaying rare disk corruption problems as of some months ago.
The MoBo, an Asus A7V133 with 2 VIA IDE channels and 2 Promise IDE channels, has a known bug that causes disk corruption. It was possible, however, to minimize the problem through some BIOS settings, and Linux itself gained the ability to remove the problem with quirk_vialatency().
Although I see in dmesg the message that this function prints when it enables the work around, I'm still seeing disk corruption.
Some time ago, I had two big disks in RAID 1 on the VIA channels, holding system images, home dirs, and a small part of a filesystem holding mirrors of external locations such as download.fedora.redhat.com, and two smaller disks in RAID 1 on the Promise channels, holding the rest of the mirrors filesystem.
As of a few months ago, I started noticing a tendency for rsync to re-fetch big ISO files when I transferred new test releases, after priming the ISOs from our local copy of rawhide. The re-fetch was because the files didn't match their expected checksums.
After checking that the BIOS settings were correct, and that the BIOS was the latest available, I figured I'd try rearranging the disks a bit. I added a Sil680 ATA133 RAID controller to the box, moved the smaller disks to it, and moved the bigger disks to the Promise controller. The problems remained.
But this time I actually looked into the corruption pattern, just to make sure it wasn't memory going bad (we had run a memtest recently, but not for very long, since the box is the main server of the lab).
The corruption pattern I got when copying one big file from the Promise controller to the Sil680 controller was that a pair of 4 KB pages from the original file ended up in two different locations in the copy: the original location, plus another location that, unlike the original, was aligned to an 8 KB boundary. I.e., the commands:
cmp <(dd if=$copy bs=4k skip=22119 count=2) <(dd if=$orig bs=4k skip=22119 count=2)
cmp <(dd if=$copy bs=4k skip=94534 count=2) <(dd if=$orig bs=4k skip=22119 count=2)
succeeded, whereas this one failed, confirming that the duplicate was not present in the original:
cmp <(dd if=$orig bs=4k skip=94534 count=2) <(dd if=$orig bs=4k skip=22119 count=2)
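The cmp commands above check page ranges that are already known. To find candidate ranges in the first place, something like the following sketch could list every 4 KB page at which the copy differs from the original. The function name diff_pages is made up for illustration; it only assumes cmp -l (one line per differing byte, with a 1-based offset in the first column) and awk.

```shell
# diff_pages ORIG COPY
# Print the index of every 4 KB page in which the two files differ.
# The indices can be used directly as "skip=" values for dd bs=4k.
diff_pages() {
    cmp -l "$1" "$2" | awk '
        {
            # cmp -l offsets are 1-based; convert to a 0-based page index
            page = int(($1 - 1) / 4096)
            # print each page only once, even if many bytes in it differ
            if (NR == 1 || page != last) printf "%d\n", page
            last = page
        }'
}

# Example (paths are placeholders):
#   diff_pages "$orig" "$copy"
```

An even page index means the page starts on an 8 KB boundary, which makes it easy to see whether a corrupted run has the alignment described above.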
FWIW, the original failure mode for this VIA quirk I was familiar with was a corruption of 31 bytes out of a 32-byte block, the first byte being correct, the remaining 31 being copied from the corresponding locations of another 32-byte block, generally in the file being copied.
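To tell that older 31-of-32-byte signature apart from whole-page duplication, a rough sketch like the one below could summarize differences per 32-byte block. summarize_blocks32 is a hypothetical helper, not an existing tool. Note that a corrupted byte that happens to equal the original byte is not reported by cmp -l, so a count somewhat below 31 would still be consistent with the old pattern.

```shell
# summarize_blocks32 ORIG COPY
# For every 32-byte block that contains differences, print:
#   <block index> <number of differing bytes> <first-matches|first-differs>
summarize_blocks32() {
    cmp -l "$1" "$2" | awk '
        function emit() {
            if (n) printf "%d %d %s\n", blk, n, (first ? "first-differs" : "first-matches")
        }
        {
            off = $1 - 1                 # 0-based byte offset
            b = int(off / 32)            # 32-byte block index
            if (n && b != blk) { emit(); n = 0; first = 0 }
            blk = b
            n++
            if (off % 32 == 0) first = 1 # first byte of the block differs
        }
        END { emit() }'
}

# Example (paths are placeholders):
#   summarize_blocks32 "$orig" "$copy"
```

Lines of the form "N 31 first-matches" would point at the old quirk; long runs of "N 32 first-differs" covering whole pages would look more like the duplication described above.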
Although this could still be a symptom of the same bug, I can't tell for sure. It might just as well have something to do with RAID, page table management, or anything, really :-(
On the good side, I've got a very similar box at home, which also displayed the original quirk problem, and enough disks I could play with to try to trigger the bug myself, but I could use suggestions on what to try. I haven't observed the problem in the recent past, but then I don't use that box much any more.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Copy a big (100 MB+) file across IDE controllers
2. Compare the original with the copy
Actual Results: They sometimes differ
Expected Results: They shouldn't
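Since the mismatch only shows up sometimes, the two steps above may need to be repeated. A minimal sketch of such a loop follows; copy_and_compare and its arguments are made-up names, and the default of 10 rounds is arbitrary. One caveat: if the file fits in RAM, the comparison may be served from the page cache rather than read back from disk, so a file larger than memory is probably the safer choice.

```shell
# copy_and_compare SRC DST [ROUNDS]
# Copy SRC to DST repeatedly and stop at the first round where they differ.
# Returns 0 if no mismatch was seen, 1 on a mismatch, 2 if the copy failed.
copy_and_compare() {
    src=$1
    dst=$2
    rounds=${3:-10}
    i=1
    while [ "$i" -le "$rounds" ]; do
        cp "$src" "$dst" || return 2
        sync                              # push the copy out to disk
        if ! cmp -s "$src" "$dst"; then
            echo "round $i: copy differs from original"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no mismatch in $rounds rounds"
}

# Example (paths are placeholders, one per controller):
#   copy_and_compare /mnt/promise/big.iso /mnt/sil680/big.iso 20
```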
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem. Please update to this new kernel, and
report whether or not it fixes your problem.
If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.
This bug has been automatically closed as part of a mass update.
It had been in NEEDINFO state since July 2005.
If this bug still exists in current errata kernels, please reopen this bug.
There are a large number of inactive bugs in the database, and this is the only
way to purge them.