Bug 481506
| Field | Value |
| --- | --- |
| Summary | SATA-attached drives give spurious errors (Nvidia nForce 630a chipset) |
| Product | Fedora |
| Component | kernel |
| Version | 10 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | low |
| Reporter | David Tonhofer <bughunt> |
| Assignee | Kernel Maintainer List <kernel-maint> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | kernel-maint, quintela |
| Last Closed | 2009-01-31 22:08:48 UTC |
Description: David Tonhofer, 2009-01-25 19:49:34 UTC
So far, there is no indication that something is wrong with the hardware. On second thought, it may be the F10 software RAID itself that is causing problems? Will test this on a newly installed system.

Hmm, an upstream bug like this one turned out to be bad memory: http://bugzilla.kernel.org/show_bug.cgi?id=12084

Thanks Chuck, I will replace the memory ASAP, with ECC memory, too. I hear it's cheap right now...

Anyway, I tested writing/reading a large file, both on and off software RAID, with a freshly installed Fedora 10. A big file of zeros or random numbers starts to show errors at around 3'000'000 blocks. Will run memtest for the remainder of the day.

Test code:

------ runit.sh ------------------------
#!/bin/bash
# Errors seen at:
#   2'880'000 blocks with urandom
#   2'870'000 blocks with urandom
#   2'890'000 blocks with zero
#   3'000'000 blocks with zero
BLOCKS=2880000
ERROR=0
#SOURCE=/dev/urandom
SOURCE=/dev/zero
OUTFILE=entropy

while [[ $ERROR == 0 ]]; do
  echo -n "Testing $BLOCKS blocks at "
  date
  # Write BLOCKS 512-byte blocks, then hash the file twice and compare
  dd if=$SOURCE of=$OUTFILE count=$BLOCKS
  perl hasher.pl $OUTFILE > /tmp/HASHES
  perl hasher.pl $OUTFILE > /tmp/HASHES2
  diff /tmp/HASHES /tmp/HASHES2
  if [[ $? == 1 ]]; then
    ERROR=1
    echo "Errors found at $BLOCKS blocks"
  else
    let BLOCKS=$BLOCKS+10000
  fi
done
------------------------------------------

------- hasher.pl ------------------------
#!/usr/bin/perl
# Print an MD5 digest for every 10 KiB chunk of the given file.
use Digest::MD5 qw(md5 md5_hex md5_base64);

my $file = $ARGV[0];
open(FILE, $file) or die "Could not open $file: $!\n";
binmode FILE;

my $buffer;
my $size = 1024*10;
my $read;
my $pos = 0;
while (($read = read(FILE, $buffer, $size)) > 0) {
  my $digest = md5_hex($buffer);
  my $end = $pos + $read - 1;
  print "$digest [$pos,$end] $read\n";
  $pos += $read;
}
close(FILE);
------------------------------------------

memtest ran for 32 passes with no errors. Will try what happens on Ubuntu server next.

Running the above "dump-and-check" tests on an Ubuntu 8.10 server in recovery mode also yields errors, though only with files reaching ~2 GB in size (roughly the size of RAM). I tested with swap off and with a RAID mirror reduced to a single device; no improvement. Several passes of "hexdump" on the same file consistently yield the same result (ditto for md5sum). This seems to indicate that it is writing the file which fails, not reading it.
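To make the write-path vs. read-path distinction more explicit, one could hash the file once while it is still in the page cache and once more after forcing the kernel to re-read it from disk. The following is only a sketch and is not part of the original test setup (the file name and block count are illustrative); it assumes GNU coreutils and needs root for /proc/sys/vm/drop_caches:

---------------------
#!/bin/bash
# Hypothetical check, not from the original report: compare the digest of
# the freshly written file (served from the page cache) with the digest
# taken after the cache has been dropped (served from the disk).
OUTFILE=entropy          # illustrative file name
BLOCKS=3000000           # 512-byte dd blocks, illustrative
BYTES=$((BLOCKS * 512))

dd if=/dev/zero of=$OUTFILE count=$BLOCKS conv=fdatasync

# Reference digest of an all-zero file of the same size
EXPECTED=$(head -c $BYTES /dev/zero | md5sum | cut -d' ' -f1)

# First read: the data should still be in the page cache
CACHED=$(md5sum $OUTFILE | cut -d' ' -f1)

# Force the next read to come from the disk itself
sync
echo 3 > /proc/sys/vm/drop_caches

ONDISK=$(md5sum $OUTFILE | cut -d' ' -f1)

echo "expected: $EXPECTED"
echo "cached:   $CACHED"
echo "on disk:  $ONDISK"
# cached != expected               -> data was already bad in memory
# cached == expected, on disk bad  -> the disk write/read path is suspect
---------------------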
For example, a zeros-only file of ~1.9 GB size yields:

---------------
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
16fcf40 0000 0000 0000 0000 0020 0000 0000 0000
16fcf50 0000 0000 0000 0000 0000 0000 0000 0000
16fcf60 0000 0000 0000 0000 0024 0000 0000 0000
16fcf70 0000 0000 0000 0000 0000 0000 0000 0000
16fcf80 0000 0000 0000 0000 0024 0000 0000 0000
16fcf90 0000 0000 0000 0000 0000 0000 0000 0000
16fcfa0 0000 0000 0000 0000 0024 0000 0000 0000
16fcfb0 0000 0000 0000 0000 0000 0000 0000 0000
16fcfc0 0000 0000 0000 0000 0020 0000 0000 0000
16fcfd0 0000 0000 0000 0000 0000 0000 0000 0000
16fcfe0 0000 0000 0000 0000 0031 0000 0000 0000
16fcff0 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a660 0000 0000 0000 0000 0004 0000 0000 0000
4df0a670 0000 0000 0000 0000 0000 0000 0000 0000
4df0a680 0000 0000 0000 0000 0020 0000 0000 0000
4df0a690 0000 0000 0000 0000 0000 0000 0000 0000
4df0a6a0 0000 0000 0000 0000 0004 0000 0000 0000
4df0a6b0 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a6e0 0000 0000 0000 0000 0021 0000 0000 0000
4df0a6f0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a700 0000 0000 0000 0000 0024 0000 0000 0000
4df0a710 0000 0000 0000 0000 0000 0000 0000 0000
4df0a720 0000 0000 0000 0000 0020 0000 0000 0000
4df0a730 0000 0000 0000 0000 0000 0000 0000 0000
4df0a740 0000 0000 0000 0000 0020 0000 0000 0000
4df0a750 0000 0000 0000 0000 0000 0000 0000 0000
4df0a760 0000 0000 0000 0000 0030 0000 0000 0000
4df0a770 0000 0000 0000 0000 0000 0000 0000 0000
4df0a780 0000 0000 0000 0000 0024 0000 0000 0000
4df0a790 0000 0000 0000 0000 0000 0000 0000 0000
*
4df0a7c0 0000 0000 0000 0000 0026 0000 0000 0000
4df0a7d0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a7e0 0000 0000 0000 0000 0037 0000 0000 0000
4df0a7f0 0000 0000 0000 0000 0000 0000 0000 0000
4df0a800 0000 0000 0000 0000 0027 0000 0000 0000
4df0a810 0000 0000 0000 0000 0000 0000 0000 0000
4df0a820 0000 0000 0000 0000 0038 0000 0000 0000
4df0a830 0000 0000 0000 0000 0000 0000 0000 0000
*
73f78000
---------------

This is a bit scary. I wonder whether I should replace the motherboard. Or the CPU, or maybe just the SATA cables? Oh sh*t. I will try this with the PATA subsystem, and just the SATA raw device as a last test.

A fitting comment by "Arno Wagner" in an old thread:

-------------
http://osdir.com/ml/linux.kernel.device-mapper.dm-crypt/2004-05/msg00038.html

Yes. My guess now would be that the mainboard is not well-engineered. Maybe some parallel signal lines are not exactly the same length, or some lines are too long. Maybe the chipset is not cooled well. Maybe they did not use enough decoupling capacitors. Any chance of getting another CPU and testing with that? Any chance of getting other memory and testing with that?

Of course the fundamental problem is that these systems have gotten far too complex. I remember owning a 386SX mainboard that had 6 semiconductors on it, including the CPU and one LED.
-------------

It's the memory for sure, so this bug is INVALID. I just tested both of the RAM modules in the system separately, and one of them results in errors, the other does not. memtest86 still finds no problems; I will try with the latest version.
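To enumerate exactly which bytes of such a "zeros-only" file are nonzero, instead of scanning hexdump output by hand, GNU cmp can compare the file directly against /dev/zero. A minimal sketch, assuming GNU diffutils and coreutils (the file name is illustrative and not taken from the original report):

---------------------
#!/bin/bash
# Hypothetical helper, not from the original report: list every nonzero
# byte in a file that is supposed to contain only zeros.
FILE=entropy
SIZE=$(stat -c %s "$FILE")

# -l prints the 1-based offset and the octal value of each differing byte;
# -n limits the comparison to SIZE bytes, so the endless /dev/zero stream
# stops where the file ends. Exit status 0 means the file is all zeros.
cmp -l -n "$SIZE" "$FILE" /dev/zero

# Quick count of corrupted bytes:
cmp -l -n "$SIZE" "$FILE" /dev/zero | wc -l
---------------------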
The script to dump a zero-only file and check its contents has been refined/reduced to:

---------------------
#!/bin/bash
BLOCKS=4200000
ERROR=0
SOURCE=/dev/zero
OUTFILE=entropy
BLOCKSSTEP=10000

# Write the initial file full of zeros
echo "Writing initial file of $BLOCKS blocks"
dd if=$SOURCE of=$OUTFILE count=$BLOCKS conv=fdatasync

while [[ $ERROR == 0 ]]; do
  echo -n "Testing $BLOCKS blocks at "
  date
  # hexdump the file full of zeros
  HEXDUMP=hexdump_`date +%Y%m%d_%H%M%S`
  hexdump $OUTFILE > $HEXDUMP
  # Keep only the hexdump lines that contain data fields; for an
  # all-zero file exactly one such line remains
  LINE=`cat $HEXDUMP | cut --fields=1- --delimiter=" " --only-delimited`
  if [[ $LINE != "0000000 0000 0000 0000 0000 0000 0000 0000 0000" ]]; then
    ERROR=1
    echo "Errors found in $HEXDUMP ... dumping a second time"
    hexdump $OUTFILE > ${HEXDUMP}_repeat
  else
    # Append another BLOCKSSTEP blocks of zeros and loop
    let BLOCKS=$BLOCKS+$BLOCKSSTEP
    dd if=$SOURCE of=$OUTFILE \
       count=$BLOCKSSTEP \
       conv=fdatasync,notrunc \
       status=noxfer \
       oflag=append
  fi
done
---------------------

Thus, we start with a file containing only zeros and keep appending zeros to its end until "hexdump" reports nonzero content. "hexdump" is then applied a second time for good measure, generally producing output indicating that the file does, after all, contain only zeros.

That will teach me to use non-ECC RAM.
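Once ECC memory is actually installed, it may be worth confirming that error correction is active and that corrected errors get reported. A sketch of such a check, not part of the original report, assuming dmidecode is installed and the kernel's EDAC driver for the memory controller is loaded (the sysfs paths can vary between kernels):

---------------------
#!/bin/bash
# Hypothetical follow-up, not from the original report. Needs root.

# What the firmware reports about error correction on the memory array
dmidecode -t memory | grep -i 'Error Correction Type'

# Per-memory-controller counts of corrected (ce) and uncorrected (ue)
# errors, exposed by the EDAC subsystem when a matching driver is loaded
for mc in /sys/devices/system/edac/mc/mc*; do
    [ -d "$mc" ] || continue
    echo "$mc: ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
done
---------------------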